# Large Language Modelling: Mining CX from Social Media

Technologies Used:

![PyTorch](https://img.shields.io/badge/-PyTorch-EE4C2C?logo=pytorch&logoColor=white)
![Hugging Face](https://img.shields.io/badge/-Hugging%20Face-FFAC45?logo=huggingface&logoColor=white)
![Sentence Transformers](https://img.shields.io/badge/-Sentence%20Transformers-00B888?logo=pytorch&logoColor=white)
![Altair](https://img.shields.io/badge/-Altair-7F7FFF?logo=altair&logoColor=white)
![BERTopic](https://img.shields.io/badge/-BERTopic-FFAC45?logo=huggingface&logoColor=white)
![Tweet Preprocessor](https://img.shields.io/badge/-Tweet%20Preprocessor-00B888?logo=python&logoColor=white)
![UMAP](https://img.shields.io/badge/-UMAP-7F7FFF?logo=python&logoColor=white)
![Python](https://img.shields.io/badge/-Python-3776AB?logo=python&logoColor=white)

# 1. BERT: Bidirectional Encoder Representations from Transformers

#### **Let's work with a masked language model (MLM) called BERT!**

## 1.1 Quick Recap
![From NLP to NLU](https://mapXP.app/MBA742/BERT1.png "BERT Explained 1")

![From NLP to NLU](https://mapXP.app/MBA742/BERT2.png "BERT Explained 2")

![From NLP to NLU](https://mapXP.app/MBA742/BERT4.png "BERT Explained 4")

![From NLP to NLU](https://mapXP.app/MBA742/BERT5.png "BERT Explained 5")

![From NLP to NLU](https://mapXP.app/MBA742/BERT6.png "BERT Explained 6")

![From NLP to NLU](https://mapXP.app/MBA742/BERT7.png "BERT Explained 7")

![From NLP to NLU](https://mapXP.app/MBA742/BERT8.png "BERT Explained 8")

![From NLP to NLU](https://mapXP.app/MBA742/BERT9.png "BERT Explained 9")

## 1.2 Sentence Embedding with SentenceTransformers

> SBERT: a special version of BERT that was trained for sentence similarity

![SBERT](https://www.sbert.net/_static/logo.png "Sentence BERT")

- SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings.
  * Check out the website with code examples: https://www.sbert.net/index.html
  * Here is the GitHub repository: https://github.com/UKPLab/sentence-transformers
  * Read the research article for details: https://arxiv.org/abs/1908.10084


- You can use this framework to compute sentence / text embeddings for more than 100 languages.
- Sentence embeddings can be compared with cosine-similarity to find sentences with a similar meaning.
  * Semantic search
  * Paraphrase mining
  * Clustering
  * Visualization in maps

The framework is based on PyTorch and Transformers
   * Large collection of pre-trained models tuned for various tasks.
   * Easy to fine-tune your own models.

***That's exactly what we need to analyze entire tweets so that we can discover topics on X(Twitter)***

## 1.3 Install Sentence Transformers (SBERT)

- You can easily download and install it on CoLab or your own computer using `pip install`:
```
!pip install -U sentence-transformers
```

- To install SBERT on an Apple computer with Apple Silicon (M1 and M2 chips), I recommend that you use Conda:

```
conda install -c conda-forge sentence-transformers
```

- We will also use **PyTorch**
 * Open source machine learning framework that accelerates the path from research prototyping to production deployment https://pytorch.org/
 * **PyTorch is already installed on CoLab (Torch)**
 * To install PyTorch on your computer, visit https://pytorch.org/get-started/locally/
![PyTorch](https://www.mapXP.app/MBA742/Pytorch_logo.jpg "PyTorch")

### ***GPU Support on CoLab***
- The code of this notebook will run on CPUs
- To make things faster, we can leverage GPUs
- CoLab grants us free access to GPUs
  - Click on **"Runtime"** in the menubar
  - Click on **"Change Runtime type"** in the dropdown
  - Select **"GPU"** as Hardware accelerator
  - Click **"Save"** button
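
Whether a GPU is actually visible to PyTorch can also be checked from code. The sketch below is one way to do that. `pick_device` is a hypothetical helper (it is not part of PyTorch or SBERT), the `mps` check assumes PyTorch 1.12 or newer, and the `try`/`except` only keeps the cell runnable where PyTorch is not installed.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Return a device name: CUDA GPU first, then Apple GPU (MPS), then CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"

try:
    import torch
    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
except ImportError:
    # PyTorch is not installed, so only the CPU can be used
    device = "cpu"

print(f"Embeddings would be computed on: {device}")
```

If this prints `cpu` on CoLab after you selected a GPU runtime, the runtime change most likely has not been saved yet.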


In [None]:
# 1. Install SentenceTransformers (SBERT) if it is not already installed
#!pip install -U sentence-transformers
#!pip install --upgrade tensorflow

## 1.4 Download a Pre-Trained SBERT Model

Now you need to
1. import SBERT, and
2. download a pre-trained model.
  * **Models can be very large, i.e., over 1GB of data!**  
  * There are over 4,000 pre-trained models available:
    * https://www.sbert.net/docs/pretrained_models.html
    * https://huggingface.co/models?library=sentence-transformers&p=3&sort=downloads
  * Pre-trained models are for different languages or topics such as
    * Patents (PatentSBERT)
    * Medical Claims and Fake News (BioBERT)
    * English-German Cross-Language (Cross En-De RoBERTa)

In [None]:
# 1. Import required libraries
from sentence_transformers import SentenceTransformer, util
import torch

# 2. Load a pre-trained SBERT model (this one is rather small and has "only" 384 dimensions)
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# OPTIONAL: If you are running a MacBook with an M chip that incorporates a GPU, you can leverage that GPU when you load the pre-trained SBERT model as follows:
# embedder = SentenceTransformer('all-MiniLM-L6-v2', device='mps')

## 1.5 Embed Sentences with SBERT

In [None]:
# 1. Let's create a word, sentence and even paragraph:
words = "Artificial Intelligence"

sentence = "AI is transforming business in remarkable ways!"

paragraph = "Artificial intelligence is a rapidly advancing field that integrates concepts from computer science, " \
               "mathematics, and cognitive psychology to create systems capable of performing tasks traditionally " \
               "requiring human intelligence. These systems, powered by machine learning and deep learning techniques, " \
               "are now driving innovations in areas such as healthcare, finance, transportation, and entertainment. " \
               "By enabling machines to learn from data, recognize patterns, and make decisions, AI has become a " \
               "cornerstone of modern technology, offering transformative potential while also raising important " \
               "ethical and societal questions about its implications for the future."

corpus = [words, sentence, paragraph]

# 2. Now we can easily embed them with SBERT
if torch.cuda.is_available():
  print("Embedding on GPU\n")
corpus_embeddings = embedder.encode(corpus, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

# 3. Move embeddings from GPU to CPU IF you are using a GPU!
if torch.cuda.is_available():
  print("Moving Embeddings from GPU to CPU\n")
  corpus_embeddings = corpus_embeddings.cpu()

# 4. Let's look at corpus_embeddings and see that they are vectors, that is, latent feature vectors
for text, vector in zip(corpus, corpus_embeddings):
  print(f"Text: {text}")
  print(f"Embedding size: {len(vector)}")
  print(f"Embedding: [{', '.join(map(str, vector[:5].numpy())) +',...'}]\n")

***Question: What do you notice?***

## 1.6 Sentence Similarity
We can use latent feature vectors to determine how similar sentences are!

- The output of SBERT is a matrix of dimension N × 384 (for the model we used!)
- Each of the N sentences is represented by a feature vector of size 384
- When the vectors are normalized (which is the case for the pre-trained model), the inner product of encodings can be treated as a similarity matrix
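
To see why normalization matters, here is a minimal sketch in plain Python (no SBERT needed). It uses two made-up 3-dimensional vectors as stand-ins for embeddings; the helper names `normalize`, `inner`, and `cosine` are my own and are not part of any library. It compares the cosine similarity of the raw vectors with the inner product of their normalized versions.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner(a, b):
    """Inner (dot) product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

# Two made-up 3-dimensional "embeddings" (real SBERT vectors here have 384 entries)
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

print(round(cosine(u, v), 4))                         # cosine similarity
print(round(inner(normalize(u), normalize(v)), 4))    # inner product after normalizing
```

Both lines print the same value, which is why a plain inner product is enough once the embeddings have unit length.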

In [None]:
# 1. Write several sentences of different topics for restaurants

corpus = [
    # Good Service
    "The waiter at the restaurant was very nice",
    "The restaurant had great service",
    "The service is great because of the nice waiters",

    # Good Food
    "Very flavorful chicken!",
    "I love the taste of the food.",
    "They make yummy food!",

    # Good Ambience
    "The interior is amazing.",
    "I like the way it looks inside.",
    "The ambience of the place is wonderful.",

    # Food Delivery
    "They deliver all orders to your door.",
    "You can order all items for delivery.",
    "They delivered the wrong items!"
]
for item in corpus:
    print(item)

In [None]:
# 2. Embed sentences with SBERT
corpus_embeddings = embedder.encode(corpus, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

# 3. Move embeddings from GPU to CPU
if torch.cuda.is_available():
  print("Moving Embeddings from GPU to CPU\n")
  corpus_embeddings = corpus_embeddings.cpu()

In [None]:
# 4. Look at an embedded sentence
corpus_embeddings[0]

In [None]:
# Optional: Normalize embeddings to 1 if not already done by pre-trained model
# corpus_embeddings = util.normalize_embeddings(corpus_embeddings)

In [None]:
# 5. Generate a Similarity Matrix of Embeddings
import numpy as np
sim_matrix = np.inner(corpus_embeddings, corpus_embeddings)
print(sim_matrix[0:5,0:5])

***What do you notice about the matrix above?***

## 1.7 Visualizing Sentence Similarity
Let's generate a heatmap to see whether sentences that refer to similar topics also have similar vectors.

In [None]:
# 1. Truncate sentences to create labels
corpuslabels = [elem[:30] for elem in corpus]
for item in corpuslabels:
    print(item)

In [None]:
# 2. Import needed packages
import seaborn as sns

# 3. Let's visualize the similarities in a heatmap to test whether we can discover topics
# Define a function that creates a heatmap for sentence similarity
def plot_similarity(labels, sim_, rotation=90):
  sns.set(rc = {'figure.figsize':(10,8)}, font_scale=1.5)
  g = sns.heatmap(sim_,
      xticklabels=labels, yticklabels=labels,
      vmin=0, vmax=1, cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

# 4. Call the function to show the heatmap
plot_similarity(corpuslabels, sim_matrix, 90)

## 1.8 Topic Discovery with Cluster Analysis

### Cluster embedded vectors using k-Means

In [None]:
# 1. Import package
from sklearn.cluster import KMeans

# 2. Initializing KMeans
kmeans = KMeans(n_clusters=4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 42)

# 3. Fitting with inputs
kmeans = kmeans.fit(corpus_embeddings)

# 4. Predicting the clusters
labels = kmeans.predict(corpus_embeddings)
print(labels)
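
The printed label array is hard to read on its own. As a rough sketch, the sentences can be grouped by their cluster label to see which ones landed together. `group_by_cluster` is a hypothetical helper, and the toy lists below only stand in for `corpus` and `labels`; the label values are made up for illustration.

```python
from collections import defaultdict

def group_by_cluster(sentences, cluster_labels):
    """Map each cluster label to the list of sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, cluster_labels):
        groups[int(label)].append(sentence)
    return dict(groups)

# Toy stand-ins for `corpus` and the `labels` array returned by k-means
toy_sentences = ["The restaurant had great service",
                 "They make yummy food!",
                 "The waiter at the restaurant was very nice",
                 "I love the taste of the food."]
toy_labels = [1, 0, 1, 0]

for label, members in sorted(group_by_cluster(toy_sentences, toy_labels).items()):
    print(label, members)
```

With the notebook's own variables, `group_by_cluster(corpus, labels)` would do the same for the 12 restaurant sentences.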

## 1.9 Explore Sentence Similarity in an Interactive Map

We can visualize the relationships between our 12 sentences in a map using t-SNE:

1. Reduce the dimensionality of the vectors from 384 to 2 with t-SNE
2. Visualize the similarity of sentences in a scatterplot.

**Note:** To save time in class, we will not run t-SNE multiple times with different seeds to find a better local optimum. In practice, however, you should run t-SNE more than once with different seeds (i.e., random states) and pick the solution with the lowest cost!
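
The restart-and-keep-the-best idea from the note can be sketched as follows. Both `best_of_seeds` and `toy_run` are hypothetical names, and `toy_run` only fakes a fit with made-up costs so that the example runs without scikit-learn.

```python
def best_of_seeds(run, seeds):
    """Call run(seed) for every seed and keep the result with the lowest cost.

    run must return a (solution, cost) pair.
    """
    best_seed, best_solution, best_cost = None, None, float("inf")
    for seed in seeds:
        solution, cost = run(seed)
        if cost < best_cost:
            best_seed, best_solution, best_cost = seed, solution, cost
    return best_seed, best_solution, best_cost

def toy_run(seed):
    """Stand-in for one t-SNE fit: returns a fake solution and a made-up cost."""
    fake_costs = {0: 0.42, 1: 0.37, 2: 0.51}
    return f"solution-{seed}", fake_costs[seed]

seed, solution, cost = best_of_seeds(toy_run, seeds=[0, 1, 2])
print(seed, solution, cost)

# With scikit-learn, run(seed) could fit TSNE(random_state=seed) and return
# the fitted coordinates together with the estimator's kl_divergence_ attribute.
```

For t-SNE, the cost is the Kullback-Leibler divergence that the algorithm minimizes, so a lower value means the map preserves the original neighborhoods better.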

In [None]:
# 1. Import packages
from sklearn.manifold import TSNE

# 2. Instantiate and fit t-SNE, giving an array of x,y coordinates
X_tsne = TSNE(n_components=2, verbose=1, perplexity=5, max_iter=1000, learning_rate=50, init='random', random_state=42
              ).fit_transform(corpus_embeddings)

**We will use Altair to create an interactive Map**

Altair is a powerful tool for interactive visualization in Python https://altair-viz.github.io/index.html

In [None]:
# 1. Import Altair
import altair as alt

# 2. Create a new DataFrame that holds all the information we need for our map
import pandas as pd
source = pd.DataFrame(
    {'x': X_tsne[:, 0],
     'y': X_tsne[:, 1],
     'txt': corpus,
     'Topic' : labels
     #'size'  : 10
    })

# 3. Define Bubbles on Map
bubbles = alt.Chart(source).mark_circle(size=400).encode(
    x=alt.X('x:Q', axis=alt.Axis(title="not directly interpretable", grid=False, labels=False),scale=alt.Scale(domain=[min(source.x)-1, max(source.x)+1])),
    y=alt.Y('y:Q', axis=alt.Axis(title="not directly interpretable", grid=False, labels=False),scale=alt.Scale(domain=[min(source.y)-1, max(source.y)+1])),
    #size='size',
    color = 'Topic:N',
    tooltip=[alt.Tooltip('txt', title='Tweet'),                            # We can include a lot of information in the tooltips (mouseover pop-up)
             alt.Tooltip('Topic', title='Topic')
            ]
)

# 4. Define Labels next to Bubbles on Map
text = alt.Chart(source).mark_text(
    align='left',
    baseline='middle',
    dx=10 # offset label in x coordinate
).encode(
    x='x:Q',
    y='y:Q',
    text='txt',
    #color = 'Topic:N'
)

# 5. Visualize Bubbles and Labels in an interactive Map
bubbles.encode().interactive().properties(height=700,width=700,
                                          title="Restaurant Experiences") + text

## 1.10 Search for Similar Sentences
- Sometimes, we want to explore similar sentences to learn more about a text corpus.
- Finding similar sentences is easy with SBERT: ***A search utility comes with SBERT!***
- Can be helpful when you investigate a particular topic, person, brand, firm, etc.
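
Conceptually, such a search scores every corpus vector against the query and keeps the best matches. The brute-force sketch below illustrates this with made-up 2-dimensional vectors. `top_k_search` is a hypothetical helper, not SBERT's own implementation; it only mimics the result format (a list of hits with `corpus_id` and `score`) that we will get from SBERT.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_search(query_vec, corpus_vecs, top_k=3):
    """Return the top_k corpus entries most similar to the query, best first."""
    scored = [{"corpus_id": idx, "score": cosine(query_vec, vec)}
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]

# Made-up 2-dimensional vectors standing in for sentence embeddings
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]

for hit in top_k_search(toy_query, toy_corpus, top_k=2):
    print(hit["corpus_id"], round(hit["score"], 4))
```

Scoring every pair like this is fine for a few thousand tweets; much larger corpora usually call for an approximate nearest-neighbor index instead.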

In [None]:
# 1. Define query sentences:
queries = ['I hate their awful pasta.',
           'The floor is stained and dirty.',
           'The waiter was so cute.']

# 2. Embed query sentences with SBERT
query_embedding = embedder.encode(queries, batch_size=64, show_progress_bar=True, convert_to_tensor=True)

In [None]:
# 3. Use semantic search function to find top_3 similar sentences to each query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
hits

In [None]:
# 4. Pick a query and print top_k sentences including their original index and the similarity score to the query
_hits = hits[0]      # Get the hits for the first query (index 0 in hits)
for hit in _hits:
    print(hit['corpus_id'], corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

# 2. Topic Discovery on X(Twitter) with Sentence Embedding using LLMs

## 2.1 Hospitality during the COVID-19 Pandemic

- Now that we know how to embed entire sentences, let's use sentence embedding to ***discover what people are tweeting about***
- In today's class, we will examine what people have to say about ***Graduate Hotels***
  - Privately owned collection of boutique hotels
  - College themed
  - Over 30 locations worldwide
  - We have one right on Franklin Street
- I collected data on variations of the term "Graduate Hotels" from X(Twitter)

## 2.2 Load Tweets

In [None]:
# 1. Connect your Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 2. Navigate to the folder where the files for Class 07 are:
%cd /content/drive/MyDrive/488/Class07

# 3. See what is in the folder: Special shell command to view the files in the current directory of the notebook environment
!ls

In [None]:
# 1. Load file into DataFrame (Data is in `data` subfolder -- upload to your Google Drive if necessary)
tweets = pd.read_json('Graduate2022-2k.json', lines=True)

# 2. Keep only certain columns
tweets = tweets.filter(['id','content','date'], axis=1)
tweets.rename(columns={'content':'Tweet'}, inplace=True)
tweets

## 2.3 Pre-process Tweets: **The Tweet Preprocessor**

Let's leverage the work of someone else (https://github.com/s) to preprocess our tweets. They created a [tweet preprocessor](https://github.com/s/preprocessor) as part of their bachelor thesis on sentiment analysis.

It offers several options for elements that you may want to remove (i.e., clean), listed in the table below. Alternatively, we can apply the manual approach that I showed in class 17.

| Option Name    | Option Short Code |
|----------------|-------------------|
| URL            | p.OPT.URL         |
| Mention        | p.OPT.MENTION     |
| Hashtag        | p.OPT.HASHTAG     |
| Reserved Words | p.OPT.RESERVED    |
| Emoji          | p.OPT.EMOJI       |
| Smiley         | p.OPT.SMILEY      |
| Number         | p.OPT.NUMBER      |
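
To give a feel for what removing these elements means, here is a minimal manual sketch using regular expressions. The patterns and the helper `manual_clean` are my own simplified assumptions; they are not the tweet preprocessor's actual rules and will miss edge cases.

```python
import re

# Simplified, hypothetical patterns (not the ones used by tweet-preprocessor)
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def manual_clean(tweet: str) -> str:
    """Remove URLs, @mentions and #hashtags, then collapse extra whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        tweet = pattern.sub("", tweet)
    return re.sub(r"\s+", " ", tweet).strip()

print(manual_clean("Great stay at @GraduateHotels in #ChapelHill https://example.com/pic"))
```

Note that removing a mention or hashtag also removes the word it carries, which can leave gaps in the sentence.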

In [None]:
# 1. You need to install it first:
!pip install tweet-preprocessor

In [None]:
# 2. Preprocess the tweets

# a. Import the preprocessor
import preprocessor as prepro

# b. Set options to remove URLs, reserved words, mentions, and hashtags
prepro.set_options(prepro.OPT.URL, prepro.OPT.RESERVED, prepro.OPT.MENTION, prepro.OPT.HASHTAG)

# c. Let's do it for all tweets
tweets['text']  = tweets['Tweet'].apply(prepro.clean)

# d. Check our work
tweets['text'].head(10)

- Because we scraped the tweets from the internet, the tweet preprocessor may not have dealt with special HTML entities such as the € symbol.
- We also want to remove line breaks, tabs and the @ and #.
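
As an alternative to deleting HTML entities, they can be decoded with Python's standard library. The sketch below is only an illustration of that option; `decode_and_tidy` is a hypothetical helper and is not the approach used in the next cell.

```python
import html
import re

def decode_and_tidy(text: str) -> str:
    """Decode HTML entities, then turn line breaks and tabs into single spaces."""
    text = html.unescape(text)            # e.g. '&amp;' -> '&', '&euro;' -> '€'
    text = text.replace("\xa0", " ")      # '&nbsp;' decodes to a non-breaking space
    return re.sub(r"\s+", " ", text).strip()

print(decode_and_tidy("Rooms from &euro;99 &amp; up\n\tbook now"))
```

Decoding keeps symbols such as € in the text, whereas deleting the entities drops them; which is preferable depends on whether prices and similar details matter for the analysis.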

In [None]:
# 3. Fix some things the preprocessor missed
htmlents = r'|'.join((r'&copy;',r'&reg;',r'&quot;',r'&gt;',r'&lt;',r'&nbsp;',r'&apos;',r'&cent;',r'&euro;',r'&pound;'))
tweets.text = tweets.text.replace(
    {htmlents:'',       # remove html punctuation codes
     '#|@':'',          # remove hashtag # and reference @, leaving tags (unless preprocessor removed already)
     '&amp;':' and ',   # &amp; to and
     '\n|\t':' '}, regex=True) # strip HTMLentries, hash tag markers, reference @, newlines
tweets.text = tweets.text.str.strip().replace({' +':' '},regex=True) # collapse extra spaces
# Check our Work
tweets.text.tail(10)

- Our data may include the same tweet multiple times.
- We will remove identical tweets before our analysis as follows:

In [None]:
# 4. Remove duplicate tweets and reindex

print(tweets.shape)
tweets.drop_duplicates(subset='text', keep="first", inplace=True)
tweets.drop_duplicates(subset='id', keep="first", inplace=True)
tweets.reset_index(drop=True, inplace=True)
print(tweets.shape)

## 2.4 BERTopic - A convenient tool for Topic Discovery

![BERTopic](https://maartengr.github.io/BERTopic/logo.png "BERTopic")

- Topic modeling technique that leverages sentence transformers and c-TF-IDF to
  * Create dense clusters of text
  * That allow for easily interpretable topics
  * Whilst keeping important words in the topic descriptions

- BERTopic is essentially a sequence of steps to create its topic representations. There are five steps to this process:

![BERTopic](https://maartengr.github.io/BERTopic/algorithm/default.svg "BERTopic")
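
The last step, c-TF-IDF, is what turns clusters into readable topic words. Below is a rough sketch in plain Python, based on my reading of the weighting formula in the BERTopic documentation: a word's share within a cluster times log(1 + A / f), where f is the word's count across all clusters and A is the average number of words per cluster. The function `c_tf_idf` and the toy clusters are illustrative only; BERTopic's own implementation works on a bag-of-words matrix and may differ in detail.

```python
import math
from collections import Counter

def c_tf_idf(cluster_docs):
    """Score words per cluster: tf(word, cluster) * log(1 + A / f(word))."""
    # Treat all documents of a cluster as one long document and count its words
    counts = {c: Counter(" ".join(docs).lower().split())
              for c, docs in cluster_docs.items()}
    total_per_word = Counter()
    for counter in counts.values():
        total_per_word.update(counter)
    avg_words = sum(total_per_word.values()) / len(counts)   # A

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {w: (n / n_words) * math.log(1 + avg_words / total_per_word[w])
                     for w, n in counter.items()}
    return scores

# Made-up clusters of short texts
toy_clusters = {
    0: ["great service", "nice waiter great service"],
    1: ["yummy food", "great food"],
}
scores = c_tf_idf(toy_clusters)
for c, word_scores in scores.items():
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:2]
    print(c, top)
```

Words that are frequent inside one cluster but rare elsewhere get the highest scores, while a word such as "great" that appears in both clusters is weighted down.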

### 2.4.1. Set-up BERTopic

- Install it
- Load it
- Fit it to our pre-processed Tweets

In [None]:
# 1. Let's install BERTopic
!pip install bertopic

In [None]:
# 2. Import libraries
#import numpy as np #(already done)
from bertopic import BERTopic

# 3. Set-up BERTopic model
topic_model = BERTopic(verbose=True)

# 4. Convert tweets to list
docs = tweets.text.to_list()

# 5. Find topics using BERTopic
topics, probabilities = topic_model.fit_transform(docs)

### 2.4.2. Explore discovered Topics
- Frequencies
- Words
- Visualize
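
As a quick cross-check on topic frequencies, the topic assignments can also be tallied by hand. This sketch assumes a list of integer topic ids like the one returned by `fit_transform`, in which BERTopic marks outliers as topic -1. `topic_summary` is a hypothetical helper and the toy list is made up.

```python
from collections import Counter

def topic_summary(topic_ids):
    """Return topic sizes (largest first, outliers excluded) and the outlier share."""
    sizes = Counter(topic_ids)
    n_outliers = sizes.pop(-1, 0)          # topic -1 collects tweets without a topic
    outlier_share = n_outliers / len(topic_ids)
    return sizes.most_common(), outlier_share

# Toy stand-in for the `topics` list returned by fit_transform
toy_topics = [0, 1, -1, 0, 2, 0, -1, 1]

sizes, outlier_share = topic_summary(toy_topics)
print(sizes)
print(f"{outlier_share:.0%} of the tweets have no topic")
```

A large outlier share is a hint that the clustering settings may be too strict for the data.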

In [None]:
# 1. Let's see how many topics we found (Topic -1 means that these tweets are not associated with any topic!)
topic_model.get_topic_info().head(11)

In [None]:
# 2. Let's look at the top words and their scores (c-TF-IDF weights) that are associated with an individual topic: Topic 1
topic_model.get_topic(1)

In [None]:
# 3. Let's visually explore topics
topic_model.visualize_topics()

In [None]:
# 4. We can also get a Barchart for the topics with the most relevant words
topic_model.visualize_barchart(top_n_topics=10)

In [None]:
# 5. Which topic would a tweet (or text) best fit into?
new_doc = "Had a hard time to get hall pass - hate it!"
topic, score = topic_model.transform([new_doc])
print(f'Best match is topic {topic[0]} with probability {score[0]}')

In [None]:
# 6. Find topics that a word is most likely associated with
pd.DataFrame(topic_model.find_topics("love")) # most relevant is with highest score (row 1, column 0), where topic number is in row 0, column 0.

In [None]:
# 7. Save a fitted BERTopic model
topic_model.save("graduatetweets")

# 7a. Load a fitted BERTopic model
graduate_model = BERTopic.load("graduatetweets")

# 7b. Test loaded model for same results
pd.DataFrame(graduate_model.find_topics("love")) # most relevant is with highest score (row 1, column 0), where topic number is in row 0, column 0.

### 2.4.3. Visualize the Tweet Landscape

Let's explore how all the Tweets we collected are related to one another in a 2D Map



In [None]:
# 1. Import libraries
#import numpy as np #(already done)
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
# from bertopic import BERTopic #already done

# set a seed for reproducible results
import numpy as np
np.random.seed(0)


# 2. Set-up BERTopic model with more control over its individual components
langmodel ="all-MiniLM-L6-v2"
clustering=HDBSCAN(min_cluster_size=10)
dimreduct=UMAP(n_components=2, random_state=0)
vects=CountVectorizer(ngram_range=(1, 3), stop_words="english")
topic_model = BERTopic(embedding_model=langmodel, umap_model=dimreduct, hdbscan_model=clustering, vectorizer_model=vects, verbose=True)

# 3. Convert tweets to list
docs = tweets.text.to_list()

# 5. Find topics using BERTopic
topics, probabilities = topic_model.fit_transform(docs)

In [None]:
# 6. Let's get the dimensionality reduced embeddings from BERTopic:
umap_embeddings = topic_model.umap_model.embedding_
print(umap_embeddings.shape)

In [None]:
# 7. Import Altair
import altair as alt

# 8. Create a new DataFrame that holds all the information we need for our map
import pandas as pd
source = pd.DataFrame(
    {'x': umap_embeddings[:, 0],
     'y': umap_embeddings[:, 1],
     'txt': docs,
     'Topic' : topics
     #'size'  : 100
    })

# 9. Define Bubbles on Map
bubbles = alt.Chart(source).mark_circle(size=100).encode(
    x=alt.X('x:Q', axis=alt.Axis(title="not directly interpretable", grid=False, labels=False),scale=alt.Scale(domain=[min(source.x)-1, max(source.x)+1])),
    y=alt.Y('y:Q', axis=alt.Axis(title="not directly interpretable", grid=False, labels=False),scale=alt.Scale(domain=[min(source.y)-1, max(source.y)+1])),
    #size='size',
    color = 'Topic:N',
    tooltip=[alt.Tooltip('txt', title='Tweet'),    # We can include a lot of information in the tooltips (mouseover pop-up)
             alt.Tooltip('Topic', title='Topic')
            ]
)

# # 4. Define Labels next to Bubbles on Map
# text = alt.Chart(source).mark_text(
#     align='left',
#     baseline='middle',
#     dx=10 # offset label in x coordinate
# ).encode(
#     x='x:Q',
#     y='y:Q',
#     text='txt',
#     #color = 'Topic:N'
#)

# 10. Visualize Bubbles in an interactive Map (the labels are commented out above)
bubbles.encode().interactive().properties(height=700,width=700,
                                          title="Graduate Hotel Tweets")# + text