# **Unstructured Data**: Introduction to Text Analysis

> Note: To run this notebook, download it and upload it to Google Colab. Then select a runtime and run the code cells. You are also encouraged to read all the content in this notebook!

Technologies Used:

![Python](https://img.shields.io/badge/python-v3.7+-blue.svg)
![Jupyter](https://img.shields.io/badge/jupyter-v1.0+-blue.svg)
![NLTK](https://img.shields.io/badge/nltk-v3.6.2-blue.svg)
![re](https://img.shields.io/badge/re-v2.2.1-blue.svg)
![collections](https://img.shields.io/badge/collections-v0.9-blue.svg)
![Google Colab](https://img.shields.io/badge/google_colab-v1.0+-blue.svg)


# 1. Text Analysis

#### What is Text Analysis?

- Text Analysis (also called text mining or text extraction) is about parsing texts in order to extract machine-readable facts from them
- The purpose of Text Analysis is to create structured data out of free text content
- The process can be thought of as slicing and dicing heaps of unstructured, heterogeneous documents into easy-to-manage and interpret data pieces

**Text Analysis** is the term that describes the very process of computational analysis of texts.  

vs.   

**Text Analytics** is a set of techniques and approaches that create *insights* by identifying *trends* and/or *patterns* in the prepared data.  

> ### ***Key Take-away:***  Text Analysis helps translate a text into the language of data. Once Text Analysis has “prepared” the content, Text Analytics kicks in to help make sense of these data.

### Text Analysis for Business

- Firms use Text Analysis to set the stage for a data-driven approach towards managing content.
- Once textual sources are sliced into easy-to-automate data pieces, a whole new set of opportunities opens for processes like:
    - decision making
    - product development
    - marketing optimization
    - business intelligence
    - process automation
    - and more
    

### The Big Picture: Natural Language Processing (NLP)

- NLP is a subfield of linguistics, computer science, information engineering, and artificial intelligence
- Concerned with the interactions between computers and human (natural) languages
- Central question: How to program computers to process and analyze large amounts of natural language data

**Challenges in NLP:**
- Speech recognition
- Natural language understanding
- Natural language translation
- Natural language generation


### **Today, we examine NLP foundations for Text Analysis**

## 1.1 Text: here, there, and everywhere!

Text is a very common type of unstructured data.

![Unstructured Data - EVERYWHERE!](https://mapxp.app/BUSI488/GoogleKFBS "Unstructured Data - EVERYWHERE!")


### Text is EVERYWHERE!





## 1.2 Text Analysis Tools and Tasks


![Text Analysis Tools and Tasks](https://mapxp.app/BUSI488/TextAnalysis-Tools+Tasks.jpg "Text Analysis Tools and Tasks")

### *Let's roll up our sleeves and build the foundations of text analysis together!*  

   
   

# 2. Regular Expressions

At the most basic level, Text Analysis helps analysts to:

1. Find information in Text
2. Extract information from Text
3. Count word frequencies



## 2.1 Introduction to Regular Expressions

#### *What are regular expressions (RegEx)?*

- A RegEx is a pattern string with syntax to allow alternatives, wildcards, and repeats
- A RegEx either matches another string or does not.
- Why are they defined this way?
   - Each RegEx converts to a simple program with no variables; fast matching!
   - There is no RegEx for the pattern "every ( has a closing )"; RegEx can't count...
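
As a minimal sketch of the three building blocks named above (the patterns and test strings are made up for illustration), `re.fullmatch` returns a match object when the whole string fits the pattern and `None` when it does not:

```python
import re

# Illustrative patterns (not taken from the notebook's data).
alternatives = r"gray|grey"   # "|" accepts either alternative
wildcard = r"b.t"             # "." stands for any single character
repeats = r"go+al"            # "+" repeats the preceding item one or more times

def matches(pattern, text):
    """Return True if the whole string fits the pattern, else False."""
    return re.fullmatch(pattern, text) is not None

print(matches(alternatives, "grey"))   # True
print(matches(wildcard, "bat"))        # True
print(matches(wildcard, "boot"))       # False: "." covers exactly one character
print(matches(repeats, "gooooal"))     # True
print(matches(repeats, "gal"))         # False: "+" needs at least one "o"
```

Each call answers a yes/no question, which mirrors the point that a RegEx either matches a string or does not.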


#### Applications of regular expressions:
- Find all web links or phone numbers in a document
- Parse email addresses, remove/replace unwanted characters


## 2.2 Basic RegEx

Python syntax: (Other languages will differ. E.g. SQL will use _ for . and % for .*)

![Basic RegEx](https://mapxp.app/BUSI488/RegEx.jpg "BasicRegEx")

### 2.2.1 Python's re (RegEx) module

- split: split a string on regex
- findall: find all patterns in a string
- search: search for a pattern
- match: match a pattern at the beginning of a string

**Example:**  re.split('PATTERN', 'Your string.')

- Pattern first, and the string second
- Returns a list (split, findall) or a match object, or None when nothing matches (search, match)
- Gemini AI can write these for you once you know they exist


In [None]:
# Splitting a Sentence into words

# 1. Import necessary modules
import re

# 2. Define text
my_string = "The ceiling is the roof!"

# 3. Split by Spaces
pattern = r"\s+"
display(re.split(pattern, my_string ))

In [None]:
import re

# 3. Split by Spaces
pattern = r"\s+"
word_list = re.split(pattern, my_string)
word_list

**Note**: the *r* before the pattern marks a raw string, in which backslashes are kept as literal characters instead of starting escape codes.

*For example:*

*'\n'* will be treated as a newline character, while **r**'\n' will be treated as the characters \ followed by n.
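
To make the difference concrete, here is a small sketch (the sample sentence is made up). It also shows why raw strings matter for patterns: in a regular string, `'\b'` is the backspace character, not the RegEx word boundary.

```python
import re

print(len('\n'))    # 1: a single newline character
print(len(r'\n'))   # 2: a backslash followed by the letter n

text = "the cat sat on the mat"

# Without the r prefix, '\b' becomes a backspace character, so nothing is found.
print(re.findall('\bthe\b', text))    # []

# With the r prefix, \b reaches the RegEx engine as a word boundary.
print(re.findall(r'\bthe\b', text))   # ['the', 'the']
```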

## 2.3 Splitting-up Text into Words: Excerpt of OpenAI/Microsoft Press Release 2023

In [None]:
# 1. Press Release Text
openai='Today, we are announcing the third phase of our long-term partnership with OpenAI through a multiyear, multibillion dollar investment to accelerate AI breakthroughs to ensure these benefits are broadly shared with the world. \
This agreement follows our previous investments in 2019 and 2021. It extends our ongoing collaboration across AI supercomputing and research and enables each of us to independently commercialize the resulting advanced AI technologies. \
Supercomputing at scale – Microsoft will increase our investments in the development and deployment of specialized supercomputing systems to accelerate OpenAI’s groundbreaking independent AI research. We will also continue to build out Azure’s leading AI infrastructure to help customers build and deploy their AI applications on a global scale. \
New AI-powered experiences – Microsoft will deploy OpenAI’s models across our consumer and enterprise products and introduce new categories of digital experiences built on OpenAI’s technology. This includes Microsoft’s Azure OpenAI Service, which empowers developers to build cutting-edge AI applications through direct access to OpenAI models backed by Azure’s trusted, enterprise-grade capabilities and AI-optimized infrastructure and tools. \
Exclusive cloud provider – As OpenAI’s exclusive cloud provider, Azure will power all OpenAI workloads across research, products and API services. \
“We formed our partnership with OpenAI around a shared ambition to responsibly advance cutting-edge AI research and democratize AI as a new technology platform,” said Satya Nadella, Chairman and CEO, Microsoft. “In this next phase of our partnership, developers and organizations across industries will have access to the best AI infrastructure, models, and toolchain with Azure to build and run their applications.” \
“The past three years of our partnership have been great,” said Sam Altman, CEO of OpenAI. “Microsoft shares our values and we are excited to continue our independent research and work toward creating advanced AI that benefits everyone.” \
Since 2016, Microsoft has committed to building Azure into an AI supercomputer for the world, serving as the foundation of our vision to democratize AI as a platform. Through our initial investment and collaboration, Microsoft and OpenAI pushed the frontier of cloud supercomputing technology, announcing our first top-5 supercomputer in 2020, and subsequently constructing multiple AI supercomputing systems at massive scale. OpenAI has used this infrastructure to train its breakthrough models, which are now deployed in Azure to power category-defining AI products like GitHub Copilot, DALL·E 2 and ChatGPT. \
These innovations have captured imaginations and introduced large-scale AI as a powerful, general-purpose technology platform that we believe will create transformative impact at the magnitude of the personal computer, the internet, mobile devices and the cloud. \
Underpinning all of our efforts is Microsoft and OpenAI’s shared commitment to building AI systems and products that are trustworthy and safe. OpenAI’s leading research on AI Alignment and Microsoft’s Responsible AI Standard not only establish a leading and advancing framework for the safe deployment of our own AI technologies, but will also help guide the industry toward more responsible outcomes.'

In [None]:
# prompt: How many times does "the" appear as a word in text openai?

import re
# \b marks a word boundary, so "the" is also counted at the very start or end of the text
words = re.findall(r'\bthe\b', openai.lower())
print(len(words))

In [None]:
# 2. Define Pattern
PATTERN =  r"\w+"

# 3. Use findall to get all words from a recent press release by OpenAI and Microsoft
set(re.findall(PATTERN, openai))

In [None]:
# 4. Let's use RegEX to extract some data from a sentence in the above excerpt:
my_text = 'Exclusive cloud provider – As OpenAI’s exclusive cloud provider, Azure will power all OpenAI workloads 24/7 across research, products and API services in 2023 and beyond.'

In [None]:
# 5. Split my_text on spaces and display the result
spaces = r"\s+"
display(set(re.split(spaces, my_text)))

In [None]:
# 6. Find all digits in my_text and display the result
digits = r"\d+"
display(re.findall(digits, my_text))

In [None]:
# 7. Find all unique words in my_text and display the result
words = r"\w+"
display(set(re.findall(words, my_text)))

***What is the difference between the output of 5. (split \s+)  and 7. (findall \w+)?***

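- Step 5 (split on \s+) splits on whitespace, so tokens keep attached punctuation such as commas, dashes, and periods.
- Step 7 (findall \w+) extracts only runs of word characters (letters, digits, underscore), so punctuation is dropped and 24/7 becomes 24 and 7.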
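
A small sketch of the question above on a shorter, made-up sentence makes the contrast easy to see side by side:

```python
import re

sample = "Azure will power workloads 24/7, in 2023!"

# Splitting on whitespace keeps punctuation attached to the tokens.
print(re.split(r"\s+", sample))
# ['Azure', 'will', 'power', 'workloads', '24/7,', 'in', '2023!']

# Collecting runs of word characters drops punctuation and breaks up "24/7".
print(re.findall(r"\w+", sample))
# ['Azure', 'will', 'power', 'workloads', '24', '7', 'in', '2023']
```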

In [None]:
import re
from collections import Counter

# Preprocess the text
text = re.sub(r'[^\w\s]', '', openai).lower() # Remove punctuation and lowercase
words = text.split()

# Count word occurrences
word_counts = Counter(words)

# Find and print words that appear more than once
for word, count in word_counts.items():
    if count > 1:
        print(f"{word}: {count}")

## 2.4 Finding Information in Text using RegEx

You have several options to find words or parts of words in a text. Three common functions are:
1. re.match()     - tries to match from the beginning
2. re.search()    - searches through entire string
3. re.findall()   - finds *all* the matches and returns them as a list of strings, with each string representing one match.

*What is the difference?*
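
One way to see the difference is to run all three functions with the same pattern on the same text. The sentence below is a made-up sample:

```python
import re

text = "Beat Duke! Carolina beats Duke."

# match only looks at the beginning of the string.
print(re.match(r"Duke", text))           # None: the text does not start with "Duke"
print(re.match(r"Beat", text).group())   # Beat

# search scans the whole string and stops at the first hit.
first = re.search(r"Duke", text)
print(first.group(), first.start())      # Duke 5

# findall collects every hit as a list of strings.
print(re.findall(r"Duke", text))         # ['Duke', 'Duke']
```

Note that `match` and `search` return `None` when nothing is found, so calling `.group()` on the result is only safe after checking for a hit.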

## 2.5 Extracting Information from Product Descriptions

![Swagtron Text Analysis](https://mapxp.app/BUSI488/SwagtronEB7.jpg "Swagtron Text Analysis")


In [None]:
# 1. Import required packages
from google.colab import drive

# 2. Mount google drive
drive.mount('/content/gdrive')

# 3. Change into the directory our data are in
%cd /content/gdrive/MyDrive/488/Class04

# 4. List files in current directory
!ls # special shell command to view the files in the current working directory

In [None]:
# 5. Load the file and display text
with open("simple_product_text.txt") as f:
    general_description = f.read()
display(general_description)

### 2.5.1 Brand Mentions in Text

How prevalent is a brand in a product description?

We can easily find out how many times your brand is mentioned in a text
- or in many descriptions
- or in reviews
- or in news
- or in any other text source

In [None]:
# 1. Define pattern you are looking for: Here a Brand Name
pattern = r'Shimano'

# 2. Use re.findall to find our pattern (the brand name) in entire text. We set a flag to ignore the case (i.e., allow any capitalization)
match = re.findall(pattern, general_description, flags=re.IGNORECASE)
print(match)

# 3. Because we get a list back, we import Counter to count how often each matched spelling appears. Returns a dict-like Counter!
from collections import Counter
print(Counter(match))

### 2.5.2 Extracting key performance indicators (KPIs) from text

Perhaps you are in search of specific types of numbers?

In [None]:
# 1. Define a pattern for anything with a percent
pattern = r'\w*.%'

# 2. Use re.findall to find every word that is followed by a % symbol
display(re.findall(pattern, general_description))

In [None]:
# Initially we used + (one or more) instead of * (zero or more), which is a mistake.  Do you see why?
display(re.findall(r'\w+.%', " 1%. then 2 % and 3%.  Then 45%. Finally 100%, which is 100 %."))
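
The `.` wildcard in the patterns above accepts any character, not just a digit or a space. A stricter alternative, offered here as one possible sketch rather than the only correct pattern, spells out what a percentage looks like: digits, an optional decimal part, at most one whitespace character, and then the percent sign.

```python
import re

sample = " 1%. then 2 % and 3%.  Then 45%. Finally 100%, which is 100 %."

# digits, optional decimal part, at most one whitespace character, then "%"
strict_percent = r"\d+(?:\.\d+)?\s?%"
print(re.findall(strict_percent, sample))
# ['1%', '2 %', '3%', '45%', '100%', '100 %']

# The "." wildcard lets non-numeric text slip through; the strict pattern does not.
print(re.findall(r"\w*.%", "!%"))         # ['!%']
print(re.findall(strict_percent, "!%"))   # []
```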

## 2.6 More Complex Regular Expressions

Regular expressions are extremely flexible and powerful. You can use them to find:
- groups of words using ()
- character ranges using []
- and either one *or* the other using "|"
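
The three constructs above can be sketched as follows. The order text and the patterns are invented for illustration:

```python
import re

text = "Order A-17 shipped on 2023-01-23; order B-204 ships on 2023-02-05."

# () groups: each pair of parentheses captures one part of the match.
dates = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(dates)       # [('2023', '01', '23'), ('2023', '02', '05')]

# [] character ranges: one capital letter from A to C, a hyphen, then digits.
order_ids = re.findall(r"[A-C]-[0-9]+", text)
print(order_ids)   # ['A-17', 'B-204']

# | alternation, inside a non-capturing group (?:...) so findall returns the whole match.
verbs = re.findall(r"(?:shipped|ships)", text)
print(verbs)       # ['shipped', 'ships']
```

When a pattern contains capturing groups, `re.findall` returns the captured parts (as tuples when there are several groups) instead of the full match.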


![More Complex RegEx](https://mapxp.app/BUSI488/ComplexRegEx.jpg "More Complex RegEx")

There is a great regular expression editor that you might want to try out: https://regex101.com/

***HINT:*** Use genAI to build your regular expressions. Then test them in the regular expression editor and check for corner cases (i.e., create and test rare and difficult inputs).


### 2.6.1 Extracting Phone Numbers from Text

Using regular expressions it is easy to extract even more complex numbers like phone numbers or social security numbers.

***Why might this be useful?***

In [None]:
# 1. Extracting phone numbers from a text
display(re.findall(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]', general_description))

In [None]:
# 3. The RegEx will find phone numbers of different formats!
alternative_text = '+79082343434 keeps going  8(912)2342554,  +7 982 342 some random words 911 77 more stuff 8-923-132-34-23\
                    +7 982 342 34 34! who knows! I call 919-962-8746 if I have questions.'

# 4. Display our Text
print(alternative_text)
print()

# 5. Collect all phone numbers
display(re.findall(r'[\+\(]?[1-9][0-9 .\-\(\)]{8,}[0-9]', alternative_text))
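
Dense patterns like the one above are easier to read when written with `re.VERBOSE`, which allows whitespace and comments inside the pattern. The sketch below restates the same pattern piece by piece; the sample sentence is made up.

```python
import re

phone_pattern = re.compile(r"""
    [\+\(]?             # optional leading + or (
    [1-9]               # first digit, never 0
    [0-9 .\-\(\)]{8,}   # at least 8 more digits, spaces, dots, hyphens, or parentheses
    [0-9]               # the match must end on a digit
""", re.VERBOSE)

sample = "Call 919-962-8746 today, or dial 911 in an emergency."
print(phone_pattern.findall(sample))   # ['919-962-8746']
```

The pattern requires at least ten characters in total, which is why the short number 911 is skipped. The same rule also means it can accept long digit runs that are not phone numbers, so results are worth spot-checking.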

### 2.6.2 Checking a Word's *Neighborhood* in Text

It might be more informative to also examine the words around a word of interest, that is, the context.

***Why?***

In [None]:
# Find the word(s) preceding and following a BRAND in a product description

# 1. Define the pattern (include forward and backward looking elements)
pattern= r'((?:\S+\s+){0,3}\bShimano\b\s*(?:\S+\b\s*){0,3})'

# 2. Find your brand with the words preceding and following it
match = re.findall(pattern, general_description, flags=re.IGNORECASE)
display(match)
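
The same idea can be wrapped in a small helper so the keyword and the window size become parameters. `keyword_in_context` and its arguments are hypothetical names for this sketch, and the sample sentence is made up:

```python
import re

def keyword_in_context(text, keyword, n_words=3):
    """Return each occurrence of keyword with up to n_words on either side."""
    pattern = (
        r"(?:\S+\s+){0," + str(n_words) + r"}"            # up to n_words tokens before
        + r"\b" + re.escape(keyword) + r"\b"               # the keyword as a whole word
        + r"\s*(?:\S+\b\s*){0," + str(n_words) + r"}"     # up to n_words tokens after
    )
    found = re.findall(pattern, text, flags=re.IGNORECASE)
    return [m.strip() for m in found]

sample = "The bike has a Shimano 7-speed shifter and disc brakes."
print(keyword_in_context(sample, "shimano"))
# ['bike has a Shimano 7-speed shifter and']
```

`re.escape` keeps brand names that contain special characters (for example a dot or a plus sign) from being read as RegEx syntax.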

# 3. Tokenization: A higher-level way to split Text into Words

- Turning a string or document into **tokens** (smaller chunks)
- An early step in preparing a text for Text Analytics and NLP
- Many different theories and rules exist about what these text chunks should look like

**Great News! Today, you already learned how to create your own rules using regular expressions!**

*Some examples of what tokenization normally does:*

- Breaking out words or sentences
- Separating punctuation
- Separating all hashtags in a tweet

--> Many analysts use the Python library called **nltk** (natural language toolkit) for tokenization.

***Let's use nltk to tokenize some text!***

In [None]:
import nltk
nltk.download('punkt_tab')

In [None]:
# 1. Import the Tokenizer
from nltk.tokenize import word_tokenize

# 2. And start tokenizing right out of the box!
print(word_tokenize("What is RedNote, the Chinese social media app that US TikTokers are flocking to?")) # https://edition.cnn.com/2025/01/14/tech/rednote-china-popularity-us-tiktok-ban-intl-hnk/index.html
print(word_tokenize("How should we test AI for human-level intelligence? OpenAI's o3 electrifies quest")) # https://www.nature.com/articles/d41586-025-00110-6

## 3.1 Why Tokenize?

Helps us with simple text processing tasks:

- Easier to map part of speech
- Matching common words
- Removing unwanted tokens (e.g., common words, repeated words, etc.)

Example text: "I don't like Dr. D's bowtie!"  
Tokenized: ['I', 'do', "n't", 'like', 'Dr.', 'D', "'s", 'bowtie', '!']  

***What can we learn?***
- negation from "n't"
- possession from "'s"

***Tokenization can give us a first hint at the meaning of the text***
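
For contrast, here is what a naive RegEx rule produces on the same sentence. This is a sketch of a hand-written rule, not of how nltk works internally:

```python
import re

text = "I don't like Dr. D's bowtie!"

# Naive rule: a token is either a run of word characters or one punctuation mark.
naive_tokens = re.findall(r"\w+|[^\w\s]", text)
print(naive_tokens)
# ['I', 'don', "'", 't', 'like', 'Dr', '.', 'D', "'", 's', 'bowtie', '!']
```

Compared with the tokenized output listed above, the naive rule breaks "don't" into three pieces and separates "Dr" from its period, so the cues for negation ("n't") and possession ("'s") are lost.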

***Good News!*** You have several powerful tokenizers installed on your system with nltk

- **sent_tokenize**: tokenize a document into sentences
- **regexp_tokenize**: tokenize a string or document based on a regular expression pattern
- **TweetTokenizer**: special class just for tweet tokenization, allowing you to separate hashtags, mentions, and lots of exclamation points!!!



In [None]:
# 1. Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# 2. load the file and display text
with open("simple_product_text.txt") as f:
    general_description = f.read()
display(general_description)

In [None]:
# 3. Split general_description into sentences:
sentences = sent_tokenize(general_description)
display(sentences)

In [None]:
# 4. Let's use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])
display(tokenized_sent)

In [None]:
# 5. Finally, let's generate a set of unique tokens in the entire general_description (set is practical because mathematically, every element of a set is unique!)
unique_tokens = set(word_tokenize(general_description))
display(unique_tokens)

# 4. Tokenizing Social Media: UGC from X(Twitter)


![Tokenizing Social Media ](https://www.mapxp.app/BUSI488/tokenizesocialmedia.jpg "Tokenizing Social Media ")

Social Media is a frequently used source of consumer and business insights.  
However, social media posts present some challenges:
- Many acronyms (lol, IMHO, BTW)
- Very short text
- Little punctuation
- Emojis 🤓
- Hashtags (#)
- Mentions (@)

Let's build a more complex tokenizer for posts with hashtags and mentions using nltk and regex.  

The nltk.tokenize.TweetTokenizer class provides us with some extra methods and attributes for parsing tweets.

In [None]:
# Below are 4 tweets that we will analyze

tweets = ['This is the best course @KFBS ever! #AI #datascience 🤓',
         'FB’s stock finally crashed, drive by three things: competition (#TikTok), loss of access to data (#Apple ATT), and huge spending on virtual reality. Hard to see any of these issues going away soon. I went on @CNBC @SquawkStreet to talk about it.',
         '#NLP is SUPER cool <3 bc it helps us understand the World better :) #learning',
         'Thanks @KFBS for updating your values to #integrity, #inclusion, #impact and #innovation']

# What do you notice about these tweets?

## 4.1 Extracting Hashtags from Tweets

In [None]:
# 1. Start by importing the necessary modules
from nltk.tokenize import regexp_tokenize

# 2. Define a regex pattern to find hashtags: pattern
pattern = r'#\w+'

# 3. Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern)
display(hashtags)

## 4.2 Extracting Mentions from Tweets

In [None]:
# 1. Write a pattern that matches mentions (@)
pattern = r'([@]\w+)'

# 2. Use the pattern on the second tweet in the tweets list
mentions = regexp_tokenize(tweets[1], pattern)
display(mentions)

## 4.3 Extracting Hashtags and Mentions from Tweets

In [None]:
# 1. Write a pattern that matches mentions (@) and hashtags (#)
pattern = r'([@#]\w+)'

# 2. Use the pattern on the second tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[1], pattern)
display(mentions_hashtags)
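
The same extraction also works with plain `re` and can be applied to every post in a loop. The sketch below uses two posts adapted from the tweets list; `extract_tags` is a hypothetical helper name.

```python
import re

# Two sample posts adapted from the tweets list above.
sample_tweets = [
    "This is the best course @KFBS ever! #AI #datascience",
    "#NLP is SUPER cool <3 bc it helps us understand the World better :) #learning",
]

def extract_tags(tweet):
    """Return (hashtags, mentions) found in a single post."""
    tags = re.findall(r"[@#]\w+", tweet)
    hashtags = [t for t in tags if t.startswith("#")]
    mentions = [t for t in tags if t.startswith("@")]
    return hashtags, mentions

for tweet in sample_tweets:
    print(extract_tags(tweet))
# (['#AI', '#datascience'], ['@KFBS'])
# (['#NLP', '#learning'], [])
```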

## 4.4 The TweetTokenizer

Doing everything manually is only fun for some time ...

There are easier options for Tweets!

In [None]:
# Use the TweetTokenizer to tokenize all tweets into one list
from nltk.tokenize import TweetTokenizer

# 1. Instantiate the model (i.e., the tweet tokenizer)
tknzr = TweetTokenizer()

# 2. Feed all tweets into it, tweet by tweet to get the tokens for each tweet in a list of lists
all_tokens = [tknzr.tokenize(t) for t in tweets]

# 3. Show list of lists that contain the tokens of each tweet
display(all_tokens)

# 5. Bag-of-Words (BoW)

![ Bag-of-Words](https://www.mapxp.app/BUSI488/bagofwords.jpg " Bag-of-Words")


***Basic method for finding topics in a text***
- Need to first create tokens using tokenization
- Then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text

**Example Text:**  


> *The ceiling is the roof! Beat DUKE! Carolina beats Duke University in terms of NCAA titles. CAROLINA and DUKE are Rivals forever! And we'll continue beating them. Carolina rules the UNIVERSE!*


***Let's create a Bag of Words (BoW)!***

In [None]:
# 1. Start by importing the necessary modules
from nltk.tokenize import word_tokenize
from collections import Counter

# 2. Define your text and tokenize it
my_text = "The ceiling is the roof! Beat DUKE! Carolina beats Duke University in terms of NCAA titles. CAROLINA and DUKE are Rivals forever! And we'll continue beating them. Carolina rules the UNIVERSE!"
BoW = word_tokenize(my_text)

# 3. Use Counter to find the word frequency - creates a dict-like Counter
token_counts = Counter(BoW)
display(token_counts)

## 5.1 Get Ready to Preprocess Text

**Challenges in Text Analysis**
- There is a lot of variation in the way people write text
- Not every character and/or string in a text is relevant to us
- Sometimes several words essentially mean the same thing
- Sometimes several words look like they mean the same thing, but actually don't

*By preprocessing text, we can start to overcome these challenges!*

**Preprocessing of text can include many steps such as:**

- Tokenization to create a bag of words
- Lowercasing words
- Stemming (shortening words to their root stems)
- Lemmatization
- Removing stop words, punctuation, and/or unwanted tokens

***Good to experiment with different approaches!***
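
To show how these steps chain together, here is a compact sketch that uses only the standard library. The stop-word list is a hypothetical, hand-picked set and the RegEx tokenizer is a simplification; the nltk versions of each step appear in the sections that follow.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "in", "of", "and", "are", "we", "them"}

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

sample = "The ceiling is the roof! Beat DUKE! Carolina beats Duke University."
clean_tokens = preprocess(sample)
print(clean_tokens)
# ['ceiling', 'roof', 'beat', 'duke', 'carolina', 'beats', 'duke', 'university']
print(Counter(clean_tokens).most_common(1))
# [('duke', 2)]
```

Lowercasing is what lets "DUKE" and "Duke" count as the same token, while "beat" and "beats" stay separate until stemming or lemmatization is applied.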


In [None]:
# 0. Please run pip install on your computer to install the nltk library (natural language toolkit)
# It is already installed on CoLab!
# !pip install nltk

# 1. Now import it and download:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# 2. Alternatively, download everything! But that takes a moment and we won't need it.
#nltk.download() # type when asked what to do: "d all"

## 5.2 Lower-case for easier processing

Upper and lower-case words are tokenized as different words, even when they are the exact same word.
- Sometimes desirable (e.g., sentiment analysis)
- Often not desirable

***Let's fix it!***

In [None]:
# 1. Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in BoW]

# 2. Show frequencies of lower case tokens
token_counts = Counter(lower_tokens)
display(token_counts)

In [None]:
# 3. We can output the three most common tokens to make things easier to read:
token_counts.most_common(3)

## 5.3 Stopwords

Text usually contains a lot of stopwords that are not necessarily meaningful to our analysis:
- A stop word is a commonly used word (such as “the”, “a”, “an”, “in”)
- What exactly these words are depends on the analyst (or the NLP software they use)
- We can easily remove stopwords from our text using the nltk library

***Let's remove some stopwords!***

In [None]:
# Let's remove some stopwords from our text

# 1. Import required packages
from nltk.corpus import stopwords

# 2. Remove stopwords (build the set once so each lookup is fast)
stop_words = set(stopwords.words('english'))
no_stops = [t for t in lower_tokens
            if t not in stop_words]

# 3. Let's take a look at our text without stopwords
display(no_stops)

# 4. Let's count the three most common tokens again
display(Counter(no_stops).most_common(3))

## 5.4 Removing Punctuation

- We can make our analysis even simpler by removing punctuation from our text.   
- A straight-forward approach is to remove all words that are not alphabetic

***Let's drop the punctuation!***

In [None]:
# 1. Create tokens
no_punct = [w for w in no_stops
        if w.isalpha()]  # isalpha() returns True if the string contains only alphabetic characters

# 2. Let's take a look at our text without punctuation
display(no_punct)

# 3. Let's count the three most common tokens again
display(Counter(no_punct).most_common(3))

## 5.5 Stemming

Notice how the words "beat", "beats" and "beating" are each counted only once although they essentially mean the same thing.   


- Languages we speak and write are made up of several words often derived from one another.
- A language whose words change form as their use in speech changes is called an **inflected language**.  


- A **stem** (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as:  
  - -ed, -ize, -s   
  - de-, mis-
- Stems are created by removing the suffixes or prefixes used with a word.
- So stemming a word or sentence may result in words that are not actual words.


#### NLTK helps us stem words with so-called ***Stemmers***
- **PorterStemmer**: oldest (1979), uses *Suffix Stripping*
- **LancasterStemmer**: newer (1990), more aggressive, iterative, can "over-stem" more easily  

***Beware!*** Over-stemming can give you word stems that have no interpretable meaning!
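
To illustrate suffix stripping and its pitfalls, here is a toy stemmer. It is a deliberately crude sketch with a made-up suffix list, not the Porter or Lancaster algorithm:

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ation", "ing", "ity", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["beats", "beating", "Consultation", "University", "news"]:
    print(w, ":", toy_stem(w))
# beats : beat
# beating : beat
# Consultation : consult
# University : univers
# news : new
```

The first three results are helpful, but "univers" is not a real word and "news" collapses into the unrelated word "new", which is the kind of over-stemming the warning above refers to.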

![Stemming](https://www.mapxp.app/MBA742/stemming.jpg "Stemming")

In [None]:
# Stemming Example

# 1. Import stemming module
from nltk.stem import PorterStemmer

# 2. Instantiate the stemmer
ps = PorterStemmer()

# 3. Let's do some stemming!
for w in ['Consultant', 'Consulting', 'Consultation', 'Consultants', 'Consult']:
    print(w, " : ", ps.stem(w))


***Alright, let's clean-up our text about the Carolina and Duke Rivalry!***

In [None]:
# 1. Import stemming module
from nltk.stem import PorterStemmer

# 2. Instantiate the stemmer
ps = PorterStemmer()

# 3. Let's do some stemming!
for w in no_punct:
    print(w, " : ", ps.stem(w))

# 4. Let's count the three most common tokens again
stemmed = [ps.stem(w) for w in no_punct]
display(Counter(stemmed).most_common(3))

![Carolina Beats Duke](https://www.mapxp.app/MBA742/carolinabeatsduke.jpg "Carolina Beats Duke")

## 5.6 Lemmatization

***University = Universe?***  

> *Universal Studio's movie about a University's study in which students studied the universe is universally acclaimed*


- Lemmatization is the algorithmic process of finding the lemma of a word ***depending on its meaning***.
- It usually refers to the morphological analysis of words, which aims to remove inflectional endings.
- It helps in returning the ***base or dictionary form of a word***, which is known as the _lemma_.

*A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.*

**Why is Lemmatization better than Stemming?**

- Stemming algorithms essentially cut the suffix or prefix from a word.
- Lemmatization takes into consideration morphological analysis of words.
    - Returns the lemma which is the base form of all its inflectional forms.
    - In-depth linguistic knowledge is required to create dictionaries and look for the proper form of the word.
- Stemming can be thought of as a more general, rule-based operation, while lemmatization is a more informed one that relies on a dictionary.
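
The dictionary idea can be sketched with a tiny hand-made lexicon keyed on the word and its part of speech. The entries and the `toy_lemmatize` helper are hypothetical; real lemmatizers rely on large linguistic resources.

```python
# Tiny hand-made lexicon for illustration; real lemmatizers use large dictionaries.
LEMMAS = {
    ("studied", "v"): "study",
    ("studies", "n"): "study",
    ("meeting", "n"): "meeting",   # "a meeting" stays a noun
    ("meeting", "v"): "meet",      # "we are meeting" goes back to the verb
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form for a (word, part-of-speech) pair."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(toy_lemmatize("meeting", "n"))   # meeting
print(toy_lemmatize("meeting", "v"))   # meet
print(toy_lemmatize("better", "a"))    # good
```

The same surface form maps to different lemmas depending on how it is used, and "better" maps to "good", which no suffix-stripping rule could produce.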

In [None]:
# Lemmatization Example

# 1. Import lemmatization module
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')

# 2. Instantiate the lemmatizer
le = WordNetLemmatizer()

# 3. Let's do some Lemmatization!
print("Lemma:")
for w in ['University', 'Universe', 'Universal']:
    print(w, " : ", le.lemmatize(w))

# 4. vs. Stemming
print("\nStem:")
for w in ['University', 'Universe', 'Universal']:
    print(w, " : ", ps.stem(w))
print("\n")

# 6. Naïve Topic Discovery

By studying the frequency of relevant words in a text, we can potentially learn what the text is about.

--> We might identify the central ***Topic(s)*** of a text without having to read the text itself.

## 6.1 Topic Discovery in Tesla's 2020 10-K



---
The COVID-19 pandemic impacted our business and financial results in 2020. The temporary suspension of production at our factories during the first half of 2020 caused production limitations that, together with reduced or closed government and third party partner operations in the year, negatively impacted our deliveries and deployments in 2020. While we resumed operations at all of our factories worldwide, our temporary suspension at our factories resulted in idle capacity charges as we still incurred fixed costs such as depreciation, certain payroll related expenses and property taxes. As part of our response strategy to the business disruptions and uncertainty around macroeconomic conditions caused by the COVID-19 pandemic, we instituted cost reduction initiatives across our business globally to be commensurate to the scope of our operations while they were scaled back in the first half of 2020. This included temporary labor cost reduction measures such as employee furloughs and compensation reductions. Additionally, we suspended non-critical operating spend and opportunistically renegotiated supplier and vendor arrangements. As part of various governmental responses to the pandemic granted to companies globally, we received certain payroll related benefits which helped to reduce the impact of the COVID-19 pandemic on our financial results. Such payroll related benefits related to our direct headcount have been primarily netted against our disclosed idle capacity charges and they marginally reduced our operating expenses. The impact of the idle capacity charges incurred during the first half of 2020 were almost entirely offset by our cost savings initiatives and payroll related benefits.

---
***Your new Text Analysis skills might help!***


*Source* https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm#ITEM_7A_QUANTITATIVE_QUALITATIVE_DISCLOS


In [None]:
# 0. Define our text
tesla10k2020 = "The COVID-19 pandemic impacted our business and financial results in 2020. The temporary suspension of production at our factories during the first half of 2020 caused production limitations that, together with reduced or closed government and third party partner operations in the year, negatively impacted our deliveries and deployments in 2020. While we resumed operations at all of our factories worldwide, our temporary suspension at our factories resulted in idle capacity charges as we still incurred fixed costs such as depreciation, certain payroll related expenses and property taxes. As part of our response strategy to the business disruptions and uncertainty around macroeconomic conditions caused by the COVID-19 pandemic, we instituted cost reduction initiatives across our business globally to be commensurate to the scope of our operations while they were scaled back in the first half of 2020. This included temporary labor cost reduction measures such as employee furloughs and compensation reductions. Additionally, we suspended non-critical operating spend and opportunistically renegotiated supplier and vendor arrangements. As part of various governmental responses to the pandemic granted to companies globally, we received certain payroll related benefits which helped to reduce the impact of the COVID-19 pandemic on our financial results. Such payroll related benefits related to our direct headcount have been primarily netted against our disclosed idle capacity charges and they marginally reduced our operating expenses. The impact of the idle capacity charges incurred during the first half of 2020 were almost entirely offset by our cost savings initiatives and payroll related benefits."

# 1. Start by importing the necessary modules
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

# 2. Tokenize
tokens = word_tokenize(tesla10k2020)

# 3. Lower-case
tokens = [t.lower() for t in tokens]

# 4. Remove Stopwords (build the set once so each lookup is fast)
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens
            if t not in stop_words]
# 5. Remove Punctuation
tokens = [w for w in tokens
        if w.isalpha()]
# 6. Lemmatize
from nltk.stem import WordNetLemmatizer
le = WordNetLemmatizer()
lemmatized = [le.lemmatize(w) for w in tokens]

# 7. Show word frequencies
display(Counter(lemmatized).most_common(20))


## 6.2 Word Clouds

- Also known as **Tag Clouds**
- Visual representation of text data
- Typically used to visualize:
    - keyword metadata (tags) on websites
    - free form text.
- Tags are usually single words
    - importance of each tag is shown with:
        - font size
        - or color
- Useful for quickly spotting the most prominent terms and judging their relative prominence.

*Source: Wikipedia*
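
To make the font-size idea concrete, here is a minimal sketch that maps word counts to font sizes by linear scaling. The function name `font_sizes`, the size limits, and the toy tag list are assumptions made for illustration; real word cloud libraries use their own, more elaborate scaling and layout rules.

```python
from collections import Counter

def font_sizes(words, min_size=10, max_size=110):
    """Map each distinct word to a font size that grows with its count."""
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    sizes = {}
    for word, n in counts.items():
        # If all words are equally frequent, give every word the largest size
        share = 1.0 if hi == lo else (n - lo) / (hi - lo)
        sizes[word] = round(min_size + share * (max_size - min_size))
    return sizes

# Hypothetical list of tags
tags = ["python", "data", "python", "text", "data", "python"]
print(font_sizes(tags))
```

The most frequent tag gets `max_size`, the least frequent gets `min_size`, and everything else falls in between.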

### 6.2.1 Word Clouds from Text

> Some folks love Word Clouds.

> Personally, I don't.

> Nonetheless, I'll show how to easily generate them.   

**Conveniently, the Python module we will use does the preprocessing for us!**   
All we have to do is pass a text (as a string) to it.


For more details, see: https://www.datacamp.com/community/tutorials/wordcloud-python

In [None]:
# Install WordCloud on your local computer (already installed on Colab)
!pip install wordcloud

In [None]:
# 1. Import modules
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# 2. Generate Word Cloud
wordcloud = WordCloud(collocations=True, width=800, height=500, random_state=5, max_font_size=110).generate(tesla10k2020)

# 3. Visualize Cloud
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

# 7. Sentiment Analysis

**Sentiment Analysis**, or **Opinion Mining**, is a sub-field of Natural Language Processing (NLP)
- Tries to identify and extract opinions within a given text.

**Definition of Sentiment**
1. A view of or attitude toward a situation or event; an opinion.  
"I agree with your sentiments regarding the road bridge"  


2. Exaggerated and self-indulgent feelings of tenderness, sadness, or nostalgia.  
"many of the appeals rely on treacly sentiment"

*Source: Oxford Dictionary*

**In this course, we will measure the polarity of text.**

**Polarity** in sentiment analysis refers to identifying sentiment orientation:
- positive
- neutral
- negative   

in written or spoken language


**Objective of Sentiment Analysis**

Gauge the attitudes, sentiments, evaluations, and emotions of a speaker/writer based on the computational treatment of subjectivity in a text.

## 7.1 Importance of Sentiment Analysis

- Enables companies to make sense out of unstructured data such as UGC (user generated content), news, or financial reports.
- Automated way to extract insights about perceptions, experiences, or positions toward something.

### **Sentiment Analysis for Businesses and Organizations**

>**Financial Market Forecasting:** Analyzing news articles and social media posts related to financial markets. Positive sentiment may suggest bullish trends, while negative sentiment may indicate bearish trends.

>**Brand Reputation and Crisis Management:** Monitoring social media and news articles. Real-time sentiment analysis allows businesses to respond quickly, address issues, and manage their brand reputation effectively.

>**Political Campaigns and Elections:** Analyzing social media discussions and news articles during political campaigns. Fine-tune campaign strategies and messaging to resonate with voters.

>**Employee Engagement and HR Management:** Analyzing employee feedback and surveys. Improve workplace culture and employee retention.

>**Strategic Planning:** Analyzing industry news and competitor mentions. Positive sentiment for competitors may suggest they are gaining market share, while negative sentiment can help identify vulnerabilities in the competition, leading to informed strategic decisions.

## 7.2 Advantage of Sentiment Analysis
- Sifting through huge volumes of text is **difficult** and **time-consuming**
- Requires **expertise** and **resources**

Sentiment Analysis:
- Enables firms to make sense of large amounts of textual data ***in an automated way***
- Allows firms to elicit vital insights from a vast unstructured dataset without having to manually process it

## 7.3 Limitations of Sentiment Analysis


1. Understanding emotions through text is not always easy:
    - Expecting 100% accuracy from a computer is not realistic
    - A text may contain ***multiple*** sentiments (polarities) all at once
        - *"The coffee was great, but the service could have been better."*
    - ***Figurative Speech*** is difficult for machines to understand
        - *"That coffee tastes very interesting"*  
        

2. Micro-blogging content from social media platforms such as X (Twitter) and Facebook poses serious challenges:  
    - large amount of data
    - language and expressions used to express sentiment
        - short forms
        - memes
        - emoticons


## 7.4 Sentiment Scoring - Polarity

In [None]:
# 0. Run once to install the Vader Sentiment Classification Package (if it is not already installed on your computer)
!pip install vaderSentiment

In [None]:
# 1. Import the module you need
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Instantiate the sentiment analyzer
analyser = SentimentIntensityAnalyzer()

In [None]:
# 3. Define a function that returns the polarity score of a sentence
def sentiment_analyzer_scores(sentence):
    score = analyser.polarity_scores(sentence)
    print("{:-<55} {}".format(sentence, str(score)), "\n")

In [None]:
# 4. Test how well Vader does
my_tweet = "UNC is not the best place to study data science for business!"
sentiment_analyzer_scores(my_tweet)

In [None]:
# 5. Print the individual polarity scores
sentiment_dict = analyser.polarity_scores(my_tweet)
print("Compound sentiment is", sentiment_dict['compound'], "\n")
print("Sentence was rated as", sentiment_dict['neg']*100, "% Negative")
print("Sentence was rated as", sentiment_dict['neu']*100, "% Neutral")
print("Sentence was rated as", sentiment_dict['pos']*100, "% Positive")

### 7.4.1 Interpreting Polarity

- The ***Positive***, ***Negative*** and ***Neutral scores*** represent the proportion of text that falls in these categories.   


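- For example, scores of pos = 0.46, neu = 0.54, and neg = 0.0 mean that a sentence was rated as 46.0% Positive, 54.0% Neutral, and 0% Negative (the scores you get for `my_tweet` may differ).
    - the three proportions should add up to 1   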
    

- The ***Compound score*** is a metric that calculates the sum of all the lexicon ratings
    - which have been normalized between:
        - -1 (most extreme negative) and
        - +1 (most extreme positive)
    - the ranges of the compound score are:
        - positive sentiment: compound score >= 0.05
        - neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
        - negative sentiment: compound score <= -0.05
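
The normalization and the thresholds listed above can be sketched in a few lines. The helper names `normalize` and `label` are hypothetical. The formula x / sqrt(x² + 15) mirrors, to the best of my knowledge, what the vaderSentiment package does internally, so treat it as an illustration rather than the library's API.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the range [-1, 1]."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label(compound):
    """Turn a compound score into a label using the thresholds listed above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Hypothetical raw sums of lexicon ratings
for raw in (3.1, 0.1, -2.5):
    c = normalize(raw)
    print(f"raw sum {raw:+.1f} -> compound {c:+.4f} -> {label(c)}")
```

Because of the square root in the denominator, large raw sums approach -1 or +1 without ever exceeding them.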

***DIY*** Try editing `my_tweet`:
- write "good" in all caps
- add exclamation marks
- add an emoji (e.g., 😊)
- add adjectives or adverbs (e.g., degree modifiers)
- use synonyms for words
- add a conjunction (e.g., "but") to signal a shift in sentiment
- negate the sentence with a tri-gram (e.g., "my coffee isn't really all that great")
- turn it into a negative statement about the coronavirus
- use some slang (e.g., "that virus really SUX!")  

***What do you observe?*** *Paste your example text and the compound score into the Zoom chat window*

### 7.4.2 Where do Vader's scores come from?

***The authors of Vader built a rule-based sentiment analysis engine that uses a dictionary to classify sentiment***

**To build their dictionary, they did the following:**

- They collected sentiment ratings from 10 independent human raters (all pre-screened, trained, and quality checked for optimal inter-rater reliability).
- Over 9,000 token features were rated on a scale from "[-4] Extremely Negative" to "[4] Extremely Positive", with allowance for "[0] Neutral (or Neither, N/A)".
- They kept every lexical feature that had a non-zero mean rating and whose standard deviation was less than 2.5, as determined by the aggregate of those ten independent raters.
- This left just over 7,500 lexical features with validated valence scores that indicate both the sentiment polarity (positive/negative) and the sentiment intensity on a scale from -4 to +4.
- For example:
    - the word "okay" has a positive valence of 0.9, "good" is 1.9, and "great" is 3.1
    - whereas "horrible" is -2.5, the frowning emoticon :( is -2.2, and "sucks" and its slang derivative "sux" are both -1.5.
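
As a rough illustration of how a dictionary-based scorer uses such valence scores, the sketch below sums the valences of the words quoted above. The names `TOY_LEXICON` and `raw_valence` are hypothetical, and the sketch deliberately ignores Vader's additional rules (negation, capitalization, punctuation, degree modifiers).

```python
# Valence scores quoted in the text above (a tiny subset of the full lexicon)
TOY_LEXICON = {"okay": 0.9, "good": 1.9, "great": 3.1,
               "horrible": -2.5, ":(": -2.2, "sucks": -1.5, "sux": -1.5}

def raw_valence(text):
    """Sum the valence of every known token; unknown tokens count as 0."""
    tokens = text.lower().split()
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)

print(round(raw_valence("the coffee was great"), 2))
print(round(raw_valence("good coffee but horrible service :("), 2))
```

Splitting on whitespace is a simplification: a token such as "great!" would not be found in the dictionary, which is one reason real tools tokenize more carefully.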

-----------------

***Excerpt from Vader's sentiment lexicon (dictionary) that can be found in the file "vader_lexicon.txt"***

| Word          | Mean rating (valence) | Standard deviation | Ratings of 10 humans                     |
|---------------|-----------------------|--------------------|------------------------------------------|
| brightly      | 1.5      | 0.67082   | [2, 3, 1, 2, 1, 1, 2, 1, 1, 1]           |
| brightness    | 1.6      | 0.91652   | [2, 2, 1, 1, 1, 3, 3, 0, 2, 1]           |
| brightnesses  | 1.4      | 0.91652   | [2, 3, 1, 2, 1, 1, 0, 0, 2, 2]           |
| brights       | 0.4      | 0.66332   | [0, 0, 2, 0, 0, 1, 0, 0, 1, 0]           |
| brightwork    | 1.1      | 0.83066   | [1, 0, 1, 2, 1, 0, 3, 1, 1, 1]           |
| brilliance    | 2.9      | 0.83066   | [4, 3, 2, 4, 4, 3, 2, 3, 2, 2]           |
| brilliances   | 2.9      | 0.83066   | [3, 4, 3, 4, 4, 2, 3, 2, 2, 2]           |
| brilliancies  | 2.3      | 1.18743   | [1, 4, 1, 3, 3, 2, 1, 3, 4, 1]           |
| brilliancy    | 2.6      | 1.0198    | [4, 3, 2, 4, 2, 3, 1, 3, 1, 3]           |
| brilliant     | 2.8      | 0.6       | [2, 3, 3, 2, 3, 3, 4, 2, 3, 3]           |
| brilliantine  | 0.8      | 1.16619   | [-1, 3, 1, 0, 1, 0, 2, 0, 2, 0]          |
| brilliantines | 2        | 1.34164   | [0, 1, 4, 2, 3, 1, 3, 0, 3, 3]           |
| brilliantly   | 3        | 0.44721   | [3, 2, 3, 3, 3, 3, 3, 3, 4, 3]           |
| brilliants    | 1.9      | 0.83066   | [3, 1, 2, 1, 2, 1, 3, 2, 1, 3]           |
| brisk         | 0.6      | 0.8       | [0, 0, 0, 0, 1, 1, 0, 2, 0, 2]           |
| broke         | -1.8     | 0.4       | [-2, -2, -2, -2, -1, -2, -2, -1, -2, -2] |
| broken        | -2.1     | 0.53852   | [-2, -2, -2, -2, -3, -2, -1, -3, -2, -2] |
| brooding      | 0.1      | 1.3       | [3, 0, -1, -1, -1, 1, 1, -1, 1, -1]      |
| brutal        | -3.1     | 0.7       | [-3, -3, -4, -2, -3, -4, -3, -4, -3, -2] |
| brutalise     | -2.7     | 1.1       | [-4, -3, -3, -4, -3, -2, -2, -3, 0, -3]  |
| brutalised    | -2.9     | 0.83066   | [-3, -3, -2, -3, -3, -4, -4, -1, -3, -3] |
| brutalises    | -3.2     | 0.4       | [-3, -3, -3, -3, -3, -4, -4, -3, -3, -3] |
| brutalising   | -2.8     | 0.74833   | [-3, -3, -4, -3, -2, -3, -3, -3, -1, -3] |
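
The second and third columns of the excerpt can be recomputed from the raw ratings. The sketch below does this for three rows copied from the table, using the population standard deviation from Python's `statistics` module, and applies the retention rule described earlier (non-zero mean, standard deviation below 2.5). The variable names are illustrative only.

```python
from statistics import mean, pstdev

# Raw ratings copied from three rows of the excerpt above
ratings = {
    "brilliant": [2, 3, 3, 2, 3, 3, 4, 2, 3, 3],
    "broke": [-2, -2, -2, -2, -1, -2, -2, -1, -2, -2],
    "brooding": [3, 0, -1, -1, -1, 1, 1, -1, 1, -1],
}

for word, r in ratings.items():
    m, sd = mean(r), pstdev(r)
    keep = m != 0 and sd < 2.5   # retention rule described above
    print(f"{word:<10} mean={m:+.1f} sd={sd:.5f} keep={keep}")
```

A word like "brooding" shows why the standard deviation matters: its mean is close to zero because the raters disagreed, not because they all found it neutral.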

-----------------

# 8. Sentiment Analysis of Social Media

**Today, we will analyze tweets that I collected during the COVID-19 pandemic in 2022**

Before we can start analyzing the sentiment of the scraped tweets, we need to do some data cleaning and preprocessing:
- remove URLs from tweets
- remove # and @ from tweets
- remove reserved words (e.g., RT and FAV) from tweets
- remove links to images

- We will now look at some real-world data that was collected from Twitter's website.
- These data (i.e., tweets) were written by real people (whom I have no influence over!)
- **Tweets can contain explicit and offensive content:**
    - Sexuality
    - Inappropriate language
    - Racism
    - Slurs

In [None]:
# 1. Load Tweets and inspect
import pandas as pd
pd.set_option('display.max_colwidth', 20)
# Adjust the path below to wherever the file is stored (e.g., a mounted Google Drive folder)
tweets = pd.read_json('coronaUSA2022.json', lines=True)
tweets.tail()

## 8.1 Remove Undesirable Characters and Strings from Tweets

Tweets can contain a range of characters, words, and strings that do not add value to our analysis. These include:

- Reserved Words like RT and FAV
- URLs
- Pictures
- Hashtags #
- Mentions @
- HTML entities, e.g.: \&amp;

*Let's remove them before we proceed with our analysis*
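
Before applying the cleaning to a whole DataFrame, it can help to see the effect on a single string. The sketch below uses a made-up tweet and Python's built-in `re` module. The pattern list is an assumption for illustration; in particular, the generic entity pattern `&\w+;` is a shortcut that stands in for an explicit list of HTML entities.

```python
import re

# A made-up tweet for illustration (not taken from the course data)
raw = "RT @sam: Corona &amp; lime at the beach! #summer https://t.co/abc123 pic.twitter.com/xyz"

patterns = [
    r"http\S+",         # web links
    r"pic\.\S+",        # links to images
    r"\bRT\b|\bFAV\b",  # reserved words, matched as whole words only
    r"[#@]",            # hashtag and mention symbols (the word itself is kept)
    r"&\w+;",           # HTML entities such as &amp;
]
cleaned = re.sub("|".join(patterns), "", raw)
cleaned = re.sub(r"\s+", " ", cleaned).strip()   # collapse leftover whitespace
print(cleaned)
```

This prints `sam: Corona lime at the beach! summer`. Note that only the `#` and `@` symbols are dropped, so the hashtag and mention text remain available for the sentiment analysis.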


In [None]:
# 1. Import regular expressions
import re

# 2. Set up patterns to be removed from the tweets
pat1 = r"http\S+"   # web links
pat2 = r"#"         # hashtags
pat3 = r"@"         # mentions
pat4 = r"\bFAV\b"   # twitter reserved abbreviation (whole word only)
pat5 = r"\bRT\b"    # twitter reserved abbreviation (whole word only)
pat6 = r"pic\.\S+"  # twitter links to images (dot escaped so words like "picture" survive)
pat7 = r"\n"        # line breaks
pat8 = r"\r\n"      # line breaks
pat9 = r'|'.join((r'&amp;',r'&copy;',r'&reg;',r'&quot;',r'&gt;',r'&lt;',r'&nbsp;',r'&apos;',r'&cent;',r'&euro;',r'&pound;'))  # HTML entities

# 3. Combine all patterns
combined_pat = r'|'.join((pat1, pat2, pat3, pat4, pat5, pat6, pat7, pat8, pat9))

# 4. Replace the patterns with an empty string
tweets['stripped'] =  [re.sub(combined_pat, '', w) for w in tweets.content]

# 5. might have double spaces now (because of empty string replacements above) - remove double empty spaces
tweets['stripped'] = tweets.stripped.replace({' +':' '},regex=True)

# 6. Print some tweets to check if it worked
for i in range(0,10):
    print(tweets.stripped[i])
    print('\n')

## 8.2 Compound Sentiment Scores for all Tweets

In [None]:
# 1. Import the sentiment module (in case you haven't already done so)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# 2. Import numpy (in case you have not already done so)
import numpy as np

# 3. Instantiate the sentiment analyzer (in case you haven't already done so)
analyser = SentimentIntensityAnalyzer()

# 4. Now get the compound sentiment score for each tweet
tweets['C_Score'] = np.nan # initialize empty column in our tweets dataframe (empty = missing values)
for index, row in tweets.iterrows():  # loop through all tweets (i.e., rows)
    tweets.loc[index, 'C_Score'] = analyser.polarity_scores(row['stripped'])['compound']

# 5. Let's take a look!
pd.set_option('display.max_colwidth', None)
tweets[['stripped','C_Score']][5300:5310]

### 8.2.1 Some Basic Sentiment Descriptives
Let's get a first impression of the Sentiment across the tweets we scraped

In [None]:
# 1. import necessary modules (in case not already imported)
import pandas as pd
import numpy as np

print(f"Count positive tweets: {sum(tweets['C_Score'] > 0.05)}")
print(f"Count neutral tweets: {tweets['C_Score'].between(-0.05, 0.05).sum()}")
print(f"Count negative tweets: {sum(tweets['C_Score'] < -0.05)}")
print(f"Total number of tweets: {tweets['C_Score'].count()}")
print()
display(tweets.C_Score.describe())

## 8.3 X (Twitter) Sentiment EDA

Let's explore consumer sentiment some more using data visualization.   

### 8.3.1 Get a Visual Impression of the Sentiment Distribution

In [None]:
# 1. import necessary modules (in case not already imported)
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Settings for seaborn plotting style
sns.set(color_codes=True)

# 3. Settings for seaborn plot sizes
sns.set(rc={'figure.figsize':(5,5)})

# 4. Create Histogram
ax = sns.histplot(tweets['C_Score'],
                  bins=10,
                  kde=False,
                  color='skyblue')
ax.set(xlabel='Sentiment Distribution', ylabel='Frequency')

**Let's simplify our analysis** *by creating a new variable called* ***Sentiment*** that assumes the strings:
- **Positive** if the compound sentiment score (C_Score) is greater than 0.05
- **Negative** if the compound sentiment score (C_Score) is less than -0.05
- **Neutral** if the compound sentiment score (C_Score) is between -0.05 and 0.05 (including both values)

In [None]:
# 1. Create an empty column with 'object' dtype
tweets['Sentiment'] = np.nan
tweets['Sentiment'] = tweets['Sentiment'].astype(object) # setting data type

# 2. Loop through rows of dataframe and determine strings for new column "Sentiment"
for index, row in tweets.iterrows():
    if tweets.loc[index, 'C_Score'] > 0.05:
        tweets.loc[index, 'Sentiment'] = "Positive"
    elif tweets.loc[index, 'C_Score'] < -0.05:
        tweets.loc[index, 'Sentiment'] = "Negative"
    else:
        tweets.loc[index, 'Sentiment'] = "Neutral"

# 3. Typecast as categorical variable (computationally more efficient)
tweets['Sentiment'] = tweets['Sentiment'].astype("category")

In [None]:
# 4. Check that it worked
tweets[['stripped','C_Score', 'Sentiment']][5300:5310]

### 8.3.2 Visualize the Sentiment Category Shares in a Donut Chart

In [None]:
# 1. Import necessary modules (in case not already imported)
import matplotlib.pyplot as plt

# 2. Set font size
plt.rcParams['font.size']=24

# 3. Define figure
fig, ax = plt.subplots(figsize=(9, 6), subplot_kw=dict(aspect="equal"))

# 4. Get count by sentiment category from the tweets DataFrame
sentiment_counts = tweets.Sentiment.value_counts()
labels = sentiment_counts.index

# 5. Define colors per category (value_counts sorts by frequency, so look colors up by label)
color_map = {'Positive': 'lightgreen', 'Neutral': 'lightblue', 'Negative': 'red'}
wedge_colors = [color_map[label] for label in labels]

# 6. Generate graph components
wedges, texts, autotexts = ax.pie(sentiment_counts, wedgeprops=dict(width=0.5), startangle=-40,
       colors=wedge_colors, autopct='%1.0f%%', pctdistance=.75, textprops={'color':"w", 'weight':'bold'})

# 7. Plot wedges
for i, p in enumerate(wedges):
    ang = (p.theta2 - p.theta1)/2. + p.theta1
    y = np.sin(np.deg2rad(ang))
    x = np.cos(np.deg2rad(ang))
    horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
    connectionstyle = "angle,angleA=0,angleB={}".format(ang)
    ax.annotate(labels[i], xy=(x, y), xytext=(1.2*x, 1.2*y),
                horizontalalignment=horizontalalignment)
# 8. Set title
ax.set_title("Sentiment Distribution", y=.95, fontsize = 24)

# 9. Show Doughnut Chart
plt.show()

### 8.3.3 Sentiment Distribution over Time

Did the sentiment towards corona change over the course of a week?
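
The quantity behind such a chart is the share of each sentiment label per day. As a small, pandas-free sketch of that calculation, the example below uses hypothetical `(day, sentiment)` pairs in place of the tweets DataFrame; all names and values are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (day, sentiment) pairs standing in for the tweets DataFrame
records = [
    ("2022-01-01", "Positive"), ("2022-01-01", "Negative"),
    ("2022-01-01", "Negative"), ("2022-01-01", "Neutral"),
    ("2022-01-02", "Positive"), ("2022-01-02", "Positive"),
]

# Count labels separately for each day
per_day = defaultdict(Counter)
for day, sentiment in records:
    per_day[day][sentiment] += 1

shares = {}
for day in sorted(per_day):
    total = sum(per_day[day].values())
    # A Counter returns 0 for missing labels, so no NaN handling is needed
    shares[day] = {s: 100 * per_day[day][s] / total
                   for s in ("Positive", "Neutral", "Negative")}
    print(day, shares[day])
```

Each day's three shares sum to 100, which is what allows them to be drawn as one stacked bar per day.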

In [None]:
# 1. Import required package
import math

# 2. New column that holds days (sorted)
tweets['day'] = [one.date() for one in tweets['date']]
tweets = tweets.sort_values(by=['day'])

# 3. Create props (stacked bars) for sentiment grouped by day (as % shares)
sentiments = ["Positive", "Neutral", "Negative"]
positiveProps = (tweets[tweets.Sentiment == 'Positive'].groupby(['day']).count()[['Sentiment']]/ tweets.groupby(['day']).count()[['Sentiment']])*100
neutralProps = (tweets[tweets.Sentiment == 'Neutral'].groupby(['day']).count()[['Sentiment']]/ tweets.groupby(['day']).count()[['Sentiment']])*100
negativeProps = (tweets[tweets.Sentiment == 'Negative'].groupby(['day']).count()[['Sentiment']]/ tweets.groupby(['day']).count()[['Sentiment']])*100

# 4.Turn props into lists
positiveProps = positiveProps['Sentiment'].tolist()
neutralProps = neutralProps['Sentiment'].tolist()
negativeProps = negativeProps['Sentiment'].tolist()

# 5. Set-up plot
plt.figure(figsize=[24, 8])
barWidth = 0.5
labels = tweets.day.unique()
r = np.arange(len(labels))

# 6. Set values to zero if missing
positiveProps = [0 if math.isnan(x) else x for x in positiveProps]
neutralProps = [0 if math.isnan(x) else x for x in neutralProps]
negativeProps = [0 if math.isnan(x) else x for x in negativeProps]

# 7. Define appearance of bar plot
plt.bar(r,positiveProps, color='lightgreen', edgecolor='white', width=barWidth)
plt.bar(r, neutralProps, bottom=positiveProps, color='skyblue', edgecolor='white', width=barWidth)
plt.bar(r, negativeProps, bottom=[i+j for i,j in zip(positiveProps, neutralProps)], color='red', edgecolor='white', width=barWidth)

# 8. Additional plot settings and style
plt.xticks(r, labels, rotation = 45, fontsize=12)
plt.yticks(fontsize=16)
plt.suptitle('Sentiment Distribution over Time')
plt.xlabel("Date", fontsize=18)
plt.ylabel("Share", fontsize=20)
plt.legend(sentiments)
plt.show()

# 9. Sort by index again to restore the original order of tweets (since we had grouped and sorted them differently for this part)
tweets.sort_index(inplace=True)

## 8.4 Topic Search in Social Media Posts

It can be important to understand what topics are discussed on social media. However, discovering topics from thousands of tweets is not a trivial task!

- Human Approach: Have people read tweets and determine what the main topics are
- Automated Approach: Use Data Science to discover topics
    - Search for tweets that contain certain text / words / strings: **Topic Tagging**
    - Visualize Word Frequencies: **Word Clouds**
    - Use **Deep Learning** (coming attractions - stay tuned to this course!)

### 8.4.1 Topic Tagging

1. Define a list of words that are commonly associated with a topic of interest
2. Search for those words in all tweets
3. Identify those tweets that contain one or more of the "topic words"
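
The three steps above can be sketched with plain Python and the `re` module. The topic word lists, the example posts, and the helper `tag` are all hypothetical. Whole-word matching via `\b` is a design choice made here; it is stricter than a plain substring search (for example, "vaccine" would not match "vaccines").

```python
import re

# Hypothetical topic word lists and example posts, for illustration only
topics = {
    "Beer": ["beer", "lime", "beach"],
    "Virus": ["virus", "pandemic", "vaccine"],
}
posts = [
    "Corona with lime on the beach",
    "Got my vaccine today, this pandemic is exhausting",
    "Corona everywhere",
]

def tag(text, words):
    """Return 1 if any topic word occurs as a whole word in text, else 0."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)

for post in posts:
    print({name: tag(post, words) for name, words in topics.items()}, post)
```

A post can carry several tags at once or none at all, as the third example shows, so the tags are best stored as separate 0/1 columns rather than as a single category.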

In [None]:
# 1. Import required modules (in case not already imported)
import numpy as np
import re

# 2. Let's try to identify tweets that are about beer
#    Think of words that are indicative of beer, that is, that make it likely that the tweet is about Corona the beer.
#    Note: this is a case-insensitive substring search, so "drink" also matches "drinking".
tweets['Beer'] = tweets.stripped.str.contains(r'beer|drink|party|beach|lime|corona extra',
    flags = re.IGNORECASE).astype(int)

# 3. Let's try to identify tweets that are about the virus
#    Think of words that are indicative of the coronavirus, that is, that make it likely that the tweet is about corona the virus.
tweets['Virus'] = tweets.stripped.str.contains(r'virus|covid-19|death|pandemic|mask|hoax|vaccine',
    flags = re.IGNORECASE).astype(int)

# 4. How many tweets of each topic?
print(f"Total {tweets['Beer'].count()}")
print(f"Beer {tweets['Beer'].sum()}")
print(f"Virus {tweets['Virus'].sum()}")

In [None]:
# 5. Make Pandas Columns wider so we can see all the tweet texts easily
pd.set_option('display.max_colwidth', None)

# 6. Show tweets with their respective topic labels
tweets[['stripped','Beer','Virus']].head(10)

In [None]:
# 7. Show the tweets that are about beer / virus
select_tweets = tweets.loc[tweets['Beer'] == 1, 'stripped'].values[:]
for w in select_tweets[0:10]:
    print(w)
    print('\n')

### 8.4.2 Word Clouds from Tweets

For more details, see: https://www.datacamp.com/community/tutorials/wordcloud-python

In [None]:
# Install WordCloud on your local computer (already installed on Colab)
!pip install wordcloud

In [None]:
# 1. Import module
from wordcloud import WordCloud

# 2. Define what we are looking for:
#sent = 'Positive'
#sent = 'Neutral'
sent = 'Negative'

clmn = 'Beer'
#clmn = 'Virus'

# 3. How many Tweets will contribute to Cloud?
print(f"Contributing Tweets {tweets[tweets['Sentiment'] == sent][clmn].sum()}\n")


# 4. Create bag of words for tweets of certain sentiment
all_words = ' '.join([text for text in tweets[(tweets['Sentiment'] == sent) & (tweets[clmn] == 1)]['stripped']])

# 5. Generate Word Cloud
wordcloud = WordCloud(collocations=True, width=800, height=500, random_state=5, max_font_size=110).generate(all_words)

# 6. Visualize Cloud
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()