# Deep Learning with Neural Networks

# Large Language Models

> Large Language Models (LLMs) are at the intersection of NLP (Natural Language Processing) and Deep Learning.   
They can generate human-like text responses.

* **LARGE**: Training data, required computing power, number of model parameters
* **LANGUAGE**: Human-like text
* **MODEL**: Simplification of some complex phenomenon

![Source: DataCamp](https://mapXP.app/MBA742/AI_ML_DL_NLP_LLM.png "Large Language Models")


# 1. The Road to Deep Learning

## 1.1 Models

> A model is a simplification of some complex phenomenon.

![Car vs Car Model](https://mapXP.app/MBA742/model.png "Model")

### ***A model of a car is a smaller, simpler version of that car.***

## 1.2 Machine Learning

* Machine learning is the study of algorithms that allow computer programs to automatically improve through experience.
* ML creates behavior by taking in data, forming a model, and then executing the model.
* It is hard to model language (or many other phenomena) with a bunch of hand-written rules (e.g., if-else statements).
>We use algorithms that can find patterns in data to model a phenomenon

### ***If we can create a model of a car, then we might be able to create a model of language***
> We use machine learning to create smaller, simpler versions of human language

## 1.3 Neural Networks

> **Neural Networks are one way to learn a model from data**

* Idea is roughly based on how the human brain is made
  * Network of interconnected brain cells called ***neurons***
  * Neurons ***pass electrical signals*** back and forth
  * Somehow ***allowing us to do*** all the ***things*** we do.
* Frank Rosenblatt introduced the perceptron, an early Artificial Neural Network in which layers of neurons "learn" patterns, in the late 1950s.
    * Decades later, we figured out how to train networks with many layers.
* Neural networks are computationally very expensive to train.
    * Decades later, we had powerful enough hardware to train them at scale.




### 1.3.1 Self-Driving Car with Neural Networks

#### **We will use a toy example to explore how neural networks work: no math involved!**

* You want to build a self-driving car and need:
  * **Sensors**: proximity sensors on the front, back, left, and right [ report 1.0 if something is close, else 0.0]
  * **Servos**: little motors that can push the accelerator and brakes [ 1.0 max push, 0.0 no push],   
  and turn the steering wheel [-1.0 full left, 1.0 full right]
  * **Electrical wiring**: connect all components.
  * **Driving Data**: Recordings of how people drive
    * Accelerate when clear
    * Brake when there is an obstruction
    * Steer left or right to change lanes
    * Combinations of all these (in sequence and/or simultaneously)


### 1.3.2 Wiring-up the Neural Network

#### **Install your Self-Driving System**
* Need to connect sensors to servos in your car.
* Unclear how to wire sensors and servos exactly.
> Connect ***every sensor with every servo***

![NN for self driving car](https://mapXP.app/MBA742/MRiedl_self-drive1.jpg "Mark Riedl Self Driving Car Example")

#### **Take car for a drive**
* Same electricity flows from all sensors to all servos
* Brake, accelerate, and steer equally all at once
> CRASH!


![NN for self driving car](https://mapXP.app/MBA742/MRiedl_self-drive2.gif "Mark Riedl Self Driving Car Example")


#### **How do we fix it?**

> **Fix 1**: Need electrical signal to ***flow more freely between certain sensors and certain servos*** than others.  
  * We want electrical signal to flow more freely from the front proximity sensors to the brakes and not to the steering wheel.

> **Fix 2**: Need to ***control signal strength*** to each servo ***conditional on the signal strength of the sensor***.
  * We want to send a stronger signal to the accelerator when the signal from the proximity sensor is low.  

> **Fix 3**: May need to ***combine the signals from multiple sensors*** and ***process these some more*** to better address the right servos.  
  * We want the steering servo to turn left and the brake to engage when the front and right proximity sensors both send a signal.  

#### ***That's a lot of moving parts to worry about (no pun intended)!!!***


#### **Neural Networks to the Rescue!**
Add layers between the sensors and servos...

#### **Set-up**
* Create an Input Layer of ***4 NEURONS***: Each sensor gets hooked up to its own *Neuron*
* Create an Output Layer of ***3 NEURONS***: Each servo gets hooked up to its own *Neuron*
* Connect ***all NEURONS*** (with "wires")

#### **Implement Fix 1**
***Allow for different signal strengths between *Neurons****
  > Introduce ***WEIGHTS*** between connected *Neurons*:   
  * more weight = stronger signal passed forward
  * less weight = weaker signal passed forward

#### **Implement Fix 2**
***Modify signal strength passed to next Neuron based on the received signal***
  > Add ***GATES*** "between" *Neurons*  
  * The "rules" of how a gate operates are called its ***ACTIVATION FUNCTION***
  * An ***ACTIVATION FUNCTION*** is a mathematical function applied to the output signal of a *Neuron* before it is passed forward to the next *Neuron(s)*.   
      * *Example*: the [sigmoid function](https://en.wikipedia.org/wiki/Sigmoid_function) (which squashes outputs to a range between 0 and 1).
      * Note: An *activation function* can reduce the signal to zero, effectively "switching off" the *Neuron* in the network
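  * We can adjust these "rules" (*activation functions*) with what we call ***BIASES*** in Neural Networks: a bias is a number added to a *Neuron's* incoming signal that shifts the point at which its gate opens.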
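
The weights, bias, and sigmoid gate described above can be sketched for a single "brake" neuron. This is a minimal illustration under stated assumptions: the sensor order `[front, back, left, right]`, the weight values, and the bias are all made up for the example, not taken from a trained network.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid gate."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

# Hypothetical weights for the brake neuron: listen mostly to the front sensor
brake_weights = [6.0, 0.0, 0.5, 0.5]   # order: front, back, left, right
brake_bias = -3.0                      # keeps the brake off when no sensor fires

brake_signal = neuron([1.0, 0.0, 0.0, 0.0], brake_weights, brake_bias)  # obstacle ahead
idle_signal = neuron([0.0, 0.0, 0.0, 0.0], brake_weights, brake_bias)   # road is clear

print(round(brake_signal, 3), round(idle_signal, 3))
```

With these made-up numbers the brake signal is close to 1 when the front sensor fires and close to 0 when nothing is nearby; changing the bias moves the point at which the gate starts to open.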

#### **Implement Fix 3**
***Combine signals and process further for more nuanced driving***
  > Insert ***HIDDEN LAYERS*** of additional *NEURONS* between Input and Output Layer:   
  * Capture Non-Linearity
  * Enable Feature Abstraction and Hierarchical Learning
  * Increase Flexibility and Functionality (more nuanced patterns)


![NN for self driving car](https://mapXP.app/MBA742/self-drive3.jpg "Adapted from Mark Riedl Self Driving Car Example")

### 1.3.3 Learning to Self-Drive

**So far we**:
1. Set-up our Neural Network based on inputs (sensors) and outputs (servos)
2. Decided on the number of hidden layers (here, 1) and number of *Neurons* in each hidden layer (here, 5).
3. Defined *Activation Functions*
   * Typically same for all hidden layers.
   * For output layer often different to match objective of the Neural Network.

These are commonly called the model's ***HYPERPARAMETERS***: decisions made by the model creator before training that affect the model's internal performance but that are not evident from the output.

**Signals from sensors are now passed forward through our Neural Network**
* We call this ***FEEDFORWARD PROPAGATION***
* It's a one-way street
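
A feedforward pass through the 4-5-3 network from the figure can be sketched as follows. The parameter values are untrained placeholders, and the choice of sigmoid for the pedals (range 0 to 1) and `tanh` for steering (range -1 to 1) is an assumption made to match the servo ranges given earlier.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases, activation):
    """One fully connected layer: every input feeds every neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def feedforward(sensors, params):
    """Pass sensor signals forward: input -> hidden -> output (one-way street)."""
    hidden = layer(sensors, params["w_hidden"], params["b_hidden"], sigmoid)
    accel, brake = layer(hidden, params["w_out"][:2], params["b_out"][:2], sigmoid)
    (steer,) = layer(hidden, params["w_out"][2:], params["b_out"][2:], math.tanh)
    return {"accelerate": accel, "brake": brake, "steer": steer}

# Placeholder (untrained) parameters: 4 sensors -> 5 hidden neurons -> 3 servos
params = {
    "w_hidden": [[0.1 * (i + j) for j in range(4)] for i in range(5)],
    "b_hidden": [0.0] * 5,
    "w_out": [[0.2] * 5, [-0.2] * 5, [0.0] * 5],
    "b_out": [0.0, 0.0, 0.0],
}

servos = feedforward([1.0, 0.0, 0.0, 1.0], params)   # front and right sensors fire

# Number of learnable values: weights plus biases of both layers
n_params = 4 * 5 + 5 + 5 * 3 + 3
print(servos, n_params)
```

Because nothing has been learned yet, the servo values are arbitrary; training would have to adjust the 43 weights and biases counted in `n_params`.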

**We can change how signals are passed forward by adjusting**:
* Weights (of Fix 1)
* Biases (of Fix 2).  

These are typically referred to as the model's **PARAMETERS** that need to be learned.

**We learn weights and biases**
* From our collected driving data
* By adjusting them such that the servos are operated (correctly) based on the sensor inputs
* Using what is called ***BACKPROPAGATION***
>  Backpropagation is a training algorithm used to optimize the weights and biases by minimizing the difference between the network's predicted output and the actual target values.
(Parameters are adjusted by the model; hyperparameters by the designer.)
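
The learning idea can be sketched on the smallest possible case: one neuron that learns to brake when the front sensor fires. This is an illustration only; the two data points, the squared-error loss, and the learning rate are assumptions. With a single neuron the chain rule has just one step, whereas backpropagation in a multi-layer network applies the same rule layer by layer, from the output back to the input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical driving data: (front sensor reading, how hard the driver braked)
driving_data = [(0.0, 0.0), (1.0, 1.0)]

def mean_loss(w, b):
    """Average squared difference between predicted and recorded braking."""
    errors = [(sigmoid(w * x + b) - y) ** 2 for x, y in driving_data]
    return sum(errors) / len(errors)

w, b = 0.0, 0.0          # parameters start out uninformed
learning_rate = 1.0      # hyperparameter chosen by the designer
initial_loss = mean_loss(w, b)

for step in range(2000):
    grad_w = grad_b = 0.0
    for x, y in driving_data:
        p = sigmoid(w * x + b)              # forward pass
        dz = 2 * (p - y) * p * (1 - p)      # chain rule: d(loss) / d(weighted sum)
        grad_w += dz * x / len(driving_data)
        grad_b += dz / len(driving_data)
    w -= learning_rate * grad_w             # gradient descent update
    b -= learning_rate * grad_b

final_loss = mean_loss(w, b)
print(round(initial_loss, 3), round(final_loss, 5))
```

After training, the neuron outputs a value near 1 when the front sensor reads 1.0 and a value near 0 when it reads 0.0, and the loss is far below its starting value of 0.25.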



![NN for self driving car](https://mapXP.app/MBA742/self-drive4a.gif "Adapted from Mark Riedl Self Driving Car Example")

## 1.4 Deep Learning

Deep Learning is an **advanced subset of machine learning** that involves
* the use of *Neural Networks*
* with ***multiple*** *Hidden Layers*

> For our self-driving car example, deep learning is analogous to ***significantly increasing the complexity and depth*** of the circuitry.

**Deep Neural Networks** can use
* more varied non-linear transformations (activation functions),
* convolutional filters (in CNNs),
* recurrent connections (in RNNs),

and more, to process and learn from data.


**Deep learning models typically learn from data by**
* adjusting the network's parameters (weights and biases),
* incrementally using backpropagation and gradient descent (or its variants),
* similar to shallow neural networks,
* but with added complexity of adjustments across many more layers.


# 2. Language Models

We use a ***Neural Network*** to manipulate the acceleration, braking, and steering of a self-driving car ***the same way humans*** did in the recorded driving data.

> **Let's treat language the same way!**

**Task:** Build a Neural Network
  * based on text written by humans,
  * that outputs a sequence of words,
  * which looks like word sequences produced by humans.

***From Car to Language:***
* Sensors (Input) ---> Words (tokens)
* Servos (Output) ---> Words (tokens)

> ***Given a bunch of input words (tokens), predict the right output word (token)***

> ### "Once upon a ____ "  
>* *time* or
>* *goat* [?](https://youtu.be/Ud_lhpOqZzY?si=s_Fm_mZWa68KGvWu)

>***Which word (token) is more likely?***
>
>#### $P(\text{time} \,|\, \text{once}, \text{upon}, \text{a})$.
>
>*which generalizes to*
>
>#### $P(\text{word}_n \,|\, \text{word}_1, \text{word}_2, \ldots, \text{word}_{n-1})$
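
One simple way to estimate such a conditional probability is to count what follows the context in a body of text. The sketch below does this on a tiny made-up corpus; it is a counting baseline for illustration, not the neural network approach developed in the following sections.

```python
from collections import Counter

# Tiny made-up corpus; real models learn from billions of words
corpus = [
    "once upon a time there was a king",
    "once upon a time there was a queen",
    "once upon a time a goat sat on a hill",
    "once upon a goat there sat a bird",
]

def next_word_probabilities(context, sentences):
    """Estimate P(next word | context) by counting what follows the context."""
    n = len(context)
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n):
            if words[i:i + n] == context:
                counts[words[i + n]] += 1
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

probs = next_word_probabilities(["once", "upon", "a"], corpus)
print(probs)
```

In this corpus "time" follows the context three times and "goat" once, so "time" comes out as the more likely continuation.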

## 2.1 Building a Language Neural Network


> ### Imagine an old-fashioned typewriter with keys and striker arms:
![Words Typewriter](https://mapXP.app/MBA742/tyepwriter_words.jpg "Words Typewriter")

#### **HOWEVER**: Instead of a **Key** and **Striker** for ***each Letter***, we have them for ***each Word*** (w)

- ***WORDS*** (or parts of them) can also be referred to as ***TOKENS***
- Let's assume the English language had 50,000 words (or tokens): the vocabulary
> That's a HUGE Typewriter!

#### Neural Network Set-up
* Typewriter **Keys** ==> **Input Layer** with 50,000 Neurons; one for each word (token)
* Typewriter **Strikers** ==> **Output Layer** with 50,000 Neurons; one for each word (token)
* All Keys to all Strikers ==> **Wiring**

#### Task
* Pick the correct Striker given the words punched into the Keys.

> For a typed phrase that triggers Neurons of the Input Layer, send strongest signal to the Neuron on the Output Layer that corresponds to the word to complete the phrase.

#### Challenge
* Even a simple language model that takes in a single word to predict a single word requires:
  * 50,000 Neurons on the Input Layer
  * 50,000 Neurons on the Output Layer
  * 50,000 x 50,000 = **2.5 billion wires between layers**

![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp1.gif "credit: Mark Riedl")

#### **That is HUGE!**
*More bad news...*

* To fill the missing word (blank) in "Once upon a ___" we need to consider all three words in the input:
  * 50,000 x 3 = 150,000 Neurons on Input Layer
  * 50,000 Neurons on Output Layer
  * 150,000 x 50,000 = **7.5 billion wires between layers**
![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp2.gif "credit: Mark Riedl")

> Many LLMs can take in thousands of tokens (words) at once (early versions of ChatGPT accepted roughly 4,000), not just 3 as illustrated above. ***Clearly, they must have found a better solution than our Neural Network set-up.***
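
The wire counts above follow from simple multiplication, which makes it easy to see how quickly the naive set-up grows with the number of input words. A small sketch, assuming the 50,000-word vocabulary from the text (the function name is just for illustration):

```python
VOCAB_SIZE = 50_000  # vocabulary size assumed in the text

def wires_needed(context_words, vocab_size=VOCAB_SIZE):
    """Wires when every input neuron connects directly to every output neuron."""
    input_neurons = context_words * vocab_size   # one block of keys per input word
    output_neurons = vocab_size                  # one striker per word
    return input_neurons * output_neurons

for context_words in (1, 3, 4_000):
    print(f"{context_words:>5} input words -> {wires_needed(context_words):,} wires")
```

At 4,000 input words the direct wiring would need 10 trillion wires, which is why a more economical design is required.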

#### We will break up the problem into two parts (or circuits in our Neural Network):
* **Encoder** part (capture Input)
* **Decoder** part (generate Output)

## 2.1 Encoders in Language Models

#### Consider the following sentences that are each missing a word:
1. The king sat on the ___
2. The queen sat on the ___
3. The princess sat on the ___
4. The regent sat on the ___

> My guess is as good as yours:
 * Maybe ***throne***?

#### Do I really need separate wires between royals and throne?
* king and throne
* queen and throne
* princess and throne
* regent and throne

#### **Idea**: Use an intermediate "*concept*" that approximately means "*royalty*"
* Every time we see king or queen (etc.) we use this intermediate "concept"
* Enough to know what to do when this intermediate "concept" appears:  
  ==> **Send a strong signal to "throne"** *(i.e., its associated Neuron)*.


#### Let's **Operationalize** the Idea in a Neural Network
>* Set-up Input Layer with 50,000 Neurons (one for each word/token)
>* Set-up an intermediate (hidden) layer with only 256 Neurons
    
>* No longer try to trigger just one Neuron (single striker in former output layer of 50,000 Neurons)
>* We mash things up by triggering multiple Neurons on the intermediate layer of 256 Neurons

>* Each possible combination of triggered Neurons could represent a different "concept" (like royalty or Indian food or hoofed mammals).
>* 256 Neurons allow us to represent $2^{256} \approx 1.16 \times 10^{77}$ "concepts".
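
Python integers have arbitrary precision, so the number of on/off patterns for 256 Neurons can be checked exactly. A quick sketch:

```python
# Each of the 256 neurons is either triggered or not: 2 choices, 256 times
concepts = 2 ** 256

print(concepts)               # the exact integer
print(f"{concepts:.2e}")      # the same number in scientific notation
print(len(str(concepts)))     # how many decimal digits it has
```

The exact value has 78 digits, which corresponds to roughly $1.16 \times 10^{77}$.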

![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp3.gif "credit: Mark Riedl")
  
* We can capture even more "concepts" when we consider the "intensity" by which each Neuron gets triggered
  * In my driving experience, alternating between flooring it and keeping the foot entirely off the accelerator or brakes (known as [bang-bang control](https://en.wikipedia.org/wiki/Bang%E2%80%93bang_control)) is not very wise.
  * Instead of binary values (0 or 1), we can use decimals (floats) from -1 to 1 (e.g., 0.4) for triggering each of the 256 Neurons
  * When the model gets a word (token) as input, it triggers 256 Neurons, each with a different "intensity".

#### Summing up
* Before, 1 word (token) required 1 of 50,000 Neurons (strikers) to be triggered (rest remains passive).
* Now, 1 triggered Neuron (striker) and 49,999 passive Neurons are reduced to 256 numbers (of varying magnitude) that capture a "concept"
  * For the word "king" it might be   [-0.2, **0.3**, ..., 0.0, 0.6]
  * For the word "queen" it might be [-0.2, **0.4**, ..., 0.0, 0.6].
  
> We call these *vectors* of 256 numbers ***Embeddings***.  
> Key trick: create them by exposing some of the ***hidden state*** inside a neural network.
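
The step from "one triggered key" to "a short vector of numbers" can be sketched with a toy vocabulary. Everything here is hypothetical: five words instead of 50,000, three hidden Neurons instead of 256, and hand-picked weights instead of learned ones.

```python
vocab = ["king", "queen", "princess", "goat", "throne"]
EMBEDDING_SIZE = 3   # the text uses 256

# Hypothetical encoder weights: one row of EMBEDDING_SIZE numbers per word
encoder_weights = [
    [-0.2, 0.3, 0.6],    # king
    [-0.2, 0.4, 0.6],    # queen
    [-0.1, 0.4, 0.5],    # princess
    [0.7, -0.5, -0.1],   # goat
    [0.0, 0.9, 0.2],     # throne
]

def one_hot(word):
    """One triggered input neuron (1.0) and all others passive (0.0)."""
    return [1.0 if w == word else 0.0 for w in vocab]

def encode(word):
    """Multiply the one-hot input by the weights: input layer -> hidden layer."""
    x = one_hot(word)
    return [
        sum(x[i] * encoder_weights[i][j] for i in range(len(vocab)))
        for j in range(EMBEDDING_SIZE)
    ]

print(one_hot("king"))
print(encode("king"))
print(encode("queen"))
```

Because only one input is active, the multiplication simply picks out that word's row of weights. In this sense the embedding is the hidden state that the input word produces.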

---
#### We call the neural network that compresses the 50,000 Neurons (typewriter keys, each representing a word) into 256 Neurons an **Encoder**.
---

## 2.2 Decoders in Language Models

The **Encoder** stops with the **Embeddings** that capture *latent* "concepts" from the input.   
> *Latent* because we do not directly create or interpret them: Embeddings are vectors of numbers generated by the Neural Network.  

The **Encoder** reduces words to latent concepts, but does not predict which word most likely comes next.

> We need something that predicts words based on the states (values for each) of the 256 Neurons in that intermediate layer.

The **Decoder** does exactly that:
* Uses the ***Embedding*** generated by the ***Encoder*** (vector of 256 values)  
* to ***trigger the original 50,000 Neurons*** of the output layer (strikers of the typewriter, one for each word/token),  
* allowing us to ***pick the word*** (token) associated with the Neuron in the output layer (striker) ***with the strongest signal***.
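
A decoder of this kind can be sketched as one weighted sum per output word followed by picking the strongest signal. The vocabulary, weights, and embedding below are made up for illustration. Turning the signals into probabilities with a softmax is a common choice that the text does not spell out, so treat it as an assumption.

```python
import math

vocab = ["king", "queen", "goat", "throne"]

# Hypothetical decoder weights: one row per output word, one column per embedding value
decoder_weights = [
    [0.1, -0.3, 0.2],    # king
    [0.0, -0.2, 0.3],    # queen
    [-0.8, 0.1, -0.4],   # goat
    [0.5, 1.2, 0.9],     # throne
]

def softmax(scores):
    """Turn raw signal strengths into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode(embedding):
    """Send the embedding to every output neuron and pick the strongest one."""
    scores = [sum(w * e for w, e in zip(row, embedding)) for row in decoder_weights]
    probs = softmax(scores)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

royalty_embedding = [-0.2, 0.3, 0.6]   # made-up "royalty" concept vector
word, probs = decode(royalty_embedding)
print(word, [round(p, 3) for p in probs])
```

With these hand-picked numbers the "royalty" embedding sends its strongest signal to "throne".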


![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp4.gif "credit: Mark Riedl")


## 2.3 Language Model with Encoder and Decoder

* We can ***connect*** Encoder and Decoder through that intermediate (hidden) layer of 256 Neurons
* One big Neural Network
  * receives word (token) as input,
  * encodes 50,000 inputs in (hidden) layer of 256 Neurons (vector of 256 numbers = embedding) that capture latent "concepts"
  * decodes "concepts" (hidden states of 256 Neurons = values = embedding) to output layer of 50,000 Neurons
* Use output layer to predict word (token): that with strongest signal

![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp5.jpg "credit: Mark Riedl")


---
**Previously** required 50,000 x 50,000 = **2.5 billion** parameters (wires)

**Now** only require 2 x (50,000 x 256) = **25.6 million** parameters (wires)

---
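
The saving from the 256-Neuron bottleneck can be verified with a few lines of arithmetic. The sketch counts only wires (weights) and ignores biases, as the text does.

```python
VOCAB_SIZE = 50_000
EMBEDDING_SIZE = 256

direct_wires = VOCAB_SIZE * VOCAB_SIZE          # every key wired to every striker
encoder_wires = VOCAB_SIZE * EMBEDDING_SIZE     # keys -> hidden layer
decoder_wires = EMBEDDING_SIZE * VOCAB_SIZE     # hidden layer -> strikers
bottleneck_wires = encoder_wires + decoder_wires

print(f"direct:     {direct_wires:,}")
print(f"bottleneck: {bottleneck_wires:,}")
print(f"reduction:  {direct_wires / bottleneck_wires:.1f}x")
```

Routing everything through the small hidden layer cuts the number of wires by a factor of almost 100.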

#### How does the Encoder-Decoder Language model know:
* the signal strengths in the output layer
* conditional on the input word (token)
* so we can get $P \ (\text{W}_{out} \,|\, \text{W}_{in})$ ?

#### Need to ***adjust weights and biases*** of the network (encoder and decoder): Train the encoder-decoder neural network on text.

## 2.4 Training Language Models via Self-Supervised Learning

#### **How to find the right embedding** (vector of 256 values) for every word (token) in our vocabulary (of 50,000 words) such that:
* embedding for "king" is similar to "queen",
* but quite different from "goat"

#### Let's **start with a simple problem**
* Encoder-Decoder neural network
* accept a single word (token) as input
* to produce the exact same word (token) as output

![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp6.gif "credit: Mark Riedl")

1. Send the word "king" into the neural network
2. Encoder creates embedding vector of 256 values in the middle
3. Decoder sends strongest signal to same word "king" based on the embedding

> **No guarantee that "king" gets strongest signal**
> * Maybe "goat" gets a stronger signal (or probability) than "king".
> * We actually don't care about the signal to "goat"
> * Look at signal to "king" and see that it is not max (i.e., probability of 1.0)
> * **How big is the error?** |Expected signal strength - actual signal strength| = ***Loss***
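
The loss for a single training word can be computed directly from the output signals. The signal values below are invented for illustration. The absolute difference follows the definition in the text; the cross-entropy line shows a loss that is commonly used for this kind of prediction in practice and is included only for comparison.

```python
import math

# Hypothetical output-layer signals (as probabilities) after feeding in "king"
signals = {"king": 0.30, "queen": 0.25, "goat": 0.35, "throne": 0.10}

expected = 1.0                                    # ideal signal for the correct word
absolute_loss = abs(expected - signals["king"])   # |expected - actual| from the text
cross_entropy = -math.log(signals["king"])        # common alternative in practice

print(round(absolute_loss, 2), round(cross_entropy, 3))
```

Both numbers shrink toward 0 as the signal for "king" approaches 1.0, which is the direction training pushes the weights and biases.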

#### Need to minimize ***Loss*** through training
* Use ***BACKPROPAGATION***
>  Backpropagation is a training algorithm used to optimize the weights and biases by minimizing the difference between the network's predicted output and the actual target values.
* Consider all words (tokens) of vocabulary during training (i.e., minimize loss across all 50,000 words)
  * Encoder ***must compromise*** because 256 neuron hidden layer is much smaller than vocabulary
  * Some words may end up with very similar (or even same) ***embeddings***

#### Good enough outcomes
* Embeddings of "king" and "queen" become very similar
* Embeddings of "king" and "queen" become different from "goat"

> Gives **Decoder** *a better chance to send the strongest signal to the correct word* (its associated Neuron) based on the 256 dimensional embedding.

> We accept that "king" gets a signal with strength 0.62 and "queen" with 0.61
>  * as long as all other 49,998 words (their Neurons) get a weaker signal.
* In other words, we are probably going to be okay when our Language Model confuses kings and queens as long as it doesn't get confused between kings and goats.

#### Self-supervision via same text

> Our Language Model is **self-supervised** because it ***does not require separate outcome data*** for checking its output (unlike the car example, which needed recorded driver actions as targets).  

> **For training**, all it needs to do is **compare its output to its input**, *which is the same text (i.e., word sequence)*.


# 3. Masked Language Models (MLMs)

#### **Basic Idea**:
* Take in a sequence of words (tokens) and generate a sequence of words (tokens).
* One or more of the words are randomly blanked out, that is, [MASKED].
* Predict the original [MASKED] word (token) in the output sequence, *based solely on its context* (i.e., the other words in the sequence)
* Training involves adjusting weights and biases of the neural network (using backpropagation) to minimize loss in ***masked token prediction***
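
Creating a masked training example from plain text takes only a few lines. In this sketch the masked position is passed in explicitly so the result is reproducible; during training it would be chosen at random. The function name and the `[MASK]` placeholder string are illustrative choices.

```python
def mask_token(tokens, position, mask="[MASK]"):
    """Return (masked input, target word) for one training example."""
    masked = list(tokens)          # copy so the original sentence is untouched
    target = masked[position]      # the word the model must recover
    masked[position] = mask
    return masked, target

sentence = "The king sat on the throne".split()
masked_input, target = mask_token(sentence, 1)

print(masked_input, "->", target)
```

The target comes from the text itself, so no separate labels are needed.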

> Approach is widely used for ***pre-training*** language models
> * Model is trained on a very large corpus of general text (computationally very expensive)
> * Pre-trained model can then be adapted to different tasks without needing to be re-trained from scratch (we call this ***fine-tuning***)
> * Fine-tuning makes a few updates to the pre-trained model to specialize it for a task or domain (more about this in the upcoming lecture on Fine-Tuning LLMs).
> * ***BERT*** (Bidirectional Encoder Representations from Transformers) is one of the most well-known examples of a pre-trained MLM.



![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp7.jpg "credit: Mark Riedl")



## 3.1 Generative MLM

* Special case of a masked language model that always masks the last word (token) of a sequence.
* Called ***generative model*** because it is trained to predict (or generate) only the next word (token).
* This is different from bidirectional models, like BERT, which can predict a word (token) at any position in a word sequence.

>#### The [MASK]
>#### The king [MASK]
>#### The king sat [MASK]
>#### The king sat on [MASK]
>#### The king sat on the [MASK]
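
The sequence of masked prefixes shown above can be generated mechanically from any sentence. A minimal sketch, with a hypothetical helper name:

```python
def next_word_examples(tokens, mask="[MASK]"):
    """Build one training example per prefix by masking the word that follows it."""
    examples = []
    for end in range(1, len(tokens)):
        prefix = tokens[:end] + [mask]
        examples.append((" ".join(prefix), tokens[end]))
    return examples

examples = next_word_examples("The king sat on the throne".split())
for masked_prefix, target in examples:
    print(f"{masked_prefix!r} -> {target!r}")
```

A six-word sentence yields five training examples, each asking the model to predict the next word from everything before it.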

* Also referred to as an ***auto-regressive model*** (in the research literature, often a *causal language model*)
  * auto = self
  * ***self-predictive model***

#### **Generative process**:
1. model predicts the next word from the previous word sequence
2. the predicted word is added to end of sequence
3. the subsequent word is predicted from updated sequence, and so on.
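
The generative loop can be sketched with a stand-in for the trained model. Here the "model" is a hypothetical lookup table that only looks at the last word; a real language model would condition on the whole sequence and return probabilities over the vocabulary.

```python
# Hypothetical stand-in for a trained model: most likely next word given the last word
NEXT_WORD = {"The": "king", "king": "sat", "sat": "on", "on": "the", "the": "throne"}

def predict_next(sequence):
    """Toy predictor; returns None when it knows no continuation."""
    return NEXT_WORD.get(sequence[-1])

def generate(prompt, max_new_words=10):
    sequence = list(prompt)
    for _ in range(max_new_words):
        word = predict_next(sequence)   # 1. predict the next word
        if word is None:                # nothing left to predict: stop
            break
        sequence.append(word)           # 2. add it to the end of the sequence
    return sequence                     # 3. each loop pass used the updated sequence

story = generate(["The"])
print(" ".join(story))
```

Each predicted word becomes part of the input for the next prediction, which is what makes the process auto-regressive.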

> #### GPT (***Generative Pre-trained Transformer***) is a generative MLM.



## 3.2 Transformer Models

> #### A **transformer** is a deep learning model that transforms the encoding in a particular way to make predicting the [MASKED] word easier.

* Introduced in the paper "[Attention is All You Need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)" (Vaswani et al., 2017) by Google researchers.
* A deep learning model for NLP that revolutionized how sequences are processed.

#### **Core Architecture:**
* Based on an ***encoder-decoder*** architecture.
* Enhances encoding with ***self-attention***, *allowing each input part to relate to selected others.*

#### **Decoder:**

* Generates output using encoded data and self-attention, ***focusing on relevant input parts for each output step***.

#### **Significance:**

* ***Parallel Processing***: Unlike previous language models (e.g., recurrent neural networks (RNNs) and their variants), which read a sequence one word at a time, Transformers process entire sequences simultaneously, making them more efficient and suitable for parallel computation.
* ***Context-Awareness***: Transformers capture deeper understanding of the context and relationships within the data, significantly improving performance on tasks like translation, question-answering, and text summarization.

![Transformer Architecture](https://mapXP.app/BUS488/insideTransformers "credit: Daniel Ringel, 2024")

## 3.3 Self-Attention

> #### Self-attention is a mechanism within the Transformer model that enables each word in a sentence to dynamically influence and be influenced by other words, capturing their interdependencies regardless of their positions.

![NN Language Typewriter](https://mapXP.app/MBA742/MRiedl_LangTyp8.jpg "credit: Mark Riedl")

#### **The Idea of Self-Attention**:
Imagine the sentence:  

      The alien landed on earth because it needed to hide on a planet.

* If some words were missing, you'd guess them based on the other words.
* Self-attention works similarly: It helps the model figure out how words in a sentence relate to each other (to "understand" the full context).

#### **How It Works**:
* If *alien* were missing, the model ***uses clues from related words*** like *landed* and *earth* to guess it.
* Similarly, if *it* were missing, knowing that *alien* is previously mentioned helps the model decide that *it* likely refers to *alien*, not *earth* or *planet*.
* This process allows the model to "understand" which words are important to each other in a sentence.

#### **Contextualized Word Embeddings**:
* For each word, the model creates a special, new blend of numbers (i.e., a contextualized embedding)
  * captures not just the word itself,
  * but also its connection to other words.  

  ***Example:*** By combining the numeric representations (word embeddings) of “alien”, “landed”, and “earth”,   
  the model creates a new, enriched (contextualized) embedding.
* New (contextualized) embedding doesn't directly match any single word.
* New (contextualized) embedding carries pieces of all related words, giving a fuller picture of the sentence.

#### **Why It Matters**:
* Enriched understanding helps the model make better predictions (especially when filling in missing words).
* The model sees not just the words themselves, but also how they fit together in the story.
* When the model encounters a blank (like in a fill-in-the-blank question), it uses the whole sentence's context, not just guesswork.

#### **Practical Takeaway**:
* Think of self-attention as the model's way of reading between the lines
   * using the context provided by all words
   * to understand each word's role and meaning better.
* This is crucial for tasks like translating languages or answering questions where understanding context and nuances is key.

### 3.3.1 Calculating Self-Attention with Queries, Keys and Values
*without the math*

**Step 1**: Imagine every word in a sentence gets
* a special set of ***glasses*** (called queries),
* a name ***tag*** (called keys),
* and a ***backpack*** (called values).

**Step 2**:
* Each word looks at every other word through its ***glasses***.
* How clearly it sees another word's ***name tag*** tells it how important that word is to it.

**Step 3**:
* Based on importance, each word takes a little from the others' ***backpacks***
* The more important a word is, the more it takes.

**Step 4**:
* Every word ends up with a mix of stuff from all the other ***backpacks***,
* weighted by importance (how clearly a word sees other words' ***name tags*** through its ***glasses***).
* This mix is the new, improved version of the word, considering the whole sentence.

---

* **Queries** are like glasses that help see how relevant other words are.
* **Keys** are like name tags that show a word's identity.
* **Values** are like backpacks carrying the word's content.
  > The mix each word gets is a smarter version of itself, now informed by the whole sentence.

---
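
For readers who do want to see the mechanics, the glasses, name tags, and backpacks can be sketched in a few lines. The vectors below are made up and only two-dimensional; a real model learns its queries, keys, and values from data. Scaling the scores by the square root of the key size and normalizing them with a softmax follows the scaled dot-product attention of Vaswani et al. (2017). Note that in this sketch each word also looks at its own name tag.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product self-attention for one short sentence."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:                                # each word puts on its glasses
        scores = [dot(q, k) / scale for k in keys]   # how clearly it sees each name tag
        weights = softmax(scores)                    # importance, summing to 1
        mixed = [                                    # take from each backpack by weight
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

words = ["alien", "landed", "it"]
# Made-up vectors for illustration only
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
keys    = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values  = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

contextualized, weights = self_attention(queries, keys, values)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

With these numbers, "it" gives its largest attention weight to "alien", and its contextualized vector is a blend of all three backpacks.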

#### **Dot Product Attention**:
* The self-attention mechanism's ability to assess the importance of words relative to each other is often referred to as ***dot product attention***.
* This name is derived from the specific mathematical operation it employs:
  * At its core, the model calculates the ***dot product*** between the query vector of one word and the key vector of every word in the sequence.
  * The ***dot product*** multiplies corresponding values of two vectors and sums to produce a number that serves as a measure of similarity or relevance.
  * The reason it's called ***dot product attention*** is because this operation directly influences how much focus (or attention) is given to words in the sequence.
  * A ***higher dot product indicates greater relevance or similarity***, guiding the model to pay more attention to certain words when constructing the contextualized embedding of each word.

#### **Implications of Dot Product Attention**:
  * Allows for a dynamically weighted representation of the sentence,
  * where each word's influence on another
  * is quantified by their dot product.

> Dot Product Attention is a crucial aspect that enables the Transformer to understand and represent the nuanced interplay of words within a sentence.

**NOTE**: The dot product is part of the calculation for cosine similarity. Cosine similarity normalizes the dot product by the magnitudes of the vectors, focusing solely on the directionality (angle) rather than the magnitude. (*Ok, there is math, but I did it without the calculation.*)
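
The difference between the two measures is easy to see on a small example. The vectors are arbitrary and chosen so that all of them point in the same direction.

```python
import math

def dot(a, b):
    """Multiply corresponding values and sum them up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the lengths (magnitudes) of both vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 2.0]
key = [2.0, 4.0]           # same direction as the query, twice the length
longer_key = [4.0, 8.0]    # same direction again, four times the length

print(dot(query, key), dot(query, longer_key))
print(cosine_similarity(query, key), cosine_similarity(query, longer_key))
```

The dot product doubles when the key doubles in length, while the cosine similarity stays at (approximately) 1 because only the angle between the vectors matters.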