# Predictive Modelling I: Classification & K-Nearest Neighbors (KNN)

# 1. Roadmap from SciKit-Learn
![roadmap](https://scikit-learn.org/1.3/_static/ml_map.png)
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html  
For an overview of supervised learning methods in the scikit-learn library: https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html


# 2. Dr. D's Amazin' Grocery Store

Dr. D. launched his Amazin’ Grocery Store and immediately started to collect data on his customers (using a loyalty program where they only need to supply their phone number at check-out). He was also able to segment his existing customer base, which enables him to target them much more effectively with promotions that they might be interested in (using the appropriate medium).

The segmentation of his customers is a manual process at this time. Dr. D. has to hire experts from a renowned consulting firm (Accidenture) every couple of years for USD 3500 per consultant per day plus travel expenses. Usually, the team of 2 consultants finishes their work within a week. Accidenture discovered 3 customer segments and classified all existing customers for Dr. D. in his customer database.

Clearly, Dr. D. would rather go on a (very!) nice vacation than pay that much money to Accidenture for the segmentation of his customers. He therefore decided that he should use the power of data science and AI to automatically assign **new customers** to the appropriate segment (after some data was collected on them).

He put **YOU** in charge of the task and provided you with a dataset of customer records. The dataset contains the following information:

- CustomerID
- Name of Customer
- Nickname of Customer
- Average monthly spending in USD (i.e., revenue)
- Average number of shopping trips to his grocery store
- Average basket size per shopping trip (i.e., number of SKUs purchased on a trip)
- The share of private label products bought by the customer
- The share of organic products bought by the customer
- Whether a customer actively uses the store’s own credit card or not
- The segment each customer was assigned to by Accidenture

**Let's see if we can save Dr. D. thousands of dollars by automating the assignment of new customers to the three customer segments using Machine Learning techniques!**

## 2.1 Data Pre-Processing

Before we can start, we need to:
- Import Dr. D's Dataset (a csv file)
- Inspect it
- Make sure the data types are suitable for our purpose
- Determine which variables are relevant for our prediction task
- Extract the response and feature variables
- Visually inspect our response variables to see what is going on in the data

# This notebook uses prompts
- In Google Colab, you can use Colab AI to create LLM-generated code.

### 2.1.1 Import Dr. D's Dataset (a csv file)

In [None]:
# 0a Connect our Google Drive and switch to the folder that contains our data
from google.colab import drive
drive.mount('/content/gdrive')

# 0b Change permanently into directory where data files are located
%cd /content/gdrive/MyDrive/488/Class17

In [None]:
# 0c See files that are in the current directory
# shell command that lists the files in the current working directory of the notebook environment (a ! command has no lasting effect)
!ls

**A note on shell commands in python notebooks:** The difference between **!** and **%**

- **!** calls out to a shell (in a new process),
- **%** affects the process associated with the notebook (or the notebook itself)
- many **%** commands have no shell counterpart.

***!cd foo***, by itself, has no lasting effect, since the process with the changed directory immediately terminates.

***%cd foo*** changes the current directory of the notebook process, which is a lasting effect.

In [None]:
# 1a Import Pandas so we can load our data into a dataframe
import pandas as pd

# 1b import the data file
customers_df = pd.read_csv("DrDsAmazinGroceryStore1.csv") # read from the current working directory set in step 0b

# 1c Take a look at the first 5 rows
customers_df.head()

### 2.1.2 Inspect the data we have available

In [None]:
# prompt: describe customers_df

customers_df.describe()


### 2.1.3 Make sure the data types are suitable for our purpose  

- Which columns would be useful to predict which segment a customer belongs to?
- Can we use the columns of interest in a machine learning model that predicts a customer's segment?

In [None]:
# prompt: Using dataframe customers_df: tell me about the data types in this df

customers_df.info()


In [None]:
# prompt: Using dataframe customers_df: convert a string column 'Segment' to a categorical type and then map its categories to numeric codes in a DataFrame called customers_df. Show the DataFrame's first few rows

import pandas as pd

# Convert the 'Segment' column to a categorical type
customers_df['Segment'] = customers_df['Segment'].astype('category')

# Map the categories to numeric codes
customers_df['SegID'] = customers_df['Segment'].cat.codes

# Show the first few rows of the DataFrame
customers_df.head()


In [None]:
# prompt: For SegID in df, make a dictionary that converts to Segment

# Create a dictionary to map SegID to Segment
segment_mapping = dict(zip(customers_df['SegID'], customers_df['Segment']))

segment_mapping
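
It is worth checking how the numeric codes are assigned: `cat.codes` numbers the categories in their stored order, which for a plain string column is alphabetical. The sketch below uses a small made-up DataFrame (`toy` and its rows are illustrative, not Dr. D's data) to show the codes and a code-to-label dictionary built from `cat.categories`.

```python
import pandas as pd

# Made-up rows for illustration only (not Dr. D's data)
toy = pd.DataFrame({"Segment": ["Yuppies", "Families", "Seniors", "Families"]})

# Same two steps as in the notebook: convert to categorical, then take the codes
toy["Segment"] = toy["Segment"].astype("category")
toy["SegID"] = toy["Segment"].cat.codes

# Categories are stored in sorted order; code i refers to categories[i]
code_to_label = dict(enumerate(toy["Segment"].cat.categories))

print(toy)
print(code_to_label)
```

Building the dictionary from `cat.categories` gives one entry per category even if a category happens not to appear in the rows being inspected.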


In [None]:
# prompt: Using dataframe customers_df: check the mean by `segment` column for organic products

customers_df.groupby(['Segment']).Organic.mean()


### 2.1.4 Extract the response and feature variables

- We only want to use the variables of interest for our prediction of customer segments.
- These are the:
    - average monthly revenue (`Spending`)
    - the average number of trips per month (`Trips`)
    - the average basket size (`Basket`)
    - the percent share of private label products (`PLabel`)
    - and the percent share of organic products (`Organic`)

*We will ignore the variable "StoreCC" for now.*

**Let's create two arrays for our prediction problem:**
- The first array holds the segment data (our response variable).
- The second array holds the variables of interest (our feature variables).

In [None]:
# prompt: Using dataframe customers_df: extract feature variables and a response variable SegID from a pandas DataFrame customers_df for machine learning, excluding specific columns, and display their shapes using numpy

import pandas as pd
import numpy as np

# Define the columns to exclude
exclude_cols = ['CustomerID', 'CustomerName', 'CustomerNick', 'Segment', 'StoreCC', 'SegID']

# Extract feature variables
feature_cols = [col for col in customers_df.columns if col not in exclude_cols]
X = customers_df[feature_cols]

# Extract the response variable
y = customers_df['SegID']

# Print the shapes of X and y
print('X shape:', X.shape)
print('y shape:', y.shape)


### 2.1.5 Visual inspection of the data

Let's visually inspect how the features in our dataset relate to one another!

***What can we already learn by just "eyeballing" the data?***

In [None]:
# 5a Import seaborn library for visualizing data conveniently
import seaborn as sns

# 5b Construct a new dataframe that contains our response and feature variables as input to our plot
feature_df =  customers_df.drop(['SegID', 'CustomerID', 'CustomerName', 'Segment','StoreCC', 'CustomerNick'], axis=1)
response_df = customers_df['Segment']

# 5c Join the dataframes
joint_df = pd.concat([response_df, feature_df], axis=1)

# 5d Create a matrix of scatter plots
sns.set(style="ticks")
sns.pairplot(joint_df, vars=['Basket','PLabel','Organic', 'Spending','Trips'], hue='Segment')


In [None]:
# prompt: Using dataframe customers_df: visualize relationships between multiple variables in a pandas DataFrame using seaborn's pairplot, including specific features and a categorical hue
# An alternative to the above code block

import seaborn as sns

# Define the features to include in the pairplot
features = ['Spending', 'Basket', 'Trips', 'PLabel', 'Organic']

# Create the pairplot with the specified features and hue
sns.pairplot(customers_df, vars=features, hue='Segment')


# 3. Predicting Customer Segments using Machine Learning


## 3.1 We will use Supervised (Machine) Learning to solve Dr. D's problem

**Objective** of Supervised (Machine) Learning: Automate time-consuming or expensive manual tasks  

**Examples:**
- Doctor’s diagnosis
- Make predictions about the future
- Will a customer click on an ad or not?

**Requires:** Labeled data  
- Historical data with labels
- Experiments to get labeled data
- Crowd-sourcing labeled data

**Tasks/Models:**
- Classification: should we target a consumer?
- Regression: how much revenue can we expect from a consumer?

**Binary vs. Multiclass Prediction**\
Today, firms largely favor classification models.  
The reason is that most analytical problems involve a decision that requires a simple Yes/No answer:
 - Will a customer churn or not?
 - Will a customer respond to an ad campaign or not?
 - Will the firm default or not?

In these cases, we use binary classification.

However, **it is also possible to predict multiple classes** at once! Instead of a Yes/No (i.e, positive vs. negative) prediction, a supervised classification model can also be trained to predict multiple classes (e.g., segment 1 vs. segment 2 vs. segment 3). We call this **multiclass prediction**, and we will use it to help Dr. D. segment new customers.



## 3.2 The K-Nearest Neighbors (KNN) Machine Learning Algorithm for Classification


### *Show me who your friends are, and I’ll tell you who you are*

The concept of KNN can hardly be described more simply. This is an old saying, which can be found in many languages and many cultures.


**Basic idea:** Predict the label of a data point by  
- Looking at the ‘k’ closest labeled data points
- Taking a majority vote  

**Underlying Principle**:
- Find a predefined number (k) of training samples closest in distance to the new sample that has to be classified
- The label of the new sample is determined from the labels of these neighbors
- k is a fixed, user-defined constant: the number of neighbors has to be chosen before the model is fit


![KNN explained from www.python-course.eu/images/k_NN.png](http://www.python-course.eu/images/k_NN.png "KNN Intuition")
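
To make the majority vote concrete, here is a minimal from-scratch sketch in plain Python. The function name `knn_predict`, the two features, and the toy points are made up for illustration; the rest of the notebook uses scikit-learn's `KNeighborsClassifier` instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every labeled point
    distances = [(math.dist(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Keep the k closest labeled points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up customers: (basket size, trips per month)
train_points = [(10, 2), (12, 3), (11, 2), (40, 8), (42, 9), (38, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, query=(13, 3), k=3))  # prints A
print(knn_predict(train_points, train_labels, query=(39, 8), k=3))  # prints B
```

Note that nothing is "learned" during training: the model simply stores the labeled points and does all of its work at prediction time.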


### 3.2.1 *Let's train a model that can predict to which segment a customer belongs using K-Nearest Neighbors (KNN)*

In [None]:
# prompt: split the dataset into training and test sets using scikit-learn, then instantiate and fit a KNeighborsClassifier model

# Import the necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=21)

# 2 Check if our sample is split as we expected
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

# Instantiate a KNeighborsClassifier with 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model
knn.fit(X_train, y_train)

### 3.2.2 So how good is our model at predicting which segment a customer belongs to?

***Let's predict the segment of all customers in our test data set and check how often our model was right!***

In [None]:
# 1 Run prediction on test data
y_pred = knn.predict(X_test)
print("Test set predictions: \n {}".format(y_pred),"\n")

# 2 Calculate the accuracy of our prediction using np.mean
print("Accuracy of Prediction (Manual scoring): {:.2f}".format(np.mean(y_pred==y_test)))

# 3 Alternatively, we can use knn's internal score function
print("Accuracy of Prediction (KNN internal scoring): {:.2f}".format(knn.score(X_test, y_test)))

# 4 Alternatively, we can import a scoring function from sklearn.metrics
from sklearn.metrics import accuracy_score
print(f"Accuracy of Prediction (sklearn scoring): {round(accuracy_score(y_test, y_pred)*100,2)}%")
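
Accuracy is a single number; a confusion matrix additionally shows which segments get mixed up with which. The sketch below uses made-up labels (`toy_true` and `toy_pred` are hypothetical, not output of the model above) and checks that the manual and library accuracy agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted segment codes for eight customers
toy_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
toy_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Accuracy by hand: the share of positions where prediction and truth match
manual_accuracy = np.mean(toy_true == toy_pred)
library_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true segments, columns are predicted segments
cm = confusion_matrix(toy_true, toy_pred)

print("Manual accuracy: ", manual_accuracy)
print("Library accuracy:", library_accuracy)
print(cm)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the total number of customers equals the accuracy.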

### 3.2.3 Predicting the segment of a new customer that we have no segment information on

- Basket_Size is 16
- Share_Private_Label is 25%
- Share_Organic is 25%
- Revenue is USD 400
- Number of Trips is 6

In [None]:
# prompt: Give me a row from X_train where y_train is not Senior

import numpy as np
# Find the index where y_train is not equal to 'Senior' (assuming 'Senior' is represented by its numeric code)
# Assuming 'Senior' corresponds to SegID 2 in your data
non_senior_indices = np.where(y_train != 2)[0]

# If non_senior_indices is empty, no such row exists. Otherwise print the first such row
if len(non_senior_indices) > 0:
  first_non_senior_index = non_senior_indices[0]
  print(X_train.iloc[first_non_senior_index])
else:
  print("No rows found where y_train is not 'Senior'")


In [None]:
# prompt: If I give a new customer X data [50, 25, 25, 200, 6], what are those categories?

import pandas as pd

# The values must follow the column order of X (the features the model was trained on).
# Assumption: X's columns are ordered Basket, PLabel, Organic, Spending, Trips; verify with X.columns
new_customer_data = pd.DataFrame([[50, 25, 25, 200, 6]], columns=X.columns)

# Predict the segment for the new customer with the KNN model fitted above
predicted_segment_id = knn.predict(new_customer_data)

# Map the predicted segment ID back to the segment name, using the codes stored in customers_df
segment_mapping = dict(zip(customers_df['SegID'], customers_df['Segment']))
predicted_segment = segment_mapping[predicted_segment_id[0]]

print(f"The predicted segment for the new customer is: {predicted_segment}")


In [None]:
# prompt: predict the customer segment for a new record using a pre-trained KNeighborsClassifier model, and translate the numeric prediction back to a segment name using a dictionary

# Create a new record
x_new = np.array([[50, 25, 25, 200, 6]])

# Predict the customer segment using the pre-trained model
prediction = knn.predict(x_new)

# Create a dictionary to translate numeric prediction to segment names
replace_map = {'Segment': {2: 'Yuppies', 1: 'Seniors', 0: 'Families'}}

# Print the predicted segment name
print("Predicted Segment:", replace_map.get('Segment', {}).get(prediction[0]))
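
A single predicted label hides how close the vote was. As a small sketch on made-up one-dimensional data (`toy_X`, `toy_y`, and `toy_knn` are hypothetical names, not the notebook's model), `kneighbors` returns the neighbors that voted and `predict_proba` returns the share of votes per class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: one feature, two classes
toy_X = np.array([[0.0], [1.0], [4.0], [8.0], [9.0], [10.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_knn = KNeighborsClassifier(n_neighbors=3).fit(toy_X, toy_y)

query = np.array([[5.5]])

# Distances to, and row indices of, the 3 nearest training points
distances, indices = toy_knn.kneighbors(query)
neighbor_labels = toy_y[indices[0]]

# Share of neighbors voting for each class, and the winning class
vote_shares = toy_knn.predict_proba(query)[0]
predicted = toy_knn.predict(query)[0]

print("Neighbor labels:", neighbor_labels)
print("Vote shares:    ", vote_shares)
print("Prediction:     ", predicted)
```

Here two of the three nearest points belong to class 1, so the prediction is class 1 with a vote share of two thirds.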


# 4. Improving our Prediction

**So how can we do better in our prediction?**  
There are a few things we can easily change:
- The distribution of segments (labels) within the train and test samples
- The size of the testing sample
- The value for k
- The scales of the input data

## 4.1 Even Distribution of Labels
- We want to have the same distribution of labels in our training and testing sets
  - use the `stratify` argument of `train_test_split`, set to the dependent variable (here, `y`)
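
The effect of `stratify` is easiest to see on imbalanced labels. The sketch below uses made-up labels (60, 30, and 10 customers in three hypothetical segments; `toy_X` is only a placeholder feature) and compares the segment counts in the test set with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 60 / 30 / 10 customers in segments 0 / 1 / 2
toy_y = np.array([0] * 60 + [1] * 30 + [2] * 10)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature, not used for anything

# Plain random split
_, _, _, y_test_plain = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=21)

# Stratified split: each segment keeps its share in both parts
_, _, _, y_test_strat = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=21, stratify=toy_y)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)

print("Test-set counts without stratify:", plain_counts)
print("Test-set counts with stratify:   ", strat_counts)
```

With stratification the test set keeps the 60/30/10 proportions of the full data; without it the counts depend on the random draw, which matters most for small segments.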

In [None]:
# prompt: Show how to split a dataset into training and test sets, train a KNeighborsClassifier, make predictions on the test set, and calculate accuracy using scikit-learn

# 1. Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=21, stratify=y)

# 2. Instantiate the KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)

# 3. Fit the model
knn.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = knn.predict(X_test)

# 5. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 6. Print the accuracy
print(f"Accuracy: {accuracy}")


## 4.2 Size of Test Set
- The larger the testing set, the less data the classifier has to train on
  - Make test set smaller: How much?

In [None]:
# prompt: Explain how to prepare a dataset for machine learning by splitting it into training and test sets, training a KNeighborsClassifier with 4 of neighbors, predicting on the test data, and computing the model's accuracy, all using scikit-learn

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

# Instantiate the KNeighborsClassifier with 4 neighbors
knn = KNeighborsClassifier(n_neighbors=4)

# Train the model on the training data
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Calculate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy}")


## 4.3 Overfitting and Underfitting: Finding "k"

Changing k leads to different results. So what is the *right* value for k?

**Let's make the impact of setting k to different values visible:**
- Show boundaries of each class (i.e., segment) in a graph
- These "Decision Boundaries" separate our three segments from one another

### 4.3.1 Visualizing Decision Boundaries for KNN

In [None]:
# 0 Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import minmax_scale
from matplotlib.colors import ListedColormap
from sklearn.neighbors import KNeighborsClassifier

# 1 First split the sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

# 2 Select features and set step size
X2 = minmax_scale(X_train.iloc[:, [0, 1]].values)  # .values to ensure it's an array
y2 = y_train
h = .01  # step size in the mesh

# 3 Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

# 4 Plot for different values of k
for k in [2,4,8,16,32,99]:
    # 4a We create an instance of the neighbors classifier and fit the data.
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X2, y2)

    # 4b Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X2[:, 0].min() - .1, X2[:, 0].max() + .1
    y_min, y_max = X2[:, 1].min() - .1, X2[:, 1].max() + .1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # 4c Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(10,5))
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')

    # 4d Plot also the training points
    plt.scatter(X2[:, 0], X2[:, 1], c=y_train, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i)"
              % (k), fontsize=20)

plt.show()

### Study the above graphs carefully. What do you see as k gets larger?

- As k gets larger, the boundaries become smoother
- As k approaches the number of customers, the whole graph will take on a single color

### --> Model Complexity

- Larger k = smoother decision boundary = less complex model
- Smaller k = more complex model = can lead to overfitting
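
The two extremes can be checked numerically. The sketch below uses synthetic, overlapping clusters (all data and names such as `toy_X` and `results` are made up for illustration) and compares k = 1, a moderate k, and k equal to the number of training points.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: three overlapping 2-D clusters of unequal size
rng = np.random.default_rng(0)
centers = [(0, 0), (2, 0), (1, 2)]
sizes = [70, 60, 50]
toy_X = np.vstack([np.array(c) + rng.normal(scale=1.0, size=(n, 2))
                   for c, n in zip(centers, sizes)])
toy_y = np.repeat([0, 1, 2], sizes)

Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

results = {}
for k in (1, 15, len(Xtr)):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    # Store (training accuracy, testing accuracy) for this k
    results[k] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print(f"k={k:3d}  train={results[k][0]:.2f}  test={results[k][1]:.2f}")

# After the loop, `model` is the classifier with k = number of training points
n_classes_predicted = len(np.unique(model.predict(Xte)))
print("Distinct classes predicted at the largest k:", n_classes_predicted)
```

With k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect regardless of noise; with k equal to the training-set size every prediction is the most frequent class.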

### 4.3.2 Overfitting and Underfitting

#### The goal of a good machine learning model is to generalize well from the training data to any data from the problem domain.
- This allows us to make predictions in the future on data the model has never seen.

A model can be poorly trained by overfitting or underfitting the data:

- **Overfitting**
    - Happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.   
    - An overfit model is one where performance on the training set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.  
    

- **Underfitting**
    - Refers to a model that can neither model the training data nor generalize to new data.   
    - If a model cannot generalize well to new data, then it cannot be leveraged for classification or prediction tasks.
    - Generalization of a model to new data is ultimately what allows us to use machine learning algorithms every day to make predictions and classify data.
    - High bias and low variance are good indicators of underfitting.

We can test to what extent different values of k overfit or underfit the data.  

We proceed as follows:
1. Compute and plot the training and testing accuracy scores for a variety of different neighbor values (k).
2. Inspect how the accuracy scores differ for the training and testing sets with different values of k

In [None]:
# prompt: evaluate and plot the training and testing accuracy of a KNeighborsClassifier for varying numbers of neighbors

# 1. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=21, stratify=y)

# 2. Setup arrays to store train and test accuracies
neighbors = np.arange(1, 25)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

# 3. Loop over different values of k
for i, k in enumerate(neighbors):
    # 3a Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # 3b Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # 3c Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # 3d Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# 4. Generate plot
plt.figure(figsize=(15,8))
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend(fontsize=15)
plt.xlabel('Number of Neighbors', fontsize=15)
plt.ylabel('Accuracy', fontsize=15)
plt.title('k-NN: Varying Number of Neighbors', fontsize=20)
plt.show()


In [None]:
# prompt: instantiate, fit, and evaluate a KNeighborsClassifier with an optimal number of neighbors on a test dataset, including calculating the model's accuracy

# Instantiate the KNeighborsClassifier with an optimal number of neighbors
knn = KNeighborsClassifier(n_neighbors=4)

# Fit the model
knn.fit(X_train, y_train)

# Make the prediction for the test set
y_pred = knn.predict(X_test)


# Print the accuracy
print("Accuracy of Prediction: {:.2f}".format(knn.score(X_test, y_test)))
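
Reading k off the test-accuracy curve uses the test set twice: once to choose k and once to report accuracy. One common alternative, sketched here on synthetic data, is to choose k by cross-validation on the training data only. The data, the 5-fold setting, and names such as `toy_X_train` and `best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a training set: three overlapping 2-D clusters
rng = np.random.default_rng(1)
centers = [(0, 0), (2, 0), (1, 2)]
toy_X_train = np.vstack([np.array(c) + rng.normal(scale=1.0, size=(60, 2))
                         for c in centers])
toy_y_train = np.repeat([0, 1, 2], 60)

# Mean 5-fold cross-validated accuracy for each candidate k
candidate_k = list(range(1, 25))
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k),
                    toy_X_train, toy_y_train, cv=5).mean()
    for k in candidate_k
]

# Pick the k with the highest mean accuracy (the first one in case of ties)
best_k = candidate_k[int(np.argmax(cv_means))]
print("Best k by cross-validation:", best_k)
```

The test set is then used only once, to report the accuracy of the model refit with `best_k`.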

## 4.4 It appears that the features that may drive the segment memberships of customers are on different scales.

What happens when we re-scale them to the same scale?
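
Why this matters for KNN: distances are computed on the raw numbers, so a feature measured in hundreds of dollars outweighs one measured in single digits. A minimal sketch in plain Python with made-up customers (the values and the helper names `min_max` and `nearest_to` are hypothetical) shows the nearest neighbor changing after min-max scaling, x' = (x - min) / (max - min).

```python
import math

# Made-up customers: (monthly spending in USD, trips per month)
customers = {
    "a": (200.0, 2.0),
    "b": (210.0, 12.0),
    "c": (230.0, 2.0),
    "d": (600.0, 10.0),
}

def min_max(values):
    """Rescale a list of numbers to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def nearest_to(target, points):
    """Name of the point closest (Euclidean distance) to `target`, excluding itself."""
    return min((name for name in points if name != target),
               key=lambda name: math.dist(points[name], points[target]))

# Scale each feature (column) separately
names = list(customers)
scaled_columns = [min_max([customers[n][j] for n in names]) for j in range(2)]
scaled = {n: (scaled_columns[0][i], scaled_columns[1][i]) for i, n in enumerate(names)}

raw_nearest = nearest_to("a", customers)
scaled_nearest = nearest_to("a", scaled)
print("Nearest to a on raw data:   ", raw_nearest)     # prints b
print("Nearest to a on scaled data:", scaled_nearest)  # prints c
```

On the raw data, customer b (10 more trips per month, 10 dollars more spending) counts as closer to a than customer c (same number of trips, 30 dollars more spending); after scaling, both features contribute on the same footing and c becomes the nearest neighbor.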

In [None]:
# X is a pandas DataFrame; minmax_scale returns a NumPy array with every column rescaled to [0, 1]
X_scaled = minmax_scale(X)

# One row of plots per feature: scaled data on the left, original data on the right
n_features = X.shape[1]
fig, ax = plt.subplots(nrows=n_features, ncols=2, figsize=(16,16))
ax[0,0].set_title("Scaled Data", fontsize=20)
ax[0,1].set_title("Original Data", fontsize=20)
for i in range(n_features):
    sns.histplot(X_scaled[:,i], ax=ax[i,0], color='y')  # NumPy array: plain column indexing
    sns.histplot(X.iloc[:,i], ax=ax[i,1])               # DataFrame: integer-location indexing
plt.show()

# Technical note: sns.histplot shows counts on the y-axis by default, not a probability density.

### 4.4.1 Let's build/train our KNN model again - this time with the re-scaled variables

In [None]:
# 1 Split sample into train and test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=21, stratify=y)

# 2 Import the k-nearest neighbors classifier from scikit-learn
from sklearn.neighbors import KNeighborsClassifier

# 3 Instantiate the KNeighborsClassifier with a n_neighbors value of 4
knn = KNeighborsClassifier(n_neighbors=4)

# 4 Fit the model
knn.fit(X_train, y_train)

# 5 Make the prediction for the test set
y_pred = knn.predict(X_test)

# 6 And calculate the accuracy
print("Accuracy of Prediction: {:.2f}".format(knn.score(X_test, y_test)))

In [None]:
# prompt: Show how to prepare scaled data for machine learning by splitting into training and test sets, train a KNeighborsClassifier with a specified number of neighbors, make predictions on the test data, and compute the model's accuracy

# Split the scaled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.4, random_state=21, stratify=y)

# Instantiate the KNeighborsClassifier with 4 neighbors
knn = KNeighborsClassifier(n_neighbors=4)

# Train the model
knn.fit(X_train, y_train)

# Make predictions on the test data
y_pred = knn.predict(X_test)

# Compute the model's accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy: {}".format(accuracy))
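
In the cells above, `minmax_scale(X)` is applied to the full dataset before splitting, so the test rows also influence the minimum and maximum used for scaling. A common alternative, sketched here on synthetic data, is to put a `MinMaxScaler` and the classifier into one pipeline so that the scaler is fit on the training rows only. The data and names such as `toy_X` and `pipe` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic customers with two features on very different scales
rng = np.random.default_rng(0)
n = 50
spending = np.concatenate([rng.normal(m, 40, n) for m in (200, 400, 600)])  # USD
trips = np.concatenate([rng.normal(m, 1.5, n) for m in (8, 4, 6)])          # per month
toy_X = np.column_stack([spending, trips])
toy_y = np.repeat([0, 1, 2], n)

Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=21, stratify=toy_y)

# The pipeline learns min and max from the training rows, then applies them to any new rows
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=4))
pipe.fit(Xtr, ytr)

pipeline_accuracy = pipe.score(Xte, yte)
print("Accuracy: {:.2f}".format(pipeline_accuracy))
```

A further convenience of this design is that a new customer can be passed to `pipe.predict` in original units; the pipeline rescales the values before the neighbors are looked up.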