KNN

Introduction

K-nearest neighbors (kNN) is a supervised machine learning algorithm that can be used for both classification and regression tasks. I think of kNN as an algorithm drawn from real life: people tend to be influenced by the people around them.
A kNN classifier determines the class of a data point by majority vote. If k is set to 5, the classes of the 5 nearest points are examined, and the prediction is the most common class among them. Similarly, kNN regression predicts the mean value of the 5 nearest points.
kNN relies on measuring the distance between data points, and there are several ways to do so. Euclidean distance (Minkowski distance with p=2) is one of the most commonly used distance metrics. For two points in 2-dimensional space, it is the square root of the sum of the squared differences between their x and y coordinates.
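As a quick sanity check, the formula can be computed directly with NumPy (the two points here are made up for illustration):

```python
import numpy as np

# Two hypothetical points in 2-D space
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)  # 5.0
```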
 

How KNN works

  1. For an unseen data point, the algorithm calculates the distance between that point and every observation in the training dataset, across all features.
  2. It sorts those distances in ascending order.
  3. It selects the K observations with the smallest distances. These K observations are the K-nearest neighbors of the unseen data point.
      • Note that the dataset must contain at least K ≥ 1 observations.
  4. It determines which label is most common among those neighbors and assigns that label to the unseen data point.
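The steps above can be sketched as a minimal from-scratch classifier (the toy training data and the `knn_predict` helper are illustrative, not part of any library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Distance from x_new to every training observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2-3. Indices of the k smallest distances (the k nearest neighbors)
    nearest = np.argsort(dists)[:k]
    # 4. Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two small clusters of toy points
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```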

Explaining KNN to a non-technical audience

1. Look at Data Points:
  • Instead of neighbors, k-NN looks at data points or examples in a dataset. Each data point represents something, like the characteristics of a house, a person, or a product.
2. Measure Their Similarity:
  • k-NN measures how similar these data points are to each other. It looks at features like size, color, or other attributes to compare them.
3. Make a Prediction:
  • When you want to know something about a new data point, k-NN finds the k data points in the dataset that are most similar to the new one.
  • It then looks at what happened with those similar data points to make a prediction about the new data point. For example, if most of the similar houses were sold for a high price, k-NN might guess that the new house will also sell for a high price.
 

How to find the optimal value of K in KNN?

Finding the optimal value of K in k-Nearest Neighbors (k-NN) involves a process called hyperparameter tuning. The goal is to determine the value of K that results in the best performance for your specific dataset and problem. Here's a systematic approach to find the optimal K:
  1. Divide the Data: Split your dataset into three parts: a training set, a validation set, and a test set. The training set is used to train the model, the validation set is used to tune hyperparameters (including K), and the test set is used for final evaluation.
  2. Choose a Range of K Values: Decide on a range of K values to explore. This range should cover a reasonable span, typically from very small values (e.g., 1 or 2) to larger values (e.g., 20 or 30). The actual range depends on your dataset and problem.
  3. Cross-Validation: Implement a cross-validation loop on the training data. This loop involves the following steps:
      • For each K value in your chosen range, do the following:
        • Split the training data into multiple folds (e.g., 5 or 10).
        • For each fold, train a k-NN model with the current K value on the remaining folds and evaluate its performance on the held-out fold (the validation fold).
        • Calculate the average performance (e.g., accuracy, F1-score) of the k-NN models across all folds for the current K value.
      • Repeat this process for each K value in your range.
  4. Select the Best K: Based on the cross-validation results, choose the K value that gives the best average performance on the validation folds. This K value is considered the optimal K for your model.
  5. Final Evaluation: Train a k-NN model using the chosen optimal K on the entire training set (including the validation set). Then, evaluate its performance on the test set to get an unbiased estimate of its accuracy or other relevant metrics.
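The tuning procedure above might look like the following sketch using scikit-learn's `cross_val_score` (the Iris dataset and the 1-30 search range are stand-in choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is used here purely as a stand-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set for each candidate K
k_values = list(range(1, 31))
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5).mean()
             for k in k_values]

# Pick the K with the best average validation score,
# then evaluate once on the held-out test set
best_k = k_values[int(np.argmax(cv_scores))]
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best K:", best_k)
print("test accuracy:", knn.score(X_test, y_test))
```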
 

Pros and Cons

Pros

  • A lazy-learning algorithm: there is no explicit training step; a new data point is simply assigned the majority class of its neighbors in the historical data. Very easy to understand and implement.
  • Can be used for both classification and regression.
  • Only one hyper-parameter: k value.
  • Non-parametric, makes no assumptions about the data/parameters
    • Parametric: Assume statistical distributions in the data, several conditions need to be met (Ex: Linear Regression)
    • Non-Parametric: Makes no assumptions about data distribution/conditions for the data to meet.

Cons

  • Struggles in high dimensions: the greater the number of dimensions, the harder it is for the algorithm to calculate distances efficiently (the curse of dimensionality).
    • Often requires dimensionality reduction techniques, especially in regression tasks with noisy data.
  • Very sensitive to outliers & noise.
  • Can become very computationally expensive as the dataset grows: it needs a lot of memory and can become very slow on a large dataset.
  • K value selection: often need to estimate ranges or combine with cross-validation techniques to obtain the optimal k value.
    • A frequent technique is to plot an elbow graph to find the optimal k value.

Python Implementation

Python Example 1

```python
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# getting data
df = pd.read_csv('KNN_Project_Data')

# EDA
# THIS IS GOING TO BE A VERY LARGE PLOT
sns.pairplot(df, hue='TARGET CLASS', palette='coolwarm')

# Standardize variables
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))

# transform to scaled version
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))

# convert df to scaled features
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_feat.head()

# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_features, df['TARGET CLASS'], test_size=0.30)

# Using KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Predictions and evaluations
pred = knn.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

# Choosing a K value
error_rate = []
# Will take some time
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

# plot elbow
plt.figure(figsize=(10, 6))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed',
         marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

# Retrain with new K value
# NOW WITH K=30
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print('WITH K=30')
print('\n')
print(confusion_matrix(y_test, pred))
print('\n')
print(classification_report(y_test, pred))
```
 

Python Example 2

 
```python
# 1. Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 2. Normalizing & splitting the data
# df is assumed to be a DataFrame of transactions with a binary 'fraud' column

# Split the data into features (X) and target (y)
X = df.drop('fraud', axis=1)
y = df['fraud']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale the features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 3. Fitting and evaluating the model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
# Accuracy: 0.875
```
Unfortunately, there is no magic way to find the best value for k. We have to loop through many different values, then use our best judgment.
In the below code, we select a range of values for k and create an empty list to store our results. We use cross-validation to find the accuracy scores, which means we don’t need to create a training and test split, but we do need to scale our data. We then loop over the values and add the scores to our list.
```python
# 4. Using cross-validation to get the best value of k
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score

k_values = [i for i in range(1, 31)]
scores = []

scaler = StandardScaler()
X = scaler.fit_transform(X)

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5)
    scores.append(np.mean(score))

sns.lineplot(x=k_values, y=scores, marker='o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")

# 5. More evaluation metrics
# We can now train our model using the best k value.
best_index = np.argmax(scores)
best_k = k_values[best_index]

knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
# Accuracy: 0.875
# Precision: 0.75
# Recall: 1.0
```
We can see from our chart that k = 9, 10, 11, 12, and 13 all have an accuracy score of just under 95%. As these are tied for the best score, it is advisable to use the smaller value of k, because higher values of k draw on data points that are further from the original point. Another option would be to explore other evaluation metrics.