k-Nearest Neighbors (k-NN) classifier – Supervised Learning

What is k-Nearest Neighbors (k-NN)?

k-Nearest Neighbors (k-NN) is one of the simplest and most intuitive supervised learning algorithms.

  • It’s used for classification (predicting categories) and regression (predicting continuous values).

  • The idea (a minimal code sketch follows this list):

    1. Store all training data.

    2. To predict a new point, look at its k closest neighbors (using distance, usually Euclidean).

    3. For classification: take a majority vote of neighbors’ classes.

    4. For regression: take the average of neighbors’ values.
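
A minimal sketch of these four steps in plain NumPy (the function name and its arguments are illustrative, not a library API):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Steps 1–2: distances from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]          # indices of the k closest points
    if task == "classification":
        # Step 3: majority vote among the neighbors' classes
        return Counter(y_train[nearest]).most_common(1)[0][0]
    # Step 4: average of the neighbors' values (regression)
    return y_train[nearest].mean()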

Example with Iris 🌸

  • Suppose k=3.

  • A new flower is measured: [5.1, 3.5, 1.4, 0.2].

  • The algorithm finds the 3 closest flowers in training data.

  • If 2 are Setosa and 1 is Versicolor, prediction = Setosa.

👉 k-NN is called a “lazy learner” because it doesn’t build a mathematical model; it just stores the training set and uses it when making predictions.

Choosing k

  • Small k (like 1) → very sensitive to noise (overfits).

  • Large k → smoother, but may miss details (underfits).

  • An odd k (3, 5, 7) is usually chosen to avoid tied votes (this guarantee only holds for two classes); a quick way to compare candidate values of k is sketched below.
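
As a rough sketch (assuming scikit-learn and its built-in Iris dataset; exact scores will vary), candidate values of k can be compared with 5-fold cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 9, 11):
    # Average accuracy over 5 folds for this value of k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")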

Example: Classifying a New Flower with k-NN

Training Data (simplified)

Suppose we only have 6 flowers in our training set:

Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species (Label)
5.1 | 3.5 | 1.4 | 0.2 | Setosa (0)
4.9 | 3.0 | 1.4 | 0.2 | Setosa (0)
5.8 | 2.7 | 4.1 | 1.0 | Versicolor (1)
6.0 | 2.7 | 5.1 | 1.6 | Versicolor (1)
6.3 | 3.3 | 6.0 | 2.5 | Virginica (2)
5.8 | 2.7 | 5.1 | 1.9 | Virginica (2)

🌸 New Flower to Classify

Features = [5.7, 3.0, 4.2, 1.2]
(We don’t know the species; that’s what we want to predict.)

Step 1: Compute Distances

We use Euclidean distance:

 d(\text{point1}, \text{point2}) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + \dots}

For example, distance from new flower to the first Setosa:

 d = \sqrt{(5.7 - 5.1)^2 + (3.0 - 3.5)^2 + (4.2 - 1.4)^2 + (1.2 - 0.2)^2}

Simplifying step by step:

 = \sqrt{(0.6)^2 + (-0.5)^2 + (2.8)^2 + (1.0)^2}

 = \sqrt{0.36 + 0.25 + 7.84 + 1.00}

 = \sqrt{9.45} \approx 3.07

👉 Doing the same for all 6 training samples gives the full set of distances used in the next step.
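
A quick NumPy check of all six distances (rounded to two decimals):

import numpy as np

X_train = np.array([
    [5.1, 3.5, 1.4, 0.2],   # Setosa
    [4.9, 3.0, 1.4, 0.2],   # Setosa
    [5.8, 2.7, 4.1, 1.0],   # Versicolor
    [6.0, 2.7, 5.1, 1.6],   # Versicolor
    [6.3, 3.3, 6.0, 2.5],   # Virginica
    [5.8, 2.7, 5.1, 1.9],   # Virginica
])
x_new = np.array([5.7, 3.0, 4.2, 1.2])

# Euclidean distance from the new flower to every training flower
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
print(np.round(distances, 2))   # [3.07 3.08 0.39 1.07 2.32 1.18]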

Step 2: Find Nearest Neighbors

Computing all six distances, the 3 smallest (k=3) are to:

  • A Versicolor (distance ≈ 0.39)

  • Another Versicolor (distance ≈ 1.07)

  • A Virginica (distance ≈ 1.18)

Step 3: Majority Voting

  • Versicolor = 2 votes

  • Virginica = 1 vote

  • Setosa = 0 votes

✅ Predicted Class = Versicolor (1)
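
Continuing the NumPy sketch, the vote can be taken with a Counter (distances from Step 1, labels from the training table):

from collections import Counter
import numpy as np

distances = np.array([3.07, 3.08, 0.39, 1.07, 2.32, 1.18])
labels = np.array([0, 0, 1, 1, 2, 2])        # 0=Setosa, 1=Versicolor, 2=Virginica
nearest = np.argsort(distances)[:3]          # indices of the 3 closest flowers
votes = Counter(labels[nearest])             # two Versicolor votes, one Virginica
print(votes.most_common(1)[0][0])            # 1 → Versicolor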

Why Scaling Matters in k-NN

1. k-NN is based on distance

  • Prediction is made by finding the closest neighbors in feature space.

  • So, features with larger numeric ranges contribute more to the distance.

2. Example of Imbalance

Suppose we have two features to classify fruits:

  • Weight (in grams) → ranges from 100g to 1000g.

  • Color (encoded 0=green, 1=red).

Now compare two fruits:

  • Fruit A = [150, 0] (150g, green)

  • Fruit B = [900, 1] (900g, red)

  • New Fruit = [160, 1] (160g, red)

Distances:

  • To A:

     d_A = \sqrt{(160 - 150)^2 + (1 - 0)^2}

     = \sqrt{100 + 1}

     = \sqrt{101} \approx 10.05

  • To B:

     d_B = \sqrt{(160 - 900)^2 + (1 - 1)^2}

     = \sqrt{547600}

     = 740.0

👉 The “color” difference (0 vs 1) barely matters compared to the huge “weight” difference.
Even though color may be very important for classifying fruits, it is effectively ignored.
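
A short sketch of this effect using the three fruits above (fitting the scaler on just three points is purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

fruits = np.array([[150, 0], [900, 1]])      # Fruit A (green), Fruit B (red)
new_fruit = np.array([[160, 1]])             # new fruit (red)

def euclidean_distances(train, x):
    return np.sqrt(((train - x) ** 2).sum(axis=1))

print(euclidean_distances(fruits, new_fruit))   # ~[10.05, 740.0] → A looks closest

# After Min-Max scaling, weight no longer drowns out color
scaler = MinMaxScaler().fit(np.vstack([fruits, new_fruit]))
print(euclidean_distances(scaler.transform(fruits), scaler.transform(new_fruit)))
# ~[1.00, 0.99] → now B, the other red fruit, is the nearer neighbor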

3. Effect: Bias in Distance

  • Features with large scales dominate.

  • Features with small scales are ignored.

  • The model may perform badly because the distances reflect the “wrong” feature importances.

Solution: Feature Scaling

Two common preprocessing techniques:

🔹 Normalization (Min-Max Scaling)

Rescales values into the range [0,1].

 x' = \frac{x - \min(x)}{\max(x) - \min(x)}

Example: If weight ranges from 100–1000, then

 x' = \frac{160 - 100}{1000 - 100} = \frac{60}{900} \approx 0.067

Now both weight and color are in comparable ranges.

🔹 Standardization (Z-score scaling)

Centers features around 0 with standard deviation 1:

 x' = \frac{x - \mu}{\sigma}

Example: If petal length mean = 3.7 cm and std = 1.7, then

 x' = \frac{4.2 - 3.7}{1.7} \approx 0.29

After scaling, each feature contributes equally in distance calculations.

Visual Example (Iris)

Without scaling, suppose:

  • Sepal length (cm) ranges from 4–8.

  • Petal length (cm) ranges from 1–7.

Petal length has a larger spread, so distance is mostly determined by petal length.
After scaling, both features are equally important.

Summary

  • k-NN is distance-based, so scaling is critical.

  • Without scaling, large-valued features dominate.

  • Normalization and standardization put features on the same “footing.”

  • Always scale features when using k-NN, SVM, clustering, PCA, etc.
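
Code Example: k-NN Prediction with scikit-learn

The worked flower example above, reproduced with scikit-learn: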

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data (features)
X_train = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [5.8, 2.7, 4.1, 1.0],
    [6.0, 2.7, 5.1, 1.6],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9]
])
y_train = np.array([0, 0, 1, 1, 2, 2])  # Labels: 0=Setosa, 1=Versicolor, 2=Virginica

# New flower
X_new = np.array([[5.7, 3.0, 4.2, 1.2]])

# Train k-NN with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Prediction
prediction = knn.predict(X_new)
print("Predicted class:", prediction)

# Output: Predicted class: [1] → Versicolor

Normalization and Standardization in scikit-learn

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example data: weight (100–1000), color (0 or 1)
X = np.array([[150, 0], [900, 1], [160, 1]])

# Min-Max Normalization
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
print("Normalized:\n", X_norm)

# Standardization
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print("Standardized:\n", X_std)
