What is k-Nearest Neighbors (k-NN)?
k-Nearest Neighbors (k-NN) is one of the simplest and most intuitive supervised learning algorithms.
- It’s used for classification (predicting categories) and regression (predicting continuous values). 
- The idea (see the sketch after this list):
  - Store all training data.
  - To predict a new point, look at its k closest neighbors (using a distance metric, usually Euclidean).
  - For classification: take a majority vote of the neighbors' classes.
  - For regression: take the average of the neighbors' values.
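To make the idea concrete, here is a minimal from-scratch sketch of the classification case in plain NumPy (the function name `knn_predict` and its structure are just illustrative, not a standard API):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over their labels
    # (for regression you would instead return y_train[nearest].mean())
    return Counter(y_train[nearest]).most_common(1)[0][0]
```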
 
Example with Iris 🌸
- Suppose k=3. 
- A new flower is measured: [5.1, 3.5, 1.4, 0.2].
- The algorithm finds the 3 closest flowers in training data. 
- If 2 are Setosa and 1 is Versicolor, prediction = Setosa. 
👉 k-NN is called a “lazy learner” because it doesn’t build a mathematical model; it just stores the training set and uses it when making predictions.
Choosing k
- Small k (like 1) → very sensitive to noise (overfits). 
- Large k → smoother, but may miss details (underfits). 
- An odd k (3, 5, 7) is usually chosen to avoid ties; the sketch below compares a few values.
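As a rough illustration of this trade-off, a small sketch that compares a few values of k on the full Iris dataset with 5-fold cross-validation (the exact scores will vary; the chosen k values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare a few values of k using 5-fold cross-validation
for k in [1, 3, 5, 7, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```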
Example: Classifying a New Flower with k-NN
Training Data (simplified)
Suppose we only have 6 flowers in our training set:
| Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Species (Label) |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | Setosa (0) | 
| 4.9 | 3.0 | 1.4 | 0.2 | Setosa (0) | 
| 5.8 | 2.7 | 4.1 | 1.0 | Versicolor (1) | 
| 6.0 | 2.7 | 5.1 | 1.6 | Versicolor (1) | 
| 6.3 | 3.3 | 6.0 | 2.5 | Virginica (2) | 
| 5.8 | 2.7 | 5.1 | 1.9 | Virginica (2) | 
🌸 New Flower to Classify
Features = [5.7, 3.0, 4.2, 1.2]
(We don’t know the species; that’s what we want to predict.)
Step 1: Compute Distances
We use the Euclidean distance between two feature vectors:

d(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + (x₃ − y₃)² + (x₄ − y₄)² )

For example, the distance from the new flower [5.7, 3.0, 4.2, 1.2] to the first Setosa [5.1, 3.5, 1.4, 0.2]:

d = √( (5.7 − 5.1)² + (3.0 − 3.5)² + (4.2 − 1.4)² + (1.2 − 0.2)² )

Simplifying step by step:

d = √( 0.36 + 0.25 + 7.84 + 1.00 ) = √9.45 ≈ 3.07
👉 Do the same for all 6 training samples; the short sketch below computes every distance.
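For completeness, a quick NumPy sketch that computes all six distances for this example (the arrays mirror the table above):

```python
import numpy as np

X_train = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Setosa
    [4.9, 3.0, 1.4, 0.2],  # Setosa
    [5.8, 2.7, 4.1, 1.0],  # Versicolor
    [6.0, 2.7, 5.1, 1.6],  # Versicolor
    [6.3, 3.3, 6.0, 2.5],  # Virginica
    [5.8, 2.7, 5.1, 1.9],  # Virginica
])
x_new = np.array([5.7, 3.0, 4.2, 1.2])

# Euclidean distance from the new flower to every training flower
distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
print(distances.round(2))  # [3.07 3.08 0.39 1.07 2.32 1.18]
```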
Step 2: Find Nearest Neighbors
Computing all six distances, the 3 smallest (k=3) belong to:
- A Versicolor (distance ≈ 0.39)
- Another Versicolor (distance ≈ 1.07)
- A Virginica (distance ≈ 1.18)
Step 3: Majority Voting
- Versicolor = 2 votes 
- Virginica = 1 vote 
- Setosa = 0 votes 
✅ Predicted Class = Versicolor (1)
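The same steps in code, as a self-contained sketch that reuses the distances from Step 1 (values rounded to two decimals):

```python
import numpy as np
from collections import Counter

# Distances from Step 1, in the same row order as the table
distances = np.array([3.07, 3.08, 0.39, 1.07, 2.32, 1.18])
y_train = np.array([0, 0, 1, 1, 2, 2])  # 0=Setosa, 1=Versicolor, 2=Virginica

# Step 2: indices of the 3 nearest neighbors
nearest = np.argsort(distances)[:3]       # -> array([2, 3, 5])

# Step 3: majority vote over their labels
votes = Counter(y_train[nearest])         # -> Counter({1: 2, 2: 1})
prediction = votes.most_common(1)[0][0]
print("Predicted class:", prediction)     # 1 -> Versicolor
```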
Why Scaling Matters in k-NN
1. k-NN is based on distance
- Prediction is made by finding the closest neighbors in feature space. 
- So, features with larger numeric ranges contribute more to the distance. 
2. Example of Imbalance
Suppose we have two features to classify fruits:
- Weight (in grams) → ranges from 100g to 1000g. 
- Color (encoded 0=green, 1=red). 
Now compare two fruits:
- Fruit A = [150, 0] (150 g, green)
- Fruit B = [900, 1] (900 g, red)
- New Fruit = [160, 1] (160 g, red)
Distances from the new fruit:
- To A: √((160 − 150)² + (1 − 0)²) = √101 ≈ 10.05
- To B: √((160 − 900)² + (1 − 1)²) = √547600 = 740.0
👉 The “color” difference (0 vs 1) barely matters compared to the huge “weight” difference.
Even though color is very important for classifying fruits, it is effectively ignored.
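A quick check of those two distances in NumPy (the fruit arrays mirror the bullets above):

```python
import numpy as np

fruit_a   = np.array([150, 0])   # 150 g, green
fruit_b   = np.array([900, 1])   # 900 g, red
new_fruit = np.array([160, 1])   # 160 g, red

def dist(a, b):
    return np.sqrt(((a - b) ** 2).sum())

print(dist(new_fruit, fruit_a))  # ≈ 10.05 (almost same weight, different color)
print(dist(new_fruit, fruit_b))  # 740.0   (same color, but far apart in weight)
```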
3. Effect: Bias in Distance
- Features with large scales dominate. 
- Features with small scales are ignored. 
- Model may perform badly because it uses the “wrong” feature importance. 
Solution: Feature Scaling
Two common preprocessing techniques:
🔹 Normalization (Min-Max Scaling)
Rescales values into the range [0, 1]:

x_norm = (x − x_min) / (x_max − x_min)

Example: If weight ranges from 100–1000 g, then a 150 g fruit becomes

x_norm = (150 − 100) / (1000 − 100) = 50 / 900 ≈ 0.056

Now both weight and color are in comparable ranges.
🔹 Standardization (Z-score scaling)
Centers features around 0 with standard deviation 1:

z = (x − μ) / σ

Example: If petal length has mean μ = 3.7 cm and standard deviation σ = 1.7 cm, then a petal length of 5.1 cm becomes

z = (5.1 − 3.7) / 1.7 ≈ 0.82

After scaling, each feature contributes comparably to the distance calculation.
Visual Example (Iris)
Without scaling, suppose:
- Sepal length (cm) ranges from 4–8. 
- Petal length (cm) ranges from 1–7. 
Petal length has a larger spread, so distance is mostly determined by petal length.
After scaling, both features are equally important.
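A small sketch that checks those ranges on the actual Iris data and shows the effect of standardization (the 4–8 and 1–7 figures above are rounded):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # columns: sepal length, sepal width, petal length, petal width

# Raw ranges: petal length spans a wider interval than sepal length
print("raw min:", X.min(axis=0).round(1))
print("raw max:", X.max(axis=0).round(1))

# After standardization every feature has mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print("scaled mean:", X_std.mean(axis=0).round(2))
print("scaled std: ", X_std.std(axis=0).round(2))
```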
✅ Summary
- k-NN is distance-based, so scaling is critical. 
- Without scaling, large-valued features dominate. 
- Normalization and standardization put features on the same “footing.” 
- Always scale features when using k-NN, SVM, clustering, PCA, etc. 
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training data (features)
X_train = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [5.8, 2.7, 4.1, 1.0],
    [6.0, 2.7, 5.1, 1.6],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9]
])
y_train = np.array([0, 0, 1, 1, 2, 2])  # Labels: 0=Setosa, 1=Versicolor, 2=Virginica

# New flower
X_new = np.array([[5.7, 3.0, 4.2, 1.2]])

# Train k-NN with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Prediction
prediction = knn.predict(X_new)
print("Predicted class:", prediction)
# Output: Predicted class: [1] → Versicolor
```
Normalization and Standardization in Code
```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Example data: weight (100–1000 g), color (0 or 1)
X = np.array([[150, 0], [900, 1], [160, 1]])

# Min-Max Normalization
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
print("Normalized:\n", X_norm)

# Standardization
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print("Standardized:\n", X_std)
```
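Putting both pieces together: in practice, scaling and k-NN are usually chained in a scikit-learn Pipeline so the scaler is fitted on the training split only. A minimal sketch on Iris (the split and k=5 are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and k-NN chained together: the scaler is fit on the training split only
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```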