K-Nearest Neighbors, or KNN, is a supervised machine learning method used for both classification and regression. In scikit-learn, the main classes are KNeighborsClassifier and KNeighborsRegressor. The nearest-neighbors family predicts from nearby examples in the training set rather than learning a parametric equation, which is why it is often described as an instance-based or lazy learning method.
KNN makes predictions by looking at the closest training samples to a new point. For classification, it uses the neighbors’ class labels to vote on the prediction. For regression, it predicts from the target values of the nearby samples. Scikit-learn’s user guide describes KNeighborsClassifier as classification based on the k nearest neighbors of each query point, and KNeighborsRegressor as regression based on the k nearest neighbors of each query point.
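As a minimal, self-contained sketch of that idea (the points and labels below are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six made-up training points in 2D: three near the origin (class 0),
# three near (5, 5) (class 1).
X_train = np.array([[0, 0], [1, 0], [0, 1],
                    [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# A query near the origin gets class 0; one near (5, 5) gets class 1.
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```

The model stores the training points and, at prediction time, lets the three nearest ones vote.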
Imagine you want to classify a fruit using only a few simple measurements, say its size and color. If a new fruit is surrounded mostly by apples in feature space, KNN will likely classify it as an apple. If its closest neighbors are mostly oranges, it will likely predict orange.
That is the core idea: find the nearest k points and predict from them.

KNN is a popular first algorithm because it is simple, intuitive, and often effective on small to medium datasets. It is also flexible because the same idea works for both classification and regression. In scikit-learn, you can choose how many neighbors to use, how to weight them, and which distance metric to apply.
KNN is especially useful when the dataset is small to medium, the decision boundary may be irregular, and you want a simple, interpretable baseline.
The parameter k is the number of nearest neighbors considered when making a prediction. In scikit-learn, this is the n_neighbors parameter, whose default is 5 for both KNeighborsClassifier and KNeighborsRegressor.
If k is very small, such as 1, the model follows the training data very closely: it can fit noise and produce jagged decision boundaries (overfitting). If k is large, predictions are averaged over many points: the boundary becomes smoother, but local structure can be washed out (underfitting). So choosing k is a balance between overfitting with small k and underfitting with large k.
Suppose you have two classes, A and B. For a new point, KNN finds the k closest training samples and counts their class labels.
Scikit-learn describes KNeighborsClassifier as implementing the vote of the nearest neighbors, and it exposes methods such as kneighbors to inspect the nearest samples and their distances.
If k = 5 and the 5 nearest neighbors contain, say, three samples of class A and two of class B, then the prediction is class A.
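You can inspect such a vote directly with the kneighbors method; here is a small sketch on made-up 1D data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1D training data: class 0 clustered low, class 1 clustered high.
X_train = np.array([[1.0], [1.5], [2.0], [8.0], [9.0]])
y_train = np.array([0, 0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# kneighbors returns the distances and indices of the k nearest samples.
distances, indices = clf.kneighbors([[2.5]])
print(y_train[indices[0]])   # neighbor labels, nearest first: [0 0 0 1 1]
print(clf.predict([[2.5]]))  # majority vote of 3-to-2 -> class 0
```

With all five training samples as neighbors, the three class-0 points outvote the two class-1 points.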
KNN also works for regression. Instead of taking a majority vote, it combines the target values of nearby points. Scikit-learn states that KNeighborsRegressor predicts by local interpolation of the targets associated with the nearest neighbors.
If k = 3, the prediction combines the three nearest target values; with uniform weights it is simply their mean.
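A quick numeric check with KNeighborsRegressor, using made-up 1D data whose three nearest targets are 10, 12, and 14:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: the three points nearest x=1 have targets 10, 12, and 14;
# the fourth point is far away and should not influence the prediction.
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([10.0, 12.0, 14.0, 100.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# With uniform weights the prediction is the mean of 10, 12, and 14.
print(reg.predict([[1.0]]))  # [12.]
```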
KNN depends heavily on the notion of distance. In scikit-learn, the default metric for the main KNN estimators is Minkowski distance, with parameter p=2, which corresponds to Euclidean distance. Changing p changes the distance behavior, such as p=1 for Manhattan distance.
Common choices are Euclidean distance (p=2), Manhattan distance (p=1), and Minkowski distance with other values of p.
If your distance metric does not match the structure of your data, the model may perform badly.
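The difference between these metrics is easy to see by computing both for one pair of points with plain NumPy:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

# Euclidean (Minkowski with p=2): straight-line distance.
euclidean = np.sqrt(np.sum((a - b) ** 2))  # 5.0

# Manhattan (Minkowski with p=1): sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))          # 7.0

print(euclidean, manhattan)
```

The same pair of points is 5.0 apart under one metric and 7.0 under the other, so the choice of metric can change which neighbors count as "nearest".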
KNN is extremely sensitive to feature scale because distance calculations are central to the algorithm. If one feature has a much larger numeric range than another, it can dominate the distance. StandardScaler standardizes features by removing the mean and scaling to unit variance, and scikit-learn recommends Pipeline to chain preprocessing and the estimator so the same transformations are applied consistently during training and prediction.
Suppose you have an age feature measured in years and a salary feature measured in dollars. Without scaling, salary can overpower age in the distance calculation.
So for KNN, scaling is usually essential.
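A small sketch with made-up age/salary numbers shows scaling changing which point counts as the nearest neighbor:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# Made-up data: [age in years, salary in dollars].
X = np.array([
    [30, 50000],   # row 0: the query person also appears in the data
    [31, 55000],   # row 1: similar age, somewhat different salary
    [60, 50200],   # row 2: very different age, almost identical salary
    [25, 30000],   # row 3
    [50, 80000],   # row 4
])
query = np.array([[30, 50000]])

# Without scaling, salary dominates the distance, so after the identical
# row 0 the nearest point is row 2 (the 60-year-old with similar salary).
_, idx_raw = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(query)

# After standardization, age matters again and row 1 becomes the neighbor.
scaler = StandardScaler().fit(X)
_, idx_scaled = (NearestNeighbors(n_neighbors=2)
                 .fit(scaler.transform(X))
                 .kneighbors(scaler.transform(query)))

print(idx_raw[0])     # [0 2]
print(idx_scaled[0])  # [0 1]
```

The raw-feature model picks a 60-year-old as the closest match for a 30-year-old purely because their salaries are similar; standardization restores the influence of age.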
pip install numpy pandas matplotlib scikit-learn

We will use the Breast Cancer dataset from scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Pipeline: scaling + KNN
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
# Train
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

KNeighborsClassifier is the scikit-learn estimator for KNN classification, while Pipeline keeps preprocessing and prediction together in one object.
A pipeline prevents common mistakes such as fitting the scaler on the test data (data leakage) or forgetting to apply the same transformation at prediction time.
Scikit-learn documents Pipeline specifically as a tool to sequentially apply preprocessing steps and then a final predictor, and it is also useful for joint parameter selection during model tuning.
The make_moons dataset is very useful for understanding nonlinear decision boundaries.
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Create nonlinear dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))

import numpy as np
import matplotlib.pyplot as plt
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(
        np.linspace(x_min, x_max, 400),
        np.linspace(y_min, y_max, 400)
    )
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    plt.title("KNN Decision Boundary")
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()

plot_decision_boundary(model, X, y)

KNN can create very flexible nonlinear boundaries because predictions depend on local neighborhoods rather than a global linear equation.

Try a few values of k, such as 1, 5, and 15:

for k in [1, 5, 15]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k))
    ])
    model.fit(X_train, y_train)
    print(f"k={k}, accuracy={model.score(X_test, y_test):.4f}")

k=1 often produces a very irregular boundary, a moderate k often gives a better balance, and a large k makes the boundary smoother. Because n_neighbors directly controls how many nearby examples vote, it is one of the most important KNN hyperparameters.
Scikit-learn provides a weights parameter in both classifier and regressor forms of KNN. Common choices are:
"uniform": all selected neighbors contribute equally. "distance": closer neighbors contribute more.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
uniform_model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="uniform"))
])
distance_model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance"))
])
uniform_model.fit(X_train, y_train)
distance_model.fit(X_train, y_train)
print("Uniform weights accuracy:", uniform_model.score(X_test, y_test))
print("Distance weights accuracy:", distance_model.score(X_test, y_test))
Distance weighting can be useful when the selected neighbors lie at very different distances from the query point, so that closer neighbors deserve more influence on the prediction.
The scikit-learn nearest-neighbors classification example explicitly compares decision boundaries for different weights choices.
For KNeighborsClassifier and KNeighborsRegressor, the main parameters include:
n_neighbors, weights, algorithm, leaf_size, p, metric, and n_jobs.

n_neighbors: how many neighbors to use.
weights: whether neighbors all count equally or closer points count more.
p: controls the Minkowski distance, with p=1 → Manhattan and p=2 → Euclidean.
metric: the distance function used by the model.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2]
}
grid = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
best_model = grid.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test Accuracy:", test_score)
It searches across candidate k values, weighting schemes, and distance settings. This is the best practical way to find a strong KNN setup.
Choosing k based only on one train/test split can be unstable. Cross-validation gives a more reliable estimate by evaluating multiple folds of the training data. This is one of the core advantages of using scikit-learn’s model-selection tools together with pipelines.
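One way to do this, sketched here with cross_val_score and the Breast Cancer data used earlier:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validated accuracy for each candidate k.
for k in [1, 3, 5, 7, 9, 11]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"k={k}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over folds makes the comparison between k values far less sensitive to any one lucky or unlucky split.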
Now let us use KNeighborsRegressor for regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
# Synthetic data
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel()
# Add noise
y[::5] += 0.5 - rng.rand(40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=5, weights="distance"))
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))

X_plot = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
y_plot = model.predict(X_plot)
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label="Data")
plt.plot(X_plot, y_plot, linewidth=2, label="KNN prediction")
plt.xlabel("X")
plt.ylabel("y")
plt.title("KNN Regression")
plt.legend()
plt.show()

KNeighborsRegressor predicts from nearby target values, and scikit-learn’s regression example shows this behavior for uniform and distance weighting.
To see the effect of k in regression:

for k in [2, 5, 10, 20]:
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsRegressor(n_neighbors=k, weights="distance"))
    ])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"k={k}, R²={r2_score(y_test, y_pred):.4f}")

With a small k the fitted curve can become noisy, with a larger k it becomes smoother, and with a very large k the model may become too flat. What follows is a solid real-project workflow for KNN.
import pandas as pd
df = pd.read_csv("your_data.csv")
X = df.drop("target", axis=1)
y = df["target"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])
from sklearn.model_selection import GridSearchCV
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2]
}
grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X_train, y_train)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Params:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

new_samples = X_test.iloc[:5]
predictions = best_model.predict(new_samples)
print(predictions)
For KNN classification, common metrics are accuracy, precision, recall, F1-score, and the confusion matrix.
If classes are balanced, accuracy can be informative. If classes are imbalanced, you should look beyond accuracy and inspect the full classification report.
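A toy illustration (with made-up labels) of why accuracy alone can mislead on imbalanced data:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Made-up imbalanced ground truth: 95 negatives, 5 positives,
# and a model that simply predicts the majority class every time.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))                    # 0.95
print("Recall (positive class):", recall_score(y_true, y_pred))       # 0.0
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.5
```

The 95% accuracy hides the fact that every positive case was missed, which the recall and balanced-accuracy numbers expose immediately.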
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

For KNN regression, common metrics are MAE, MSE, RMSE, and R².
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)

KNN is attractive because it is simple and intuitive, it makes no strong assumptions about the shape of the data, and the same approach handles both classification and regression.
Since nearest-neighbors methods base predictions directly on nearby training examples, they can adapt well to local patterns without fitting a fixed global function.
KNN also has important limitations: prediction can be slow on large datasets, it is sensitive to feature scaling and irrelevant features, it struggles in high-dimensional spaces, and results depend heavily on the k and distance choice. Scikit-learn includes multiple nearest-neighbor search structures such as KDTree and BallTree, reflecting the fact that efficient neighbor lookup is an important practical concern in this family of algorithms.
Forgetting to scale features is one of the biggest KNN mistakes.
Bad:
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

Better:
model = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5))
])
Because KNN depends on distances, scaling can completely change which points count as nearest neighbors. StandardScaler and Pipeline are the standard scikit-learn tools for handling this correctly.
Do not choose k arbitrarily; in particular, do not keep k=5 just because it is the default.
Instead, tune k over a range of values with cross-validation. The documented default is 5, but there is nothing magical about it; it is just a starting point.
KNN can suffer badly when many features do not help, because they can distort distances. In practice, feature selection or dimensionality reduction may improve KNN performance.
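One sketch of that idea: reducing dimensionality with PCA before KNN. The choice of 10 components here is an arbitrary assumption that would normally be tuned:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Project the 30 original features onto 10 principal components
# before the distance computation.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

scores = cross_val_score(model, X, y, cv=5)
print("Mean CV accuracy with PCA:", scores.mean())
```

Because PCA sits inside the pipeline, the projection is learned on each training fold only, so the cross-validation estimate stays honest.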
KNN may be easy to fit, but prediction can become expensive because the algorithm must compare new points to stored training examples. Scikit-learn exposes algorithm choices such as auto, and data structures like KDTree and BallTree, precisely because neighbor search efficiency matters.
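A sketch comparing the available search backends on random data; on small datasets like this the differences are minor, and the timings will vary by machine:

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(2000, 3)                # 2000 random points in 3 dimensions
y = (X[:, 0] > 0.5).astype(int)      # arbitrary labels for the demo

preds = {}
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    start = time.perf_counter()
    preds[algorithm] = clf.predict(X)
    print(f"{algorithm}: {time.perf_counter() - start:.4f}s")

# All backends return the same predictions; only the search speed differs.
```

Leaving algorithm="auto" (the default) lets scikit-learn pick a structure based on the data, which is usually the right choice.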
Use KNN when the dataset is small to medium, the decision boundary may be irregular, and you want an intuitive, easy-to-tune model.
It is often a good early model to compare against more advanced methods.
Be careful with KNN when the dataset is very large, the feature space is high-dimensional, or many features are irrelevant or unscaled.
For classification:
Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5, weights="distance"))
])

For regression:
Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=5, weights="distance"))
])

This is a strong starting point because scaling is handled inside the pipeline, weights="distance" often improves local sensitivity, and k=5 is a reasonable baseline to tune from. The estimator defaults and options for weights, metric, p, and n_neighbors are documented in the current scikit-learn API.
Here is a compact but realistic mini-project using Iris.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load
data = load_iris()
X, y = data.data, data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])
# Search space
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2]
}
# Grid search
grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X_train, y_train)
# Evaluate
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

This mini-project exercises k, weights, and distance behavior in one tuned pipeline.

KNN is one of the easiest machine learning algorithms to understand.
The central idea is that similar points tend to have similar outputs, so a new point is predicted from its nearest neighbors. For KNN classification, neighbors vote. For KNN regression, neighbors contribute numeric values.
The most important practical rules are to scale features inside a Pipeline, to tune n_neighbors with cross-validation, and to compare uniform and distance weighting. These recommendations are fully consistent with the current scikit-learn nearest-neighbors, preprocessing, and pipeline documentation.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
# X, y = your data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier())
])
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
    "knn__p": [1, 2]
}
grid = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Test score:", grid.best_estimator_.score(X_test, y_test))
y_pred = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))

Train a KNN classifier on the Iris dataset and report accuracy.
Train KNN on make_moons and visualize the decision boundary.
Tune n_neighbors, weights, and p using GridSearchCV.
Compare weights="uniform" and weights="distance".
Train a KNeighborsRegressor model on a nonlinear synthetic regression dataset.