Model optimization is a core skill in machine learning.
A good model is not just accurate on training data — it must generalize well to unseen data.
In this tutorial, you’ll learn how to diagnose model problems and systematically improve performance using proven techniques.
A model underfits when it is too simple to capture patterns in the data.
**Symptoms:** high error on the training data itself, and consequently high error on test data too.

**Common causes:** a model that is too simple for the task, too few or uninformative features, or overly strong regularization.

**Example:** fitting a straight line to data with a clearly nonlinear relationship (see the sketch below).
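A minimal sketch of that failure mode, using synthetic data (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic nonlinear data: y depends on x quadratically
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line is too simple for this relationship,
# so it scores poorly even on the data it was trained on
linear = LinearRegression().fit(x, y)
print("Training R^2:", linear.score(x, y))  # low despite seeing this data
```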
A model overfits when it learns noise instead of patterns.
**Symptoms:** very low training error but much higher test error, i.e. a large train/test gap.

**Common causes:** a model that is too complex, too little training data, too many features, or noisy labels.

**Example:** an unconstrained decision tree that memorizes every training sample (see the sketch below).
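The mirror image, again as a small sketch on synthetic data (names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy dataset makes it easy to memorize noise
X_demo, y_demo = make_classification(n_samples=200, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# An unconstrained tree fits the training set almost perfectly...
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Train accuracy:", tree.score(X_tr, y_tr))  # close to 1.0
print("Test accuracy:", tree.score(X_te, y_te))   # ...but generalizes worse
```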
| Model Complexity | Training Error | Test Error |
|---|---|---|
| Too simple | High | High |
| Optimal | Low | Low |
| Too complex | Very Low | High |
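You can reproduce this pattern empirically with `validation_curve`, sweeping a complexity knob such as a tree's `max_depth`. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, flip_y=0.1, random_state=0)

# Cross-validated train/validation scores at increasing tree depth
depths = [1, 3, 5, 10, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.2f}  val={va:.2f}")
```

Typically, training accuracy keeps climbing with depth while validation accuracy peaks and then falls off: the three rows of the table, traced out by a single parameter.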
The bias–variance tradeoff explains why models fail to generalize.
**Bias:** error due to wrong assumptions in the model.

**Variance:** error due to sensitivity to fluctuations in the training data.
| Problem | Bias | Variance |
|---|---|---|
| Underfitting | High | Low |
| Overfitting | Low | High |
| Good Model | Balanced | Balanced |
Goal: find the sweet spot where the combined error from bias and variance is lowest.
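For squared-error loss this intuition has an exact form: the expected test error decomposes as

$$\mathbb{E}\big[(\hat{y} - y)^2\big] = \text{Bias}^2 + \text{Variance} + \sigma^2,$$

where $\sigma^2$ is irreducible noise. Pushing bias down (more complexity) tends to push variance up, and vice versa; only the sum can be minimized.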
A single train/test split can be misleading: the score depends on which rows happen to land in the test set. Cross-validation averages over several splits and gives a more reliable estimate of model performance.
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load a small benchmark dataset
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=200)

# 5-fold cross-validation: five train/validation splits, five scores
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
```
Hyperparameters are settings you choose before training; unlike model parameters (weights, coefficients), they are not learned from the data.
Examples:
- `max_depth` (Decision Tree)
- `C` (Logistic Regression)
- `n_neighbors` (KNN)
- `learning_rate` (Boosting)

Default values are not optimal for all datasets.
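The distinction in code, as a quick sketch: hyperparameters go into the constructor, while learned parameters only exist after `.fit()`:

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters: chosen by you, before training
clf = LogisticRegression(C=0.5, max_iter=200)

# Learned parameters: estimated from the data by .fit()
clf.fit(X, y)
print(clf.coef_)  # fitted coefficients, learned rather than set by hand
```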
Hyperparameter tuning searches for values that suit your specific dataset, typically improving accuracy and controlling over- or underfitting.
Grid search (`GridSearchCV`) tries every combination of the hyperparameter values you specify.
✔️ Exhaustive
❌ Computationally expensive
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Every combination of these values is tried: 3 * 3 * 2 = 18 per fold
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring='accuracy',
    n_jobs=-1             # use all CPU cores
)

grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
```
Randomized search (`RandomizedSearchCV`) samples a fixed number of random combinations from hyperparameter distributions.
✔️ Faster
✔️ Scales better
❌ Not exhaustive
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Distributions to sample from, rather than fixed grids
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,            # try only 20 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42       # makes the sampling reproducible
)

random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)
```
| Feature | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Speed | Slow | Fast |
| Exhaustive | Yes | No |
| Scalability | Poor | Good |
| Best for | Small spaces | Large spaces |
✅ Always use cross-validation rather than a single split
✅ Tune hyperparameters after preprocessing is fixed
✅ Optimize a metric that matches the problem (e.g., F1 for imbalanced classes)
✅ Never tune on the test set; hold it out for the final evaluation (see the sketch after this list)
✅ Combine tuning with feature engineering
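A minimal pattern for the last two points, keeping tuning and final evaluation strictly separate (a sketch reusing `X`, `y` from earlier):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set that tuning never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Cross-validated tuning happens inside the training portion only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'max_depth': [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final estimate
print("Final test accuracy:", search.score(X_test, y_test))
```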