Model optimization is a core skill in machine learning.
A good model is not just accurate on training data — it must generalize well to unseen data.
In this tutorial, you’ll learn how to diagnose model problems and systematically improve performance using proven techniques.
A model underfits when it is too simple to capture patterns in the data.
**Symptoms:** high error on the training data itself, and consequently high error on test data too.

**Common causes:** a model that is too simple for the task, too few or uninformative features, or overly strong regularization.

**Example:** fitting a straight line to data with a clearly nonlinear relationship (see the sketch below).
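A minimal sketch of that failure mode, using synthetic data (all names here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic nonlinear data: y depends on x quadratically
rng = np.random.RandomState(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line is too simple for this relationship,
# so it scores poorly even on the data it was trained on
linear = LinearRegression().fit(x, y)
print("Training R^2:", linear.score(x, y))  # low despite seeing this data
```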
A model overfits when it learns noise instead of patterns.
**Symptoms:** very low training error but much higher test error, i.e. a large train/test gap.

**Common causes:** a model that is too complex, too little training data, too many features, or noisy labels.

**Example:** an unconstrained decision tree that memorizes every training sample (see the sketch below).
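The mirror image, again as a small sketch on synthetic data (names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small, noisy dataset makes it easy to memorize noise
X_demo, y_demo = make_classification(n_samples=200, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# An unconstrained tree fits the training set almost perfectly...
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Train accuracy:", tree.score(X_tr, y_tr))  # close to 1.0
print("Test accuracy:", tree.score(X_te, y_te))   # ...but generalizes worse
```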
| Model Complexity | Training Error | Test Error |
|---|---|---|
| Too simple | High | High |
| Optimal | Low | Low |
| Too complex | Very Low | High |
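You can reproduce this pattern empirically with `validation_curve`, sweeping a complexity knob such as a tree's `max_depth`. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=500, flip_y=0.1, random_state=0)

# Cross-validated train/validation scores at increasing tree depth
depths = [1, 3, 5, 10, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"depth={d:2d}  train={tr:.2f}  val={va:.2f}")
```

Typically, training accuracy keeps climbing with depth while validation accuracy peaks and then falls off: the three rows of the table, traced out by a single parameter.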
The bias–variance tradeoff explains why models fail to generalize.
**Bias:** error due to wrong assumptions in the model.

**Variance:** error due to sensitivity to fluctuations in the training data.
| Problem | Bias | Variance |
|---|---|---|
| Underfitting | High | Low |
| Overfitting | Low | High |
| Good Model | Balanced | Balanced |
Goal: find the sweet spot where the combined error from bias and variance is lowest.
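For squared-error loss this intuition has an exact form: the expected test error decomposes as

$$\mathbb{E}\big[(\hat{y} - y)^2\big] = \text{Bias}^2 + \text{Variance} + \sigma^2,$$

where $\sigma^2$ is irreducible noise. Pushing bias down (more complexity) tends to push variance up, and vice versa; only the sum can be minimized.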
A single train/test split can be misleading: the score depends on which rows happen to land in the test set. Cross-validation averages over several splits and gives a more reliable estimate of model performance.
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load a small benchmark dataset
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=200)

# 5-fold cross-validation: five train/validation splits, five scores
scores = cross_val_score(model, X, y, cv=5)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
```
Hyperparameters are settings you choose before training; unlike model parameters (weights, coefficients), they are not learned from the data.
Examples:
- `max_depth` (Decision Tree)
- `C` (Logistic Regression)
- `n_neighbors` (KNN)
- `learning_rate` (Boosting)

Default values are not optimal for all datasets.
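The distinction in code, as a quick sketch: hyperparameters go into the constructor, while learned parameters only exist after `.fit()`:

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters: chosen by you, before training
clf = LogisticRegression(C=0.5, max_iter=200)

# Learned parameters: estimated from the data by .fit()
clf.fit(X, y)
print(clf.coef_)  # fitted coefficients, learned rather than set by hand
```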
Hyperparameter tuning searches for values that suit your specific dataset, typically improving accuracy and controlling over- or underfitting.
Grid search (`GridSearchCV`) tries every combination of the hyperparameter values you specify.
✔️ Exhaustive
❌ Computationally expensive
```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Every combination of these values is tried: 3 * 3 * 2 = 18 per fold
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5]
}

model = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring='accuracy',
    n_jobs=-1             # use all CPU cores
)

grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
```
Randomized search (`RandomizedSearchCV`) samples a fixed number of random combinations from hyperparameter distributions.
✔️ Faster
✔️ Scales better
❌ Not exhaustive
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Distributions to sample from, rather than fixed grids
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 10)
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,            # try only 20 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42       # makes the sampling reproducible
)

random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)
```
| Feature | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Speed | Slow | Fast |
| Exhaustive | Yes | No |
| Scalability | Poor | Good |
| Best for | Small spaces | Large spaces |
✅ Always use cross-validation rather than a single split
✅ Tune hyperparameters after preprocessing is fixed
✅ Optimize a metric that matches the problem (e.g., F1 for imbalanced classes)
✅ Never tune on the test set; hold it out for the final evaluation (see the sketch after this list)
✅ Combine tuning with feature engineering
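A minimal pattern for the last two points, keeping tuning and final evaluation strictly separate (a sketch reusing `X`, `y` from earlier):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set that tuning never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Cross-validated tuning happens inside the training portion only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'max_depth': [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final estimate
print("Final test accuracy:", search.score(X_test, y_test))
```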