Machine Learning models often fail not because of algorithms, but because of poor practices during data handling, evaluation, and experimentation.
This tutorial covers the most common ML mistakes, explains why they are dangerous, and shows best practices with Python examples to avoid them.
Data leakage occurs when information from outside the training dataset is incorrectly used to create the model.
The model learns patterns it should never have access to in real life.
| Mistake | Why It's Wrong |
|---|---|
| Scaling before train/test split | Test data influences training |
| Feature contains future info | Impossible in production |
| Target used indirectly as feature | Artificially high accuracy |
| Data leakage via aggregation | Group-level info leaks labels |
```python
# ❌ Wrong: the scaler is fit on all data, so test data influences training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # uses all data, including the future test set

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)
```

```python
# ✅ Right: scaling is fit only on the training data, inside a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])
pipeline.fit(X_train, y_train)
```

✅ Pipelines guarantee no leakage
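The same pipeline pattern extends to cross-validation, where each fold re-fits the scaler on its own training split, so no fold's held-out data ever influences preprocessing. A minimal sketch, using synthetic data from `make_classification` as a stand-in for the tutorial's `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for X, y from the text
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

# cross_val_score clones the whole pipeline per fold,
# so StandardScaler is fit fresh on each training split
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Passing the pipeline (rather than pre-scaled data) to `cross_val_score` is what makes the evaluation leak-free.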
When one class dominates the dataset:
- Class 0: 95%
- Class 1: 5%
A naive model that always predicts Class 0 achieves 95% accuracy, yet it is useless.
"My model has 98% accuracy, so it's good."
Accuracy is misleading for imbalanced data.
```python
import pandas as pd

# Check the class distribution before trusting accuracy
pd.Series(y).value_counts(normalize=True)
```

| Metric | Use Case |
|---|---|
| Precision | Cost of false positives |
| Recall | Cost of false negatives |
| F1-score | Balance |
| ROC-AUC | Ranking quality |
| PR-AUC | Rare events |
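To make the accuracy trap concrete, here is a small sketch with hypothetical labels (95% negatives) and a model that always predicts the majority class: accuracy looks excellent while recall reveals the model finds no positives at all.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels: 95% class 0, 5% class 1
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()   # 0.95, looks great
recall = recall_score(y_true, y_pred)  # 0.0, not a single positive found
print(accuracy, recall)
```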
```python
# Option 1: reweight classes directly in the model
from sklearn.linear_model import LogisticRegression

LogisticRegression(class_weight="balanced")
```

```python
# Option 2: oversample the minority class (imbalanced-learn)
from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE().fit_resample(X_train, y_train)
```

```python
# Option 3: lower the decision threshold to catch more positives
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_proba > 0.3).astype(int)
```

| Problem | Wrong Metric |
|---|---|
| Fraud detection | Accuracy |
| Medical diagnosis | Accuracy |
| Regression with outliers | MSE |
| Ranking systems | RMSE |

| Metric | When to Use |
|---|---|
| MAE | Robust to outliers |
| RMSE | Penalize large errors |
| RΒ² | Overall fit |

| Metric | Use |
|---|---|
| Precision | False positives costly |
| Recall | False negatives costly |
| F1-score | Balanced |
| ROC-AUC | Probability ranking |
```python
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)
```

Some algorithms depend on distance or magnitude.
| Needs Scaling | Doesn't |
|---|---|
| KNN | Decision Trees |
| SVM | Random Forest |
| Linear Regression | Gradient Boosting |
| Neural Networks | |
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is fit inside the pipeline, so SVC always sees properly scaled features
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVC())
])
```

| Scaler | Use Case |
|---|---|
| StandardScaler | Normal distributions |
| MinMaxScaler | Neural networks |
| RobustScaler | Outliers present |
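A quick sketch with a hypothetical single-feature dataset containing one extreme outlier shows how the three scalers react differently: RobustScaler, which centers on the median and scales by the IQR, is the least distorted by the outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature with one extreme outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X)
    # MinMaxScaler squashes the normal values near 0; RobustScaler keeps them spread out
    print(type(scaler).__name__, scaled.ravel().round(2))
```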
Getting the same results on every run requires fixing every source of randomness, including scikit-learn's `random_state` parameter:

```python
import numpy as np
import random

# Seed the global random generators
np.random.seed(42)
random.seed(42)

# Pass random_state to every scikit-learn object that involves randomness
train_test_split(X, y, random_state=42)
RandomForestClassifier(random_state=42)
```

✅ Split data before preprocessing
✅ Use pipelines
✅ Check class imbalance
✅ Use correct evaluation metrics
✅ Scale features properly
✅ Avoid data leakage
✅ Ensure reproducibility
✅ Think about deployment early
Machine Learning success is 80% methodology, 20% algorithms.
Avoiding these common mistakes will dramatically improve model reliability, trust, and real-world performance.