Machine Learning models often fail not because of algorithms, but because of poor practices during data handling, evaluation, and experimentation.
This tutorial covers the most common ML mistakes, explains why they are dangerous, and shows best practices with Python examples to avoid them.
Data leakage occurs when information from outside the training dataset is incorrectly used to create the model.
The model learns patterns it should never have access to in real life.
| Mistake | Why It's Wrong |
|---|---|
| Scaling before train/test split | Test data influences training |
| Feature contains future info | Impossible in production |
| Target used indirectly as feature | Artificially high accuracy |
| Data leakage via aggregation | Group-level info leaks labels |
```python
# ❌ Wrong: the scaler is fit on all data, so test data influences training
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # uses all data, including the future test set

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2
)
```

```python
# ✅ Right: scaling is fit only on the training data, inside a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])
pipeline.fit(X_train, y_train)
```

✅ Pipelines guarantee no leakage
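The same pipeline pattern extends to cross-validation, where each fold re-fits the scaler on its own training split, so no fold's held-out data ever influences preprocessing. A minimal sketch, using synthetic data from `make_classification` as a stand-in for the tutorial's `X` and `y`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for X, y from the text
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

# cross_val_score clones the whole pipeline per fold,
# so StandardScaler is fit fresh on each training split
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Passing the pipeline (rather than pre-scaled data) to `cross_val_score` is what makes the evaluation leak-free.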
When one class dominates the dataset:
- Class 0: 95%
- Class 1: 5%
A naive model that always predicts Class 0 achieves 95% accuracy, yet it is useless.
"My model has 98% accuracy, so it's good."
Accuracy is misleading for imbalanced data.
```python
import pandas as pd

# Check the class distribution before trusting accuracy
pd.Series(y).value_counts(normalize=True)
```

| Metric | Use Case |
|---|---|
| Precision | Cost of false positives |
| Recall | Cost of false negatives |
| F1-score | Balance |
| ROC-AUC | Ranking quality |
| PR-AUC | Rare events |
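To make the accuracy trap concrete, here is a small sketch with hypothetical labels (95% negatives) and a model that always predicts the majority class: accuracy looks excellent while recall reveals the model finds no positives at all.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical labels: 95% class 0, 5% class 1
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()   # 0.95, looks great
recall = recall_score(y_true, y_pred)  # 0.0, not a single positive found
print(accuracy, recall)
```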
```python
# Option 1: reweight classes directly in the model
from sklearn.linear_model import LogisticRegression

LogisticRegression(class_weight="balanced")
```

```python
# Option 2: oversample the minority class (imbalanced-learn)
from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE().fit_resample(X_train, y_train)
```

```python
# Option 3: lower the decision threshold to catch more positives
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_proba > 0.3).astype(int)
```

| Problem | Wrong Metric |
|---|---|
| Fraud detection | Accuracy |
| Medical diagnosis | Accuracy |
| Regression with outliers | MSE |
| Ranking systems | RMSE |

| Metric | When to Use |
|---|---|
| MAE | Robust to outliers |
| RMSE | Penalize large errors |
| RΒ² | Overall fit |

| Metric | Use |
|---|---|
| Precision | False positives costly |
| Recall | False negatives costly |
| F1-score | Balanced |
| ROC-AUC | Probability ranking |
```python
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)
```

Some algorithms depend on distance or magnitude.
| Needs Scaling | Doesn't |
|---|---|
| KNN | Decision Trees |
| SVM | Random Forest |
| Linear Regression | Gradient Boosting |
| Neural Networks | |
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling is fit inside the pipeline, so SVC always sees properly scaled features
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", SVC())
])
```

| Scaler | Use Case |
|---|---|
| StandardScaler | Normal distributions |
| MinMaxScaler | Neural networks |
| RobustScaler | Outliers present |
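A quick sketch with a hypothetical single-feature dataset containing one extreme outlier shows how the three scalers react differently: RobustScaler, which centers on the median and scales by the IQR, is the least distorted by the outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical feature with one extreme outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(X)
    # MinMaxScaler squashes the normal values near 0; RobustScaler keeps them spread out
    print(type(scaler).__name__, scaled.ravel().round(2))
```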
Getting the same results on every run requires fixing every source of randomness, including scikit-learn's `random_state` parameter:

```python
import numpy as np
import random

# Seed the global random generators
np.random.seed(42)
random.seed(42)

# Pass random_state to every scikit-learn object that involves randomness
train_test_split(X, y, random_state=42)
RandomForestClassifier(random_state=42)
```

✅ Split data before preprocessing
✅ Use pipelines
✅ Check class imbalance
✅ Use correct evaluation metrics
✅ Scale features properly
✅ Avoid data leakage
✅ Ensure reproducibility
✅ Think about deployment early
Machine Learning success is 80% methodology, 20% algorithms.
Avoiding these common mistakes will dramatically improve model reliability, trust, and real-world performance.