Random Forests are supervised machine learning algorithms used for both classification and regression. In scikit-learn, the main classes are RandomForestClassifier and RandomForestRegressor. A random forest is a meta-estimator that fits many decision trees on different subsamples of the data and uses averaging to improve predictive accuracy and reduce overfitting compared with a single tree.
A Random Forest builds many decision trees instead of relying on just one.
For classification: each tree predicts a class, and the forest combines those predictions (scikit-learn averages the trees' probabilistic predictions rather than taking a hard majority vote).
For regression: the forest averages the numeric predictions of its trees.
This averaging is the main reason Random Forests are usually more stable and more accurate than a single decision tree. Scikit-learn explicitly describes random forests as ensembles that use averaging to improve predictive accuracy and control overfitting.
Random Forests are popular because they are:
- accurate out of the box on many tabular datasets,
- robust to noise and outliers,
- usable for both classification and regression,
- insensitive to feature scaling.
They are often a strong baseline for many practical machine learning problems. They belong to scikit-learn’s ensemble methods family, which combines multiple predictors to improve robustness and generalization.
The method is “random” for two main reasons.
1. Row sampling. Each tree is trained on a different subsample of the data. In scikit-learn, this is controlled by bootstrap=True by default, and the sub-sample size can also be controlled with max_samples.
2. Feature subsampling. At each split, the tree considers only a subset of features rather than all of them. This randomness helps the trees become less correlated with each other, which improves the ensemble effect. The main control for this is max_features.
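Both sources of randomness appear directly in the constructor. A minimal sketch (the max_samples value of 0.8 here is an arbitrary illustration, not a recommended default):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,        # row randomness: each tree sees a bootstrap sample
    max_samples=0.8,       # ...drawn from 80% of the training rows
    max_features="sqrt",   # feature randomness: sqrt(n_features) per split
    random_state=42,
)
model.fit(X, y)
print(len(model.estimators_))  # 100 fitted trees
```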
So the forest becomes strong because:
- each tree sees a slightly different view of the data,
- the trees therefore make partly independent errors,
- averaging (or voting) across trees cancels much of that error.
A Decision Tree is one tree.
A Random Forest is many trees combined together.
Scikit-learn’s documentation emphasizes that random forests improve predictive accuracy and control overfitting by averaging many trees.
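As a quick sanity check, we can compare a single tree against a forest on the built-in Breast Cancer dataset (a sketch; the exact scores depend on the split):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One tree vs. an ensemble of 100 trees on the same split
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Single tree:", tree.score(X_test, y_test))
print("Forest:     ", forest.score(X_test, y_test))
```

On most splits of this dataset the forest matches or beats the single tree.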
Random Forests are based on decision trees, and trees split on thresholds such as:
feature <= value
feature > value
They do not rely on geometric distances like KNN or margin geometry like SVM. Because of this, feature scaling is usually not necessary for Random Forests. This follows from the tree-based structure described in scikit-learn’s ensemble and tree documentation.
That is one of the practical advantages of tree-based models.
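To illustrate, we can fit the same forest on raw and standardized features (a sketch; because tree splits depend only on the ordering of feature values, the scores should be identical or nearly so):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Forest trained on raw features
raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Same forest trained on standardized features
scaler = StandardScaler().fit(X_train)
scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    scaler.transform(X_train), y_train
)

print("Unscaled:", raw.score(X_test, y_test))
print("Scaled:  ", scaled.score(scaler.transform(X_test), y_test))
```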
pip install numpy pandas matplotlib scikit-learn
We will use the Breast Cancer dataset from scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Build model
model = RandomForestClassifier(
n_estimators=100,
random_state=42
)
# Train
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
In scikit-learn, n_estimators is the number of trees in the forest, and RandomForestClassifier fits multiple decision tree classifiers on different sub-samples of the dataset.
Random Forests can also output probabilities with predict_proba.
proba = model.predict_proba(X_test[:5])
print(proba)
For classification, the forest combines information from its trees to produce class probabilities and class predictions. The classifier API in scikit-learn includes predict, predict_proba, and score.
RandomForestClassifier
Important parameters include:
n_estimators, criterion, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, max_samples, oob_score, n_jobs, random_state, and class_weight.
n_estimators: Number of trees in the forest.
max_depth: Maximum depth of each tree.
max_features: Number of features considered at each split.
This is one of the key sources of randomness in the forest.
bootstrap: Whether each tree is trained on a bootstrap sample. Default is True.
oob_score: Whether to use out-of-bag samples for validation. This is only available when bootstrapping is enabled.
When bootstrap sampling is used, each tree is trained on a sample drawn with replacement from the training set. That means some training points are left out for that tree. These are called out-of-bag samples.
You can use them to estimate performance without needing a separate validation set for every tree.
model = RandomForestClassifier(
n_estimators=200,
oob_score=True,
random_state=42
)
model.fit(X_train, y_train)
print("OOB Score:", model.oob_score_)
Scikit-learn exposes oob_score directly in the Random Forest API when bootstrapping is enabled.
Random Forests are a form of bagging.
Bagging (bootstrap aggregating) means:
- draw many bootstrap samples from the training set,
- train one model on each sample,
- aggregate the models' predictions by voting or averaging.
Averaging reduces variance. This is especially helpful for decision trees because single trees can vary a lot from one sample to another. Random forests add extra randomness through feature subsampling on top of bagging.
In a Random Forest:
- each tree is a bagged decision tree,
- each split additionally considers only a random subset of features,
- the final prediction aggregates all trees.
Adding trees usually helps until performance levels off, though training and prediction become slower. n_estimators is therefore one of the most important practical controls in scikit-learn’s API.
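One way to watch performance level off is to sweep n_estimators and read off the OOB score (a sketch on the Breast Cancer data; the exact numbers vary slightly between scikit-learn versions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# OOB score typically rises quickly, then flattens as trees are added
for n in [25, 50, 100, 200, 400]:
    clf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=42)
    clf.fit(X, y)
    print(f"{n:4d} trees -> OOB score {clf.oob_score_:.3f}")
```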
Random Forests in scikit-learn expose feature_importances_.
import pandas as pd
importance = pd.Series(model.feature_importances_, index=data.feature_names)
print(importance.sort_values(ascending=False))
Scikit-learn’s forest importance example shows how a forest of trees can be used to evaluate feature importance, and it also visualizes variability across trees.
Feature importance helps you understand which variables the forest relied on the most.
import matplotlib.pyplot as plt
importance = importance.sort_values(ascending=True)
plt.figure(figsize=(8, 6))
importance.plot(kind="barh")
plt.title("Random Forest Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()
The scikit-learn example on forest importances uses impurity-based importances and shows inter-tree variability with error bars.
Scikit-learn also provides an example comparing impurity-based Random Forest importances with permutation importance, and it warns that impurity-based importance can inflate the importance of numerical features in some settings.
So built-in feature importance is useful, but it should be interpreted carefully.
from sklearn.inspection import permutation_importance
result = permutation_importance(
model, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
perm_importance = pd.Series(result.importances_mean, index=data.feature_names)
print(perm_importance.sort_values(ascending=False))
Permutation importance is often a better diagnostic when you want a more reliable picture of feature influence. This is an inference grounded in scikit-learn’s comparison example.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
"n_estimators": [100, 200, 300],
"max_depth": [None, 5, 10, 20],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
"max_features": ["sqrt", "log2"]
}
grid = GridSearchCV(
estimator=RandomForestClassifier(random_state=42),
param_grid=param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1
)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)
print("Best CV Score:", grid.best_score_)
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))
Instead of guessing the best forest size or depth, you let cross-validation search for a stronger combination.
The most influential tuning parameters in practice are usually:
n_estimators, max_depth, min_samples_split, min_samples_leaf, and max_features.
A strong default starting point for classification is often something like:
model = RandomForestClassifier(
n_estimators=200,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
max_features="sqrt",
random_state=42,
n_jobs=-1
)
This is not universally optimal, but it is a sensible baseline built from the core parameters exposed in the current scikit-learn API.
RandomForestRegressor
Now let us use a Random Forest for regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Synthetic data
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel()
# Add noise
y[::5] += 0.5 - rng.rand(40)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(
n_estimators=200,
random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
Scikit-learn defines RandomForestRegressor analogously to the classifier: it fits many decision tree regressors on different subsamples and uses averaging to improve predictive accuracy and control overfitting.
X_plot = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)
y_plot = model.predict(X_plot)
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label="Data")
plt.plot(X_plot, y_plot, linewidth=2, label="Random Forest prediction")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Random Forest Regression")
plt.legend()
plt.show()
Random Forest regression often gives smoother and more robust predictions than a single decision tree regressor because it averages many trees. That is an inference directly supported by the ensemble definition in the scikit-learn API.
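To make that concrete, we can compare a single tree regressor against a forest on the same kind of noisy sine data (a sketch; the exact errors depend on the seed and split):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Same noisy sine setup as above
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 0.5 - rng.rand(40)  # add noise to every 5th point

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)

print("Tree MSE:  ", mean_squared_error(y_test, tree.predict(X_test)))
print("Forest MSE:", mean_squared_error(y_test, forest.predict(X_test)))
```

A single unpruned tree tends to chase the noisy points, while the forest averages that noise away.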
RandomForestRegressor shares many structural parameters with the classifier version, including:
n_estimators, criterion, max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, max_samples, oob_score, and n_jobs.
Example:
model = RandomForestRegressor(
n_estimators=300,
max_depth=10,
min_samples_leaf=2,
random_state=42,
n_jobs=-1
)
import pandas as pd
df = pd.read_csv("your_data.csv")
X = df.drop("target", axis=1)
y = df["target"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
from sklearn.model_selection import GridSearchCV
param_grid = {
"n_estimators": [100, 200],
"max_depth": [None, 5, 10],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
"max_features": ["sqrt", "log2"]
}
grid = GridSearchCV(
model,
param_grid=param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1
)
grid.fit(X_train, y_train)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Params:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
new_samples = X_test.iloc[:5]
predictions = best_model.predict(new_samples)
print(predictions)
This workflow follows scikit-learn’s standard estimator and model selection pattern for ensemble estimators.
For classification, common metrics are:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Random forests support binary and multiclass classification through the same RandomForestClassifier API.
For regression, common metrics are:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
print("R²:", r2)
Random forest regressors also support multi-output regression natively, as shown in scikit-learn’s example comparing random forest regression with a multi-output meta-estimator.
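Multi-output means the target can have several columns, all predicted by one forest. A minimal sketch with two synthetic targets (the shapes are the point, not the specific functions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = np.sort(6 * rng.rand(300, 1), axis=0)
# Two targets predicted jointly by a single forest
Y = np.column_stack([np.sin(X).ravel(), np.cos(X).ravel()])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, Y)

pred = model.predict(X[:3])
print(pred.shape)  # one row per sample, one column per target
```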
Random Forests are strong because they are:
- accurate on a wide range of tabular problems,
- far less prone to overfitting than a single tree,
- tolerant of unscaled and mixed-scale features,
- easy to parallelize across trees.
These strengths follow directly from scikit-learn’s description of random forests as ensembles of trees that improve predictive accuracy and control overfitting.
They also have limitations:
- slower to train and predict than a single tree, especially with many trees,
- larger in memory,
- harder to interpret than one decision tree,
- unable to extrapolate beyond the range of the training targets in regression.
Scikit-learn’s comparison examples also show that computation time matters when comparing forest methods with alternatives such as Histogram Gradient Boosting.
Bad:
model = RandomForestClassifier(n_estimators=10, random_state=42)
This may work, but the predictions can be less stable.
Better:
model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
Because the forest prediction is an average across trees, too few trees can leave performance unnecessarily noisy. This is a practical inference grounded in the ensemble formulation.
max_depth and leaf controls
Even though Random Forests reduce overfitting compared with a single tree, forests can still become unnecessarily complex.
Good parameters to tune:
max_depth, min_samples_split, min_samples_leaf, and max_features.
Impurity-based importance is convenient, but scikit-learn’s permutation importance comparison shows it can overstate the importance of some feature types. Use it carefully, and compare with permutation importance when interpretability matters.
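Returning to the complexity controls: one way to see their effect is to count the nodes in a constrained versus an unconstrained forest (a sketch; node counts depend on the data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

def total_nodes(forest):
    # Sum the node counts of every fitted tree in the forest
    return sum(est.tree_.node_count for est in forest.estimators_)

full = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
pruned = RandomForestClassifier(
    n_estimators=100, max_depth=5, min_samples_leaf=5, random_state=42
).fit(X, y)

print("Unconstrained nodes:", total_nodes(full))
print("Constrained nodes:  ", total_nodes(pruned))
```

Fewer nodes means a smaller, faster model, often with little or no loss in accuracy.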
Use Random Forests when:
- you have tabular data with nonlinear relationships,
- you want a strong baseline with little preprocessing,
- you want feature importance estimates alongside predictions.
Be careful with Random Forests when:
- the dataset is very large and training or prediction time matters,
- the model must be easy to explain,
- you need regression predictions outside the training range.
In those cases, simpler models or other ensemble methods may be worth comparing. Scikit-learn’s ensemble examples include comparisons with Histogram Gradient Boosting specifically on score and computation time.
For classification:
RandomForestClassifier(
n_estimators=200,
max_features="sqrt",
random_state=42,
n_jobs=-1
)
For regression:
RandomForestRegressor(
n_estimators=200,
max_features=1.0,
random_state=42,
n_jobs=-1
)
These are sensible baseline starting points based on the current parameterized API for the stable scikit-learn random forest estimators.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Load
data = load_iris()
X, y = data.data, data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Model
model = RandomForestClassifier(random_state=42)
# Search space
param_grid = {
"n_estimators": [100, 200, 300],
"max_depth": [None, 3, 5, 10],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
"max_features": ["sqrt", "log2"]
}
# Grid search
grid = GridSearchCV(
model,
param_grid=param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1
)
grid.fit(X_train, y_train)
# Evaluate
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Best Parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
This matches scikit-learn’s general estimator, ensemble, and cross-validation workflow.
Random Forests are one of the most practical and powerful classical machine learning algorithms.
The core idea is: train many decorrelated decision trees on random subsamples of the rows and features, then combine their predictions.
For classification, they vote.
For regression, they average.
This improves predictive accuracy and helps control overfitting compared with a single decision tree.
The most important practical rules are:
- use enough trees (n_estimators),
- control tree complexity with max_depth, min_samples_split, and min_samples_leaf,
- tune max_features.
These recommendations align with the current scikit-learn Random Forest API, user guide, and feature-importance examples.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
# X, y = your data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
param_grid = {
"n_estimators": [100, 200],
"max_depth": [None, 5, 10],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4],
"max_features": ["sqrt", "log2"]
}
grid = GridSearchCV(
model,
param_grid=param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Test score:", grid.best_estimator_.score(X_test, y_test))
y_pred = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred))
Train a RandomForestClassifier on the Iris dataset and report accuracy.
Train a Random Forest on the Breast Cancer dataset and display feature importances.
Tune n_estimators, max_depth, and max_features using GridSearchCV.
Compare a single DecisionTreeClassifier with a RandomForestClassifier.
Train a RandomForestRegressor on a nonlinear synthetic regression dataset.