Linear Regression is one of the most important and widely used machine learning algorithms for regression tasks. In scikit-learn, the main estimator is LinearRegression, which fits a linear model to minimize the residual sum of squares between the observed targets and the predictions made by the linear approximation.
Linear Regression tries to model the relationship between one or more input features and a continuous target value using a straight-line formula.
In mathematical form:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_px_p$
Scikit-learn’s linear models guide describes this exactly: the predicted value is a linear combination of the features, where the coefficients are stored in coef_ and the intercept is stored in intercept_.
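As a quick illustration, the formula can be evaluated by hand. This is a minimal sketch with made-up weights (`w0`, `w`) and a made-up sample, not values from any fitted model:

```python
import numpy as np

# Made-up learned parameters: intercept w0 and one weight per feature
w0 = 10.0
w = np.array([2.0, -0.5])

# One sample with two feature values x1, x2
x = np.array([3.0, 4.0])

# The prediction is the intercept plus the weighted sum of the features
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 10 + 2*3 + (-0.5)*4 = 14.0
```

After fitting a scikit-learn model, `w0` corresponds to `intercept_` and `w` to `coef_`.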
Suppose you want to predict a house price using features such as size, number of bedrooms, and distance to the city center.
A linear regression model might learn something like: a positive weight for size, a positive weight for bedrooms, and a negative weight for distance.
Then it combines those weighted effects into one numeric prediction.
Linear Regression is popular because it is simple, fast to train, and easy to interpret.
It is often the first model people try when the target is numeric, because it gives clear coefficient-based explanations and trains quickly. Scikit-learn documents LinearRegression as ordinary least squares linear regression.
Simple linear regression uses one input feature.
Example: predicting a house price from its size alone.
Formula:
$\hat{y} = w_0 + w_1x$
Multiple linear regression uses more than one input feature.
Example: predicting a house price from its size, number of bedrooms, and age.
Formula:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + w_3x_3$
Both are handled by the same LinearRegression estimator in scikit-learn.
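As a small sketch (with synthetic, noise-free data and placeholder names like `X1` and `X3`), the same estimator class handles both cases, and `coef_` simply holds one weight per feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Simple linear regression: one feature
X1 = rng.rand(50, 1)
y1 = 2.0 + 5.0 * X1[:, 0]
simple = LinearRegression().fit(X1, y1)

# Multiple linear regression: three features, same estimator
X3 = rng.rand(50, 3)
y3 = 1.0 + X3 @ np.array([2.0, -1.0, 0.5])
multiple = LinearRegression().fit(X3, y3)

print(simple.coef_)    # one weight, close to [5.]
print(multiple.coef_)  # three weights, close to [2., -1., 0.5]
```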
After training, a linear regression model gives you:
coef_: coefficients for each feature
intercept_: the constant term
Scikit-learn’s linear model guide explains that coef_ stores the feature weights and intercept_ stores the independent term.
If a coefficient is positive: increasing that feature increases the predicted value.
If a coefficient is negative: increasing that feature decreases the predicted value.
Example: a coefficient of 2500 on one feature and -800 on another.
This means: each extra unit of the first feature adds 2500 to the prediction, and each extra unit of the second subtracts 800 from it.
This interpretation works best when the model assumptions are reasonably satisfied.
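A hypothetical sketch of this interpretation: synthetic house-style data is generated so that "size" truly adds about 2500 per unit and "distance" subtracts about 800 per unit, and the fitted model recovers those signs and magnitudes (the feature names and numbers are assumptions baked into the synthetic setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
n = 200

# Synthetic features: size helps the price, distance hurts it
size = rng.uniform(50, 200, n)
distance = rng.uniform(1, 30, n)
price = 2500 * size - 800 * distance + rng.randn(n) * 100

X = np.column_stack([size, distance])
model = LinearRegression().fit(X, price)

print(model.coef_)  # close to [2500, -800]
```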
Scikit-learn states that LinearRegression fits a model by minimizing the residual sum of squares.
Residual:
$\text{residual} = \text{actual value} - \text{predicted value}$
So the model tries to make the squared errors as small as possible.
Why square them? Squaring makes all errors positive, penalizes large errors more heavily, and produces a smooth objective with a closed-form solution.
This method is called ordinary least squares, often abbreviated as OLS.
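The residual sum of squares is easy to compute by hand; a tiny sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

residuals = y_true - y_pred   # [0.5, -0.5, 1.0]
rss = np.sum(residuals ** 2)  # 0.25 + 0.25 + 1.0
print(rss)  # 1.5
```

This is the quantity that LinearRegression minimizes over all possible choices of the weights.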
Linear Regression works best when the relationship between the features and the target is approximately linear, the errors have roughly constant variance, and the features are not strongly correlated with each other.
It can still be useful as a baseline even when the relationship is not perfectly linear, because it is quick and interpretable.
In practical machine learning, Linear Regression is often used even when assumptions are not perfectly met. But it is still helpful to know the classic assumptions: linearity, independence of the errors, homoscedasticity (constant error variance), normality of the errors, and no severe multicollinearity.
These assumptions are part of the standard statistical interpretation of linear regression. Scikit-learn focuses more on prediction than formal inference, but the linear structure of the model remains the same.
Scikit-learn’s LinearRegression is aimed at prediction, not full statistical inference like p-values or confidence intervals.
pip install numpy pandas matplotlib scikit-learn
We will use a synthetic dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create synthetic data
rng = np.random.RandomState(42)
X = 2 * rng.rand(200, 1)
y = 4 + 3 * X[:, 0] + rng.randn(200)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Build model
model = LinearRegression()
# Train
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
train_test_split is scikit-learn’s standard utility for splitting data into random train and test subsets, LinearRegression fits the least-squares model, and mean_squared_error plus r2_score are standard regression metrics.
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, label="Actual data")
plt.plot(X_test, y_pred, linewidth=2, label="Regression line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Example")
plt.legend()
plt.show()
This lets you see the fitted straight-line relationship between the feature and the target.
A common workflow is to use a real regression dataset with multiple features.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
The LinearRegression estimator supports multiple features directly, and the train/test workflow should always split before any preprocessing to avoid leakage. Scikit-learn’s common pitfalls guide explicitly warns to split data before preprocessing steps.
coef_table = pd.DataFrame({
"Feature": X.columns,
"Coefficient": model.coef_
})
print(coef_table.sort_values(by="Coefficient", key=abs, ascending=False))
print("Intercept:", model.intercept_)
This is one of the biggest advantages of Linear Regression: it is easy to inspect and explain.
Scikit-learn defines mean_squared_error as the mean squared error regression loss.
Formula idea:
$MSE = \frac{1}{n}\sum (y - \hat{y})^2$
Interpretation: lower is better, and the value is in squared units of the target.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
Scikit-learn provides root_mean_squared_error, added in version 1.4.
Interpretation: RMSE is in the same units as the target, which makes it easier to communicate than MSE.
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", rmse)
Scikit-learn documents r2_score as the coefficient of determination, where the best possible score is 1.0, a constant-mean predictor gets 0.0 in the usual non-constant-target case, and values can be negative if the model is worse than that baseline.
Interpretation:
1.0 = perfect fit
0.0 = as good as predicting the mean
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R²:", r2)
You can also use Mean Absolute Error.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)
MAE is often easier to interpret because it uses absolute errors instead of squared errors. Scikit-learn’s model evaluation guide includes multiple regression metrics for exactly these tradeoffs.
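To make the tradeoff concrete, here is a small sketch with made-up values where one large error inflates RMSE far more than MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 50.0])  # the last target is an outlier
y_pred = np.array([10.5, 11.5, 11.0, 12.0])  # the model misses it badly

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("MAE: ", mae)   # (0.5 + 0.5 + 0.0 + 38.0) / 4 = 9.75
print("RMSE:", rmse)  # sqrt(1444.5 / 4) ≈ 19.0
```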
Scikit-learn’s train_test_split utility is the standard way to create training and test subsets.
The model should be trained on one part of the data and evaluated on separate unseen data. This gives a better estimate of real-world performance.
Bad practice: training and evaluating on the same data, which gives an overly optimistic score.
Better practice: hold out a test set, train on the training portion, and evaluate on the unseen test portion.
Scikit-learn’s common pitfalls guide explicitly warns against data leakage and recommends splitting before preprocessing.
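A minimal sketch of the rule on synthetic data: the scaler is fit on the training portion only, and the fitted transformation is then reused on the test portion (the feature values here are random placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Right: statistics (mean, std) come from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong (leakage): StandardScaler().fit(X) would let test-set
# statistics influence the transformation applied during training.
```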
Plain LinearRegression does not require scaling to work. The least-squares solution is still valid without scaling. But scaling can help when you want to compare coefficient magnitudes across features, when you combine the model with regularization such as Ridge or Lasso, or when you use gradient-based solvers.
This is a practical guideline based on how linear models and preprocessing work in scikit-learn. The OLS solution itself does not depend on distance geometry the way KNN or SVM does.
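A quick sketch checking this claim on synthetic data: for plain least squares with an intercept, scaling changes the coefficients but leaves the predictions numerically unchanged:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 2) * np.array([1.0, 1000.0])  # very different scales
y = X @ np.array([2.0, 0.003]) + rng.randn(100)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
scaled = LinearRegression().fit(scaler.transform(X), y)

# Coefficients differ, but the fitted function is the same
pred_raw = raw.predict(X)
pred_scaled = scaled.predict(scaler.transform(X))
print(np.allclose(pred_raw, pred_scaled))  # True
```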
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
pipeline = Pipeline([
("scaler", StandardScaler()),
("lr", LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
Using a Pipeline keeps preprocessing and prediction together and helps avoid leakage mistakes. Scikit-learn recommends this style for safe workflows.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load your data
df = pd.read_csv("your_data.csv")
X = df.drop("target", axis=1)
y = df["target"]
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
# Inspect coefficients
coef_table = pd.DataFrame({
"Feature": X.columns,
"Coefficient": model.coef_
})
print(coef_table)
print("Intercept:", model.intercept_)
This is the standard scikit-learn regression workflow built from train_test_split, LinearRegression, and regression metrics.
plt.figure(figsize=(7, 7))
plt.scatter(y_test, y_pred)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Actual vs Predicted")
plt.show()
If the model fits well, the points should lie roughly along a diagonal trend.
Residuals are:
$\text{residual} = y_{\text{true}} - y_{\text{pred}}$
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
A good linear model often shows residuals scattered around zero without a strong visible pattern.
Linear Regression is a simple model. If the true relationship is strongly nonlinear, the model may underfit.
Signs: low R² on both the training and test sets, and residual plots with a clear curved pattern.
In such cases, you may need polynomial features, interaction terms, or a nonlinear model such as a tree-based ensemble.
This is a practical modeling inference, not a limitation specific to scikit-learn’s implementation.
If two or more features are highly correlated, the coefficients can become unstable and harder to interpret.
The model may still predict reasonably well, but coefficient interpretation becomes weaker.
This is one reason people often move from plain Linear Regression to regularized variants like Ridge or Lasso.
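A sketch of the instability on synthetic data with two nearly identical features: OLS may split the true effect between them arbitrarily, while Ridge (the alpha value here is an arbitrary choice) shrinks the coefficients toward a more stable, balanced split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n = 100

# Two almost perfectly correlated features
x1 = rng.rand(n)
x2 = x1 + rng.randn(n) * 0.001
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.randn(n) * 0.1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS:  ", ols.coef_)    # individual weights can be large and offsetting
print("Ridge:", ridge.coef_)  # weights shrunk toward a balanced split
```

Note that the combined effect (the sum of the two weights) is estimated well by both models; it is the individual weights that OLS cannot pin down.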
Linear Regression can be sensitive to outliers because OLS minimizes squared error, which heavily penalizes large residuals. That follows directly from scikit-learn’s description of minimizing residual sum of squares.
If outliers are a major issue, consider robust estimators such as HuberRegressor or RANSACRegressor, or investigate and clean the outlying observations before fitting.
Scikit-learn’s linear model examples and guide place ordinary least squares next to regularized alternatives such as Ridge.
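A sketch comparing plain OLS with HuberRegressor (one of scikit-learn's robust estimators) on synthetic data where a few targets are corrupted by large outliers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.randn(100) * 0.5

# Corrupt a few targets with large positive outliers
y[:5] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])  # closer to the true slope of 2.0
```

The OLS fit is dragged upward toward the corrupted points, while the Huber loss caps their influence.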
This tutorial is about plain Linear Regression, but it is useful to know where it fits in the larger family.
Linear Regression is strong because it is simple, fast to train, and easy to interpret through its coefficients.
These strengths follow directly from the OLS model structure in scikit-learn’s linear model docs.
Its main limitations are that it assumes a linear relationship, is sensitive to outliers, and produces unstable coefficients when features are highly correlated.
These are standard implications of using a linear functional form.
Use it when the relationship looks roughly linear, you need a fast baseline, or interpretability matters.
Be cautious when the relationship is strongly nonlinear, features are highly correlated, or the data contains large outliers.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Scikit-learn’s examples show this exact basic usage pattern for ordinary least squares.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Example dataframe
df = pd.DataFrame({
"tv": [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2],
"radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6],
"newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6],
"sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2]
})
X = df.drop("sales", axis=1)
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
This is a classic kind of business-style regression problem: predict a continuous outcome from numeric inputs.
Linear Regression is one of the most important machine learning algorithms for numeric prediction.
Its core idea is simple: predict the target as a weighted sum of the input features, choosing the weights that minimize the squared error.
Scikit-learn defines LinearRegression as ordinary least squares regression and stores the learned weights in coef_ and intercept_.
The most important practical rules are: split the data before any preprocessing, keep preprocessing inside a Pipeline, evaluate on held-out data with more than one metric, and interpret coefficients with care when features are correlated.
These recommendations align with the current scikit-learn documentation and examples for linear models and regression metrics.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# X, y = your data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
Train a LinearRegression model on a synthetic dataset and report MSE and R².
Use a real multi-feature regression dataset and inspect the learned coefficients.
Plot the regression line for a simple one-feature dataset.
Create an actual-vs-predicted plot and a residual plot.
Compare plain LinearRegression with a scaled pipeline version.