Linear Regression is one of the most important and widely used machine learning algorithms for regression tasks. In scikit-learn, the main estimator is LinearRegression, which fits a linear model to minimize the residual sum of squares between the observed targets and the predictions made by the linear approximation.
Linear Regression tries to model the relationship between one or more input features and a continuous target value using a straight-line formula.
In mathematical form:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + \dots + w_px_p$
Scikit-learn’s linear models guide describes this exactly: the predicted value is a linear combination of the features, where the coefficients are stored in coef_ and the intercept is stored in intercept_.
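As a quick illustration, the formula can be evaluated by hand. This is a minimal sketch with made-up weights (`w0`, `w`) and a made-up sample, not values from any fitted model:

```python
import numpy as np

# Made-up learned parameters: intercept w0 and one weight per feature
w0 = 10.0
w = np.array([2.0, -0.5])

# One sample with two feature values x1, x2
x = np.array([3.0, 4.0])

# The prediction is the intercept plus the weighted sum of the features
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 10 + 2*3 + (-0.5)*4 = 14.0
```

After fitting a scikit-learn model, `w0` corresponds to `intercept_` and `w` to `coef_`.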
Suppose you want to predict a house price using features such as size, number of bedrooms, and distance to the city center.
A linear regression model might learn something like: a positive weight for size, a positive weight for bedrooms, and a negative weight for distance.
Then it combines those weighted effects into one numeric prediction.
Linear Regression is popular because it is simple, fast to train, and easy to interpret.
It is often the first model people try when the target is numeric, because it gives clear coefficient-based explanations and trains quickly. Scikit-learn documents LinearRegression as ordinary least squares linear regression.
Simple linear regression uses one input feature.
Example: predicting a house price from its size alone.
Formula:
$\hat{y} = w_0 + w_1x$
Multiple linear regression uses more than one input feature.
Example: predicting a house price from its size, number of bedrooms, and age.
Formula:
$\hat{y} = w_0 + w_1x_1 + w_2x_2 + w_3x_3$
Both are handled by the same LinearRegression estimator in scikit-learn.
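As a small sketch (with synthetic, noise-free data and placeholder names like `X1` and `X3`), the same estimator class handles both cases, and `coef_` simply holds one weight per feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Simple linear regression: one feature
X1 = rng.rand(50, 1)
y1 = 2.0 + 5.0 * X1[:, 0]
simple = LinearRegression().fit(X1, y1)

# Multiple linear regression: three features, same estimator
X3 = rng.rand(50, 3)
y3 = 1.0 + X3 @ np.array([2.0, -1.0, 0.5])
multiple = LinearRegression().fit(X3, y3)

print(simple.coef_)    # one weight, close to [5.]
print(multiple.coef_)  # three weights, close to [2., -1., 0.5]
```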
After training, a linear regression model gives you:
coef_: coefficients for each feature
intercept_: the constant term
Scikit-learn’s linear model guide explains that coef_ stores the feature weights and intercept_ stores the independent term.
If a coefficient is positive: increasing that feature increases the predicted value.
If a coefficient is negative: increasing that feature decreases the predicted value.
Example: a coefficient of 2500 on one feature and -800 on another.
This means: each extra unit of the first feature adds 2500 to the prediction, and each extra unit of the second subtracts 800 from it.
This interpretation works best when the model assumptions are reasonably satisfied.
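A hypothetical sketch of this interpretation: synthetic house-style data is generated so that "size" truly adds about 2500 per unit and "distance" subtracts about 800 per unit, and the fitted model recovers those signs and magnitudes (the feature names and numbers are assumptions baked into the synthetic setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
n = 200

# Synthetic features: size helps the price, distance hurts it
size = rng.uniform(50, 200, n)
distance = rng.uniform(1, 30, n)
price = 2500 * size - 800 * distance + rng.randn(n) * 100

X = np.column_stack([size, distance])
model = LinearRegression().fit(X, price)

print(model.coef_)  # close to [2500, -800]
```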
Scikit-learn states that LinearRegression fits a model by minimizing the residual sum of squares.
Residual:
$\text{residual} = \text{actual value} - \text{predicted value}$
So the model tries to make the squared errors as small as possible.
Why square them? Squaring makes all errors positive, penalizes large errors more heavily, and produces a smooth objective with a closed-form solution.
This method is called ordinary least squares, often abbreviated as OLS.
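The residual sum of squares is easy to compute by hand; a tiny sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

residuals = y_true - y_pred   # [0.5, -0.5, 1.0]
rss = np.sum(residuals ** 2)  # 0.25 + 0.25 + 1.0
print(rss)  # 1.5
```

This is the quantity that LinearRegression minimizes over all possible choices of the weights.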
Linear Regression works best when the relationship between the features and the target is approximately linear, the errors have roughly constant variance, and the features are not strongly correlated with each other.
It can still be useful as a baseline even when the relationship is not perfectly linear, because it is quick and interpretable.
In practical machine learning, Linear Regression is often used even when assumptions are not perfectly met. But it is still helpful to know the classic assumptions: linearity, independence of the errors, homoscedasticity (constant error variance), normality of the errors, and no severe multicollinearity.
These assumptions are part of the standard statistical interpretation of linear regression. Scikit-learn focuses more on prediction than formal inference, but the linear structure of the model remains the same.
Scikit-learn’s LinearRegression is aimed at prediction, not full statistical inference like p-values or confidence intervals.
pip install numpy pandas matplotlib scikit-learn
We will use a synthetic dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Create synthetic data
rng = np.random.RandomState(42)
X = 2 * rng.rand(200, 1)
y = 4 + 3 * X[:, 0] + rng.randn(200)
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Build model
model = LinearRegression()
# Train
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
train_test_split is scikit-learn’s standard utility for splitting data into random train and test subsets, LinearRegression fits the least-squares model, and mean_squared_error plus r2_score are standard regression metrics.
plt.figure(figsize=(8, 6))
plt.scatter(X_test, y_test, label="Actual data")
plt.plot(X_test, y_pred, linewidth=2, label="Regression line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression Example")
plt.legend()
plt.show()
This lets you see the fitted straight-line relationship between the feature and the target.
A common workflow is to use a real regression dataset with multiple features.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
The LinearRegression estimator supports multiple features directly, and the train/test workflow should always split before any preprocessing to avoid leakage. Scikit-learn’s common pitfalls guide explicitly warns to split data before preprocessing steps.
coef_table = pd.DataFrame({
"Feature": X.columns,
"Coefficient": model.coef_
})
print(coef_table.sort_values(by="Coefficient", key=abs, ascending=False))
print("Intercept:", model.intercept_)
This is one of the biggest advantages of Linear Regression: it is easy to inspect and explain.
Scikit-learn defines mean_squared_error as the mean squared error regression loss.
Formula idea:
$MSE = \frac{1}{n}\sum (y - \hat{y})^2$
Interpretation: lower is better, and the value is in squared units of the target.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
Scikit-learn provides root_mean_squared_error, added in version 1.4.
Interpretation: RMSE is in the same units as the target, which makes it easier to communicate than MSE.
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
print("RMSE:", rmse)
Scikit-learn documents r2_score as the coefficient of determination, where the best possible score is 1.0, a constant-mean predictor gets 0.0 in the usual non-constant-target case, and values can be negative if the model is worse than that baseline.
Interpretation:
1.0 = perfect fit
0.0 = as good as predicting the mean
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R²:", r2)
You can also use Mean Absolute Error.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print("MAE:", mae)
MAE is often easier to interpret because it uses absolute errors instead of squared errors. Scikit-learn’s model evaluation guide includes multiple regression metrics for exactly these tradeoffs.
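To make the tradeoff concrete, here is a small sketch with made-up values where one large error inflates RMSE far more than MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 50.0])  # the last target is an outlier
y_pred = np.array([10.5, 11.5, 11.0, 12.0])  # the model misses it badly

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("MAE: ", mae)   # (0.5 + 0.5 + 0.0 + 38.0) / 4 = 9.75
print("RMSE:", rmse)  # sqrt(1444.5 / 4) ≈ 19.0
```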
Scikit-learn’s train_test_split utility is the standard way to create training and test subsets.
The model should be trained on one part of the data and evaluated on separate unseen data. This gives a better estimate of real-world performance.
Bad practice: training and evaluating on the same data, which gives an overly optimistic score.
Better practice: hold out a test set, train on the training portion, and evaluate on the unseen test portion.
Scikit-learn’s common pitfalls guide explicitly warns against data leakage and recommends splitting before preprocessing.
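A minimal sketch of the rule on synthetic data: the scaler is fit on the training portion only, and the fitted transformation is then reused on the test portion (the feature values here are random placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Right: statistics (mean, std) come from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrong (leakage): StandardScaler().fit(X) would let test-set
# statistics influence the transformation applied during training.
```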
Plain LinearRegression does not require scaling to work. The least-squares solution is still valid without scaling. But scaling can help when you want to compare coefficient magnitudes across features, when you combine the model with regularization such as Ridge or Lasso, or when you use gradient-based solvers.
This is a practical guideline based on how linear models and preprocessing work in scikit-learn. The OLS solution itself does not depend on distance geometry the way KNN or SVM does.
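A quick sketch checking this claim on synthetic data: for plain least squares with an intercept, scaling changes the coefficients but leaves the predictions numerically unchanged:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 2) * np.array([1.0, 1000.0])  # very different scales
y = X @ np.array([2.0, 0.003]) + rng.randn(100)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)
scaled = LinearRegression().fit(scaler.transform(X), y)

# Coefficients differ, but the fitted function is the same
pred_raw = raw.predict(X)
pred_scaled = scaled.predict(scaler.transform(X))
print(np.allclose(pred_raw, pred_scaled))  # True
```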
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
pipeline = Pipeline([
("scaler", StandardScaler()),
("lr", LinearRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
Using a Pipeline keeps preprocessing and prediction together and helps avoid leakage mistakes. Scikit-learn recommends this style for safe workflows.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load your data
df = pd.read_csv("your_data.csv")
X = df.drop("target", axis=1)
y = df["target"]
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
# Inspect coefficients
coef_table = pd.DataFrame({
"Feature": X.columns,
"Coefficient": model.coef_
})
print(coef_table)
print("Intercept:", model.intercept_)
This is the standard scikit-learn regression workflow built from train_test_split, LinearRegression, and regression metrics.
plt.figure(figsize=(7, 7))
plt.scatter(y_test, y_pred)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Actual vs Predicted")
plt.show()
If the model fits well, the points should lie roughly along a diagonal trend.
Residuals are:
$\text{residual} = y_{\text{true}} - y_{\text{pred}}$
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
A good linear model often shows residuals scattered around zero without a strong visible pattern.
Linear Regression is a simple model. If the true relationship is strongly nonlinear, the model may underfit.
Signs: low R² on both the training and test sets, and residual plots with a clear curved pattern.
In such cases, you may need polynomial features, interaction terms, or a nonlinear model such as a tree-based ensemble.
This is a practical modeling inference, not a limitation specific to scikit-learn’s implementation.
If two or more features are highly correlated, the coefficients can become unstable and harder to interpret.
The model may still predict reasonably well, but coefficient interpretation becomes weaker.
This is one reason people often move from plain Linear Regression to regularized variants like Ridge or Lasso.
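A sketch of the instability on synthetic data with two nearly identical features: OLS may split the true effect between them arbitrarily, while Ridge (the alpha value here is an arbitrary choice) shrinks the coefficients toward a more stable, balanced split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
n = 100

# Two almost perfectly correlated features
x1 = rng.rand(n)
x2 = x1 + rng.randn(n) * 0.001
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.randn(n) * 0.1

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS:  ", ols.coef_)    # individual weights can be large and offsetting
print("Ridge:", ridge.coef_)  # weights shrunk toward a balanced split
```

Note that the combined effect (the sum of the two weights) is estimated well by both models; it is the individual weights that OLS cannot pin down.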
Linear Regression can be sensitive to outliers because OLS minimizes squared error, which heavily penalizes large residuals. That follows directly from scikit-learn’s description of minimizing residual sum of squares.
If outliers are a major issue, consider robust estimators such as HuberRegressor or RANSACRegressor, or investigate and clean the outlying observations before fitting.
Scikit-learn’s linear model examples and guide place ordinary least squares next to regularized alternatives such as Ridge.
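A sketch comparing plain OLS with HuberRegressor (one of scikit-learn's robust estimators) on synthetic data where a few targets are corrupted by large outliers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.randn(100) * 0.5

# Corrupt a few targets with large positive outliers
y[:5] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])
print("Huber slope:", huber.coef_[0])  # closer to the true slope of 2.0
```

The OLS fit is dragged upward toward the corrupted points, while the Huber loss caps their influence.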
This tutorial is about plain Linear Regression, but it is useful to know where it fits in the larger family.
Linear Regression is strong because it is simple, fast to train, and easy to interpret through its coefficients.
These strengths follow directly from the OLS model structure in scikit-learn’s linear model docs.
Its main limitations are that it assumes a linear relationship, is sensitive to outliers, and produces unstable coefficients when features are highly correlated.
These are standard implications of using a linear functional form.
Use it when the relationship looks roughly linear, you need a fast baseline, or interpretability matters.
Be cautious when the relationship is strongly nonlinear, features are highly correlated, or the data contains large outliers.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Scikit-learn’s examples show this exact basic usage pattern for ordinary least squares.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Example dataframe
df = pd.DataFrame({
"tv": [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2],
"radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8, 19.6],
"newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6],
"sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2]
})
X = df.drop("sales", axis=1)
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
This is a classic kind of business-style regression problem: predict a continuous outcome from numeric inputs.
Linear Regression is one of the most important machine learning algorithms for numeric prediction.
Its core idea is simple: predict the target as a weighted sum of the input features, choosing the weights that minimize the squared error.
Scikit-learn defines LinearRegression as ordinary least squares regression and stores the learned weights in coef_ and intercept_.
The most important practical rules are: split the data before any preprocessing, keep preprocessing inside a Pipeline, evaluate on held-out data with more than one metric, and interpret coefficients with care when features are correlated.
These recommendations align with the current scikit-learn documentation and examples for linear models and regression metrics.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# X, y = your data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
Train a LinearRegression model on a synthetic dataset and report MSE and R².
Use a real multi-feature regression dataset and inspect the learned coefficients.
Plot the regression line for a simple one-feature dataset.
Create an actual-vs-predicted plot and a residual plot.
Compare plain LinearRegression with a scaled pipeline version.