Multiple Linear Regression Analysis | Towards Data Science

You will find the full code for this example at the bottom of this post.

Multiple regression is used when your response variable Y is continuous and you have at least two covariates, or independent variables, that are linearly related to it. The data are of the form:

(Y₁, X₁), … ,(Yᵢ, Xᵢ), … ,(Yₙ, Xₙ)

where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is the vector of k covariate values for the ith observation and n is the number of observations.

Understanding the Data

To make this concrete, imagine the following scenario:

You enjoy running and tracking your performance by recording the distance you run each day. Over 100 consecutive days, you collect four pieces of information:

  • The distance you run,
  • The number of hours you spent running,
  • The number of hours you slept last night,
  • And the number of hours you worked.

Now, on the 101st day, you recorded everything except the distance you ran. You want to estimate that missing value using the information you do have: the number of hours you spent running, the number of hours you slept the night before, and the number of hours you worked on that day.

To do this, you can rely on the data from the previous 100 days, which takes the form:

(Y₁, X₁), … , (Yᵢ, Xᵢ), … , (Y₁₀₀, X₁₀₀)

Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:

  • Xᵢ₁: number of hours spent running,
  • Xᵢ₂: number of hours slept the previous night,
  • Xᵢ₃: number of hours worked on that day.

The index i = 1, …, 100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable for day 101. 
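To make that prediction step concrete, here is a minimal sketch of what it looks like once a model has been fitted. The numbers below are made up purely for illustration, and scikit-learn is used only for brevity (the analysis later in this post uses statsmodels):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical records for the 100 observed days: hours run, hours slept, hours worked
rng = np.random.default_rng(42)
X = rng.uniform(low=[0.5, 5.0, 4.0], high=[2.5, 9.0, 10.0], size=(100, 3))
# Synthetic distances (in km), generated only for this illustration
y = 8.0 * X[:, 0] + 0.4 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0.0, 0.5, size=100)

# Fit on days 1..100, then estimate the missing distance for day 101
model = LinearRegression().fit(X, y)
day_101 = np.array([[1.5, 7.0, 8.0]])  # hours run, slept, worked on day 101 (hypothetical)
print(model.predict(day_101))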

Specification of the model

If we assume a linear relationship between the response variable and the covariates, which you can assess using the Pearson correlation, we can specify the model as:

Yᵢ = β₁Xᵢ₁ + β₂Xᵢ₂ + … + βₖXᵢₖ + ϵᵢ = ∑ⱼ₌₁ᵏ βⱼ Xᵢⱼ + ϵᵢ

for i = 1, …, n, where E(ϵᵢ | Xᵢ₁, …, Xᵢₖ) = 0. To take the intercept into account, the first variable is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation.

Y = (Y₁, …, Yₙ)ᵀ is the column vector of the outcome variable.

And the covariates will be denoted by:

X is the n × k design matrix whose rows are the vectors Xᵢ (its first column of ones accounts for the intercept), β = (β₁, …, βₖ)ᵀ is the column vector of coefficients used in the linear regression model, and ε = (ϵ₁, …, ϵₙ)ᵀ is the column vector of random error terms, one for each observation.

Then, we can rewrite the model as:

Y = Xβ + ε

Estimation of coefficients

Assuming that the k × k matrix XᵀX is invertible, the least squares estimate is given by:

β̂ = (XᵀX)⁻¹ XᵀY

We can derive the estimate of the regression function, an unbiased estimate of σ², and an approximate 1 − α confidence interval for βⱼ (a short code sketch illustrating these formulas follows the list):

  • Estimate of the regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼ xⱼ
  • Unbiased estimate of the error variance: σ̂² = (1 / (n − k)) × ∑ᵢ₌₁ⁿ ϵ̂ᵢ², where ϵ̂ = Y − Xβ̂ is the vector of residuals.
  • Approximate (1 − α) confidence interval for βⱼ: β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × SE(β̂ⱼ), where SE(β̂ⱼ) is the square root of the jth diagonal element of the matrix σ̂²(XᵀX)⁻¹.
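As a quick illustration of these closed-form estimators, here is a minimal NumPy sketch on simulated data (the data and dimensions are hypothetical; the article's own statsmodels-based fit is in the code appendix):

import numpy as np
from scipy import stats

# Simulated data: n observations, k columns, with a first column of ones for the intercept
rng = np.random.default_rng(0)
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # Xi1 = 1
beta_true = np.array([2.0, 1.5, -0.5, 0.3])
Y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Least squares estimate: beta_hat = (X'X)^(-1) X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Unbiased estimate of sigma^2 from the residuals
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# Approximate (1 - alpha) confidence intervals for each beta_j
alpha = 0.05
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
conf_int = np.column_stack([beta_hat - t_crit * se, beta_hat + t_crit * se])
print(beta_hat)
print(conf_int)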

Example of application 

Since we did not actually record the data on our running performance, we will use a crime dataset from 47 states in 1960 that can be obtained from here. Before we fit a linear regression, there are several steps we must follow.

Understanding the different variables of the data

The first 9 observations of the data are given by:

R      Age  S  Ed   Ex0  Ex1  LF   M     N    NW   U1   U2  W    X
79.1   151  1  91   58   56   510  950   33   301  108  41  394  261
163.5  143  0  113  103  95   583  1012  13   102  96   36  557  194
57.8   142  1  89   45   44   533  969   18   219  94   33  318  250
196.9  136  0  121  149  141  577  994   157  80   102  39  673  167
123.4  141  0  121  109  101  591  985   18   30   91   20  578  174
68.2   121  0  110  118  115  547  964   25   44   84   29  689  126
96.3   127  1  111  82   79   519  982   4    139  97   38  620  168
155.5  131  1  109  115  109  542  969   50   179  79   35  472  206
85.6   157  1  90   65   62   553  955   39   286  81   28  421  239

The data has 14 variables: the continuous response variable R, 12 continuous predictor variables, and one categorical variable S:

  1. R: Crime rate: # of offenses reported to police per million population
  2. Age: The number of males of age 14–24 per 1000 population
  3. S: Indicator variable for Southern states (0 = No, 1 = Yes)
  4. Ed: Mean # of years of schooling x 10 for persons of age 25 or older
  5. Ex0: 1960 per capita expenditure on police by state and local government
  6. Ex1: 1959 per capita expenditure on police by state and local government
  7. LF: Labor force participation rate per 1000 civilian urban males age 14–24
  8. M: The number of males per 1000 females
  9. N: State population size in hundred thousands
  10. NW: The number of non-whites per 1000 population
  11. U1: Unemployment rate of urban males per 1000 of age 14–24
  12. U2: Unemployment rate of urban males per 1000 of age 35–39
  13. W: Median value of transferable goods and assets or family income in tens of $
  14. X: The number of families per 1000 earning below 1/2 the median income

The data does not have missing values.

Graphical analysis of the relationship between the covariates X and the response variable Y

Graphical analysis of the relationship between the explanatory variables and the response variable is an essential step when performing linear regression.

It helps visualize linear trends, detect anomalies, and assess the relevance of variables before building any model.

Box plots and scatter plots with fitted linear regression lines illustrate the trend between each variable and R.

Some variables are positively correlated with the crime rate, while others are negatively correlated.

For instance, we observe a strong positive relationship between R (the crime rate) and Ex1.

In contrast, age appears to be negatively correlated with crime.

Finally, the boxplot of the binary variable S (indicating region: North or South) suggests that the crime rate is relatively similar between the two regions. Then, we can analyse the correlation matrix.

Heatmap of the correlation matrix

The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is commonly used to measure linear relationships, the Spearman Correlation is more appropriate when we want to capture monotonic, potentially non-linear relationships between variables.

In this analysis, we will use the Spearman correlation to better account for such non-linear associations.

A heatmap of the correlation matrix in Python

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.

For example, Ex0 and Ex1 both show a correlation greater than 60% with R, indicating a strong association. These variables appear to be good predictors of the crime rate.

However, since the correlation between Ex0 and Ex1 is nearly perfect, they likely convey similar information. To avoid redundancy, we can select just one of them, preferably the one with the strongest correlation with R. 

When several variables are strongly correlated with each other (a correlation above 60%, for example), they tend to carry redundant information. In such cases, we keep only one of them: the one that is most strongly correlated with the response variable R. This allows us to reduce multicollinearity.

This exercise leads us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].

Study of multicollinearity using the VIF (Variance Inflation Factor)

Before fitting the linear regression, it is important to study multicollinearity.

When correlation exists among predictors, the standard errors of the coefficient estimates increase, leading to an inflation of their variances. The Variance Inflation Factor (VIF) is a diagnostic tool used to measure how much the variance of a predictor’s coefficient is inflated due to multicollinearity, and it is typically reported in a separate “VIF” column alongside the regression results.

VIF interpretation

This VIF is calculated for each predictor in the model. The approach is to regress the i-th predictor variable against all the other predictors. We then obtain Rᵢ², which can be used to compute the VIF using the formula:

VIFᵢ = 1 / (1 − Rᵢ²)
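As a small illustration of this definition, the sketch below computes each VIF by explicitly regressing one predictor on the others. The code appendix instead uses statsmodels’ variance_inflation_factor; df and final_selected_variables are assumed to be defined as in that appendix:

import pandas as pd
import statsmodels.api as sm

def vif_manual(X: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing predictor i on the others."""
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r_squared = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Assuming df and final_selected_variables as defined in the code appendix:
# print(vif_manual(df[final_selected_variables]))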

The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.

The VIF of each variable is below 5.

Fitting a linear regression on six variables

If we fit a linear regression of the crime rate on the six selected variables, we get the following:

Output of the Multiple Linear Regression Analysis. The corresponding code is provided in the appendix.

Diagnosis of residuals

Before interpreting the regression results, we must first assess the quality of the residuals, particularly by checking for autocorrelation, homoscedasticity (constant variance), and normality. The residual diagnostics are given in the table below; a short code sketch for reproducing these checks follows the list.

Diagnosis of the residuals. Coming to the summary of the regression:
  • The Durbin-Watson statistic (≈ 2) indicates no autocorrelation in the residuals.
  • From the Omnibus test down to the kurtosis, all values show that the residuals are symmetric and approximately normally distributed.
  • The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
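These statistics are read directly from the statsmodels regression summary. If you prefer to compute them explicitly, a minimal sketch (assuming the fitted model object from the code appendix) could look like this:

from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

resid = model.resid  # residuals of the fitted OLS model from the appendix

print("Durbin-Watson:", durbin_watson(resid))               # close to 2 -> no autocorrelation
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(resid)      # normality check
print("Jarque-Bera p-value:", jb_pvalue, "skew:", skew, "kurtosis:", kurtosis)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)  # homoscedasticity check
print("Breusch-Pagan p-value:", bp_pvalue)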

Main Points to Remember

We can also assess the overall quality of the model through indicators such as the R-squared and F-statistic, which show satisfactory results in this case. (See the appendix for more details.)

We can now interpret the regression coefficients from a statistical perspective.
We intentionally exclude any business-specific interpretation of the results.
The objective of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.

At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.

This is not surprising, as these were the two variables that showed a correlation greater than 40% with the response variable R. Variables that are not statistically significant may be removed or re-evaluated, or retained, depending on the study’s context and objectives.
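In practice, these conclusions can be read straight from the fitted statsmodels object. A small sketch, assuming the model variable from the code appendix:

# p-values of the coefficients; keep those significant at the 5% level (excluding the intercept)
significant = model.pvalues[model.pvalues < 0.05].drop("const", errors="ignore")
print(significant)                               # expected here: Ex1 and NW

# Corresponding 95% confidence intervals
print(model.conf_int().loc[significant.index])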

This post gives you some guidelines to perform linear regression:

  • It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
  • Examining correlations among variables helps reduce multicollinearity and supports variable selection.
  • When two predictors are highly correlated, they may convey redundant information. In such cases, you can retain the one that is more strongly correlated with the response, or — based on domain expertise — the one with greater business relevance or practical interpretability.
  • The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
  • Before interpreting the model coefficients statistically, it is essential to verify the autocorrelation, normality, and homoscedasticity of the residuals to ensure that the model assumptions are met.

While this analysis provides valuable insights, it also has certain limitations.

The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.

If you’re building a predictive model, it’s important to split the data into training, testing, and potentially an out-of-time validation set to ensure robust evaluation.

For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.

When comparing multiple models, it’s essential to define appropriate performance metrics.

In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).
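Such an evaluation is not performed in this article, but a minimal sketch of what it could look like is given below. X_scaled_df and y are assumed to come from the code appendix; the constant column is dropped because scikit-learn adds its own intercept:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Assumes X_scaled_df (with a 'const' column) and y as built in the code appendix
X_features = X_scaled_df.drop(columns="const")
X_train, X_test, y_train, y_test = train_test_split(X_features, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)
print("MAE:", mean_absolute_error(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))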

Image Credits

All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.

References

Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

Data & Licensing

The dataset used in this article contains crime-related and demographic statistics for 47 U.S. states in 1960.
It originates from the FBI’s Uniform Crime Reporting (UCR) Program and additional U.S. government sources.

As a U.S. government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.

Sources:

Appendix: Code

Import data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()

Visual Analysis of the Variables

Create a new figure

# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()

# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')

# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1

# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])

fig.tight_layout()
plt.show()

Analysis of the correlation between variables

spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

Filtering Predictors with High Intercorrelation (ρ > 0.6)

# Correlation of each variable with the response R
spearman_corr_with_R = spearman_corr['R'].drop('R')  # exclude R-R

# Identify pairs of covariates with strong inter-correlation (|rho| > 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index

for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))

# From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()

for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)

# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]

final_selected_variables

Analysis of multicollinearity using VIF

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X = df[final_selected_variables]  

X_with_const = add_constant(X)  

vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
                   for i in range(X_with_const.shape[1])]

vif_data = vif_data[vif_data["variable"] != "const"]

print(vif_data)

Fit a linear regression model on the six selected variables after standardization, without splitting the data into train and test sets

from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd

# Variables
X = df[final_selected_variables]
y = df['R']

scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)

X_scaled_df = add_constant(X_scaled_df)

model = OLS(y, X_scaled_df).fit()
print(model.summary())
Image from author: OLS Regression Results
