You can find the full code for this example at the bottom of this post.
Multiple regression is used when your response variable Y is continuous and you have k covariates, or independent variables, that are linearly related to it. The data are of the form:
(Y₁, X₁), … ,(Yᵢ, Xᵢ), … ,(Yₙ, Xₙ)
where Xᵢ = (Xᵢ₁, …, Xᵢₖ) is the vector of k covariate values for the ith observation and n is the number of observations.
Understanding the Data
To make this concrete, imagine the following scenario:
You enjoy running and tracking your performance by recording the distance you run each day. Over 100 consecutive days, you collect four pieces of information:
- The distance you run,
- The number of hours you spent running,
- The number of hours you slept last night,
- And the number of hours you worked
Now, on the 101st day, you recorded everything except the distance you ran. You want to estimate that missing value using the information you do have: the number of hours you spent running, the number of hours you slept the night before, and the number of hours you worked on that day.
To do this, you can rely on the data from the previous 100 days, which takes the form:
(Y₁, X₁), … , (Yᵢ, Xᵢ), … , (Y₁₀₀, X₁₀₀)
Here, each Yᵢ is the distance you ran on day i, and each covariate vector Xᵢ = (Xᵢ₁, Xᵢ₂, Xᵢ₃) corresponds to:
- Xᵢ₁: number of hours spent running,
- Xᵢ₂: number of hours slept the previous night,
- Xᵢ₃: number of hours worked on that day.
The index i = 1, …, 100 refers to the 100 days with complete data. With this dataset, you can now fit a multiple linear regression model to estimate the missing response variable for day 101.
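Before formalizing the model, here is a minimal sketch of what this setup could look like in code. The numbers are synthetic and the column names (hours_run, hours_slept, hours_worked) are hypothetical; the goal is only to illustrate the structure of the data and the prediction for day 101.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical running log: 100 days with complete records (synthetic numbers)
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    'hours_run': rng.uniform(0.5, 2.0, n),    # X_i1
    'hours_slept': rng.uniform(5, 9, n),      # X_i2
    'hours_worked': rng.uniform(4, 10, n),    # X_i3
})
# Synthetic distances, used only to make the example runnable
y = 10 * X['hours_run'] + 0.3 * X['hours_slept'] - 0.1 * X['hours_worked'] + rng.normal(0, 1, n)

# Fit the regression on the 100 complete days
model = LinearRegression().fit(X, y)

# Day 101: the covariates are known, the distance is not
day_101 = pd.DataFrame({'hours_run': [1.2], 'hours_slept': [7.0], 'hours_worked': [8.0]})
print(model.predict(day_101))  # estimated distance for day 101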
Specification of the model
If we assume a linear relationship between the response variable and the covariates, which you can assess using the Pearson correlation, we can specify the model as:

Yᵢ = ∑ⱼ₌₁ᵏ βⱼ Xᵢⱼ + ϵᵢ

for i = 1, …, n, where E(ϵᵢ | Xᵢ₁, … , Xᵢₖ) = 0. To account for the intercept, the first covariate is set to Xᵢ₁ = 1 for i = 1, …, n. To estimate the coefficients, the model is expressed in matrix notation.

The response variable and the coefficients are written as the vectors Y = (Y₁, …, Yₙ)ᵀ and β = (β₁, …, βₖ)ᵀ, and the covariates are collected in the n × k design matrix X whose ith row is Xᵢ = (Xᵢ₁, …, Xᵢₖ), with the first column equal to 1 to account for the intercept.
Then, we can rewrite the model as:
Y = Xβ + ε
Estimation of coefficients
Assuming that the k × k matrix XᵀX is invertible, the least squares estimate is given by:

β̂ = (XᵀX)⁻¹ XᵀY
We can derive the estimate of the regression function, an unbiased estimate of σ², and an approximate 1−α confidence interval for βⱼ:
- Estimate of the regression function: r̂(x) = ∑ⱼ₌₁ᵏ β̂ⱼ xⱼ
- Unbiased estimate of the variance: σ̂² = (1 / (n − k)) × ∑ᵢ₌₁ⁿ ε̂ᵢ², where ε̂ = Y − Xβ̂ is the vector of residuals.
- And β̂ⱼ ± tₙ₋ₖ,₁₋α⁄₂ × SE(β̂ⱼ) is an approximate 1 − α confidence interval for βⱼ, where SE(β̂ⱼ)² is the jth diagonal element of the matrix σ̂² (Xᵀ X)⁻¹. A minimal sketch of these computations is given below.
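To make these formulas concrete, here is a minimal NumPy/SciPy sketch (on synthetic data, not the crime dataset used later) that computes β̂, σ̂², and the confidence intervals directly from the matrix expressions above:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 100, 4                      # k includes the intercept column
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
Y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimate: beta_hat = (X'X)^(-1) X'Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Unbiased estimate of sigma^2 from the residuals
residuals = Y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - k)

# Approximate 1 - alpha confidence intervals for each beta_j
alpha = 0.05
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
ci_lower, ci_upper = beta_hat - t_crit * se, beta_hat + t_crit * se
print(np.column_stack([beta_hat, ci_lower, ci_upper]))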
Example of application
Because we do not actually have the running data from the scenario above, we will use a crime dataset covering 47 US states in 1960, which can be obtained from here. Before fitting a linear regression, there are several steps we must follow.
Understanding the different variables of the data
The first 9 observations of the data are given by:
| R | Age | S | Ed | Ex0 | Ex1 | LF | M | N | NW | U1 | U2 | W | X |
|---|-----|---|----|-----|-----|----|---|---|----|----|----|---|---|
| 79.1 | 151 | 1 | 91 | 58 | 56 | 510 | 950 | 33 | 301 | 108 | 41 | 394 | 261 |
| 163.5 | 143 | 0 | 113 | 103 | 95 | 583 | 1012 | 13 | 102 | 96 | 36 | 557 | 194 |
| 57.8 | 142 | 1 | 89 | 45 | 44 | 533 | 969 | 18 | 219 | 94 | 33 | 318 | 250 |
| 196.9 | 136 | 0 | 121 | 149 | 141 | 577 | 994 | 157 | 80 | 102 | 39 | 673 | 167 |
| 123.4 | 141 | 0 | 121 | 109 | 101 | 591 | 985 | 18 | 30 | 91 | 20 | 578 | 174 |
| 68.2 | 121 | 0 | 110 | 118 | 115 | 547 | 964 | 25 | 44 | 84 | 29 | 689 | 126 |
| 96.3 | 127 | 1 | 111 | 82 | 79 | 519 | 982 | 4 | 139 | 97 | 38 | 620 | 168 |
| 155.5 | 131 | 1 | 109 | 115 | 109 | 542 | 969 | 50 | 179 | 79 | 35 | 472 | 206 |
| 85.6 | 157 | 1 | 90 | 65 | 62 | 553 | 955 | 39 | 286 | 81 | 28 | 421 | 239 |
The dataset has 14 variables: the continuous response variable R, 12 continuous predictors, and one binary predictor S:
- R: Crime rate: # of offenses reported to police per million population
- Age: The number of males of age 14–24 per 1000 population
- S: Indicator variable for Southern states (0 = No, 1 = Yes)
- Ed: Mean # of years of schooling x 10 for persons of age 25 or older
- Ex0: 1960 per capita expenditure on police by state and local government
- Ex1: 1959 per capita expenditure on police by state and local government
- LF: Labor force participation rate per 1000 civilian urban males age 14–24
- M: The number of males per 1000 females
- N: State population size in hundred thousands
- NW: The number of non-whites per 1000 population
- U1: Unemployment rate of urban males per 1000 of age 14–24
- U2: Unemployment rate of urban males per 1000 of age 35–39
- W: Median value of transferable goods and assets or family income in tens of $
- X: The number of families per 1000 earning below 1/2 the median income
The data does not have missing values.
Graphical analysis of the relationship between the covariates X and the response variable Y
Graphical analysis of the relationship between the explanatory variables and the response variable is an essential first step when performing linear regression.
It helps visualize linear trends, detect anomalies, and assess the relevance of variables before building any model.

Some variables are positively correlated with the crime rate, while others are negatively correlated.
For instance, we observe a strong positive relationship between R (the crime rate) and Ex1.
In contrast, age appears to be negatively correlated with crime.
Finally, the boxplot of the binary variable S (indicating whether the state is in the South) suggests that the crime rate is relatively similar between the two regions. Next, we can analyse the correlation matrix.
Heatmap of the Spearman correlation matrix
The correlation matrix allows us to study the strength of the relationship between variables. While the Pearson correlation is commonly used to measure linear relationships, the Spearman Correlation is more appropriate when we want to capture monotonic, potentially non-linear relationships between variables.
In this analysis, we will use the Spearman correlation to better account for such non-linear associations.

The first row of the correlation matrix shows the strength of the relationship between each covariate and the response variable R.
For example, Ex0 and Ex1 both show a correlation greater than 60% with R, indicating a strong association. These variables appear to be good predictors of the crime rate.
However, since the correlation between Ex0 and Ex1 is nearly perfect, they likely convey similar information. To avoid redundancy, we can select just one of them, preferably the one with the strongest correlation with R.
When several variables are strongly correlated with each other (a correlation above 60%, for example), they tend to carry redundant information. In such cases, we keep only one of them — the one that is most strongly correlated with the response variable R. This allows us to reduce multicollinearity.
This exercise leads us to select the following variables: ['Ex1', 'LF', 'M', 'N', 'NW', 'U2'].
Study of multicollinearity using the VIF (Variance Inflation Factors)
Before fitting the linear regression, it is important to study multicollinearity.
When correlation exists among predictors, the standard errors of the coefficient estimates increase, leading to an inflation of their variances. The Variance Inflation Factor (VIF) is a diagnostic tool that measures how much the variance of a predictor's coefficient is inflated due to multicollinearity.

The VIF is calculated for each predictor in the model. The approach is to regress the i-th predictor variable against all the other predictors. We then obtain the coefficient of determination Rᵢ² of that regression, which gives the VIF:

VIFᵢ = 1 / (1 − Rᵢ²)
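To illustrate the formula, here is a small sketch that computes the VIF of each predictor by hand, i.e., by regressing it on the other predictors. It assumes the df and final_selected_variables objects defined in the code at the end of the post, and should agree with statsmodels' variance_inflation_factor.
import pandas as pd
from statsmodels.api import OLS, add_constant

def manual_vif(X: pd.DataFrame, column: str) -> float:
    """Regress one predictor on all the others and return 1 / (1 - R^2)."""
    others = add_constant(X.drop(columns=[column]))
    r_squared = OLS(X[column], others).fit().rsquared
    return 1.0 / (1.0 - r_squared)

# Assuming df and final_selected_variables come from the code at the end of the post
X = df[final_selected_variables]
print({col: round(manual_vif(X, col), 2) for col in X.columns})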
The table below presents the VIF values for the six remaining variables, all of which are below 5. This indicates that multicollinearity is not a concern, and we can proceed with fitting the linear regression model.

Fitting a linear regression on six variables
If we fit a linear regression of the crime rate on these six variables, we get the following:

Diagnosis of residuals
Before interpreting the regression results, we must first assess the quality of the residuals, in particular by checking for autocorrelation, homoscedasticity (constant variance), and normality. The residual diagnostics are given in the table below, and a minimal sketch of how to run these checks in code follows the bullet points.

- The Durbin-Watson statistic (≈ 2) indicates no autocorrelation in the residuals.
- The Omnibus and Jarque-Bera tests, together with the skewness and kurtosis values, indicate that the residuals are symmetric and approximately normally distributed.
- The low condition number (3.06) confirms that there is no multicollinearity among the predictors.
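As a sketch, these diagnostics can also be computed directly from the fitted model's residuals, assuming the model and X_scaled_df objects from the code at the end of the post:
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Assuming `model` and `X_scaled_df` come from the fitting code at the end of the post
residuals = model.resid

# Autocorrelation: values close to 2 indicate no autocorrelation
print("Durbin-Watson:", durbin_watson(residuals))

# Normality: Jarque-Bera test, skewness, and kurtosis of the residuals
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(residuals)
print("Jarque-Bera p-value:", jb_pvalue, "skew:", skew, "kurtosis:", kurtosis)

# Homoscedasticity: Breusch-Pagan test (a large p-value gives no evidence against constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(residuals, X_scaled_df)
print("Breusch-Pagan p-value:", bp_pvalue)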
Interpretation of the results
We can also assess the overall quality of the model through indicators such as the R-squared and F-statistic, which show satisfactory results in this case. (See the appendix for more details.)
We can now interpret the regression coefficients from a statistical perspective.
We intentionally exclude any business-specific interpretation of the results.
The objective of this analysis is to illustrate a few simple and essential steps for modeling a problem using multiple linear regression.
At the 5% significance level, two coefficients are statistically significant: Ex1 and NW.
This is not surprising, as these were the two variables that showed a correlation greater than 40% with the response variable R. Variables that are not statistically significant may be removed, re-evaluated, or retained, depending on the study's context and objectives.
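For example, assuming the model object fitted in the code at the end of the post, the coefficients that are significant at the 5% level can be read off directly from the p-values:
# Assuming `model` is the fitted OLS model from the code at the end of the post
alpha = 0.05
summary_table = model.params.to_frame("coefficient")
summary_table["p_value"] = model.pvalues
print(summary_table[summary_table["p_value"] < alpha])  # e.g., Ex1 and NW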
Main Points to Remember
This post gives you some guidelines for performing linear regression:
- It is important to check linearity through graphical analysis and to study the correlation between the response variable and the predictors.
- Examining correlations among variables helps reduce multicollinearity and supports variable selection.
- When two predictors are highly correlated, they may convey redundant information. In such cases, you can retain the one that is more strongly correlated with the response, or — based on domain expertise — the one with greater business relevance or practical interpretability.
- The Variance Inflation Factor (VIF) is a useful tool to quantify and assess multicollinearity.
- Before interpreting the model coefficients statistically, it is essential to verify the autocorrelation, normality, and homoscedasticity of the residuals to ensure that the model assumptions are met.
While this analysis provides valuable insights, it also has certain limitations.
The absence of missing values in the dataset simplifies the study, but this is rarely the case in real-world scenarios.
If you’re building a predictive model, it’s important to split the data into training, testing, and potentially an out-of-time validation set to ensure robust evaluation.
For variable selection, techniques such as stepwise selection and other feature selection methods can be applied.
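As an illustration, here is a minimal sketch of forward stepwise selection based on the AIC; the criterion and the helper function are my own choices, not part of the original analysis, and it assumes the df object from the code at the end of the post.
import numpy as np
import pandas as pd
from statsmodels.api import OLS, add_constant

def forward_selection_aic(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedy forward selection: repeatedly add the variable that most lowers the AIC."""
    selected, remaining = [], list(X.columns)
    best_aic = OLS(y, np.ones(len(y))).fit().aic  # intercept-only model
    while remaining:
        aics = {var: OLS(y, add_constant(X[selected + [var]])).fit().aic for var in remaining}
        best_var = min(aics, key=aics.get)
        if aics[best_var] >= best_aic:
            break  # no remaining candidate improves the AIC
        best_aic = aics[best_var]
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

# Assuming df comes from the code at the end of the post
predictors = [col for col in df.columns if col != 'R']
print(forward_selection_aic(df[predictors], df['R']))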
When comparing multiple models, it’s essential to define appropriate performance metrics.
In the case of linear regression, commonly used metrics include the Mean Absolute Error (MAE) and the Mean Squared Error (MSE).
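As a sketch of what such an evaluation could look like on this dataset (the split ratio and random seed are arbitrary choices, not part of the original analysis):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Assuming df and final_selected_variables come from the code at the end of the post
X_train, X_test, y_train, y_test = train_test_split(
    df[final_selected_variables], df['R'], test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))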
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
References
Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
Data & Licensing
The dataset used in this article contains crime-related and demographic statistics for 47 U.S. states in 1960.
It originates from the FBI’s Uniform Crime Reporting (UCR) Program and additional U.S. government sources.
As a U.S. government work, the data is in the public domain under 17 U.S. Code § 105 and is free to use, share, and reproduce without restriction.
Sources:
Code
Import data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
df = pd.read_csv('data/Multiple_Regression_Dataset.csv')
df.head()
Visual Analysis of the Variables
# Create a new figure with one subplot per covariate
# Extract response variable and covariates
response = 'R'
covariates = [col for col in df.columns if col != response]
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 18))
axes = axes.flatten()
# Plot boxplot for binary variable 'S'
sns.boxplot(data=df, x='S', y='R', ax=axes[0])
axes[0].set_title('Boxplot of R by S')
axes[0].set_xlabel('S')
axes[0].set_ylabel('R')
# Plot regression lines for all other covariates
plot_index = 1
for cov in covariates:
    if cov != 'S':
        sns.regplot(data=df, x=cov, y='R', ax=axes[plot_index], scatter=True, line_kws={"color": "red"})
        axes[plot_index].set_title(f'{cov} vs R')
        axes[plot_index].set_xlabel(cov)
        axes[plot_index].set_ylabel('R')
        plot_index += 1
# Hide unused subplots
for i in range(plot_index, len(axes)):
    fig.delaxes(axes[i])
fig.tight_layout()
plt.show()
Analysis of the correlation between variables
spearman_corr = df.corr(method='spearman')
plt.figure(figsize=(12, 10))
sns.heatmap(spearman_corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()
Filtering Predictors with High Intercorrelation (ρ > 0.6)
# Step 2: Correlation of each variable with response R
spearman_corr_with_R = spearman_corr['R'].drop('R') # exclude R-R
# Step 3: Identify pairs of covariates with strong inter-correlation (e.g., > 0.6)
strong_pairs = []
threshold = 0.6
covariates = spearman_corr_with_R.index
for i, var1 in enumerate(covariates):
    for var2 in covariates[i+1:]:
        if abs(spearman_corr.loc[var1, var2]) > threshold:
            strong_pairs.append((var1, var2))
# Step 4: From each correlated pair, keep only the variable most correlated with R
to_keep = set()
to_discard = set()
for var1, var2 in strong_pairs:
    if abs(spearman_corr_with_R[var1]) >= abs(spearman_corr_with_R[var2]):
        to_keep.add(var1)
        to_discard.add(var2)
    else:
        to_keep.add(var2)
        to_discard.add(var1)
# Final selection: all covariates excluding the ones to discard due to redundancy
final_selected_variables = [var for var in covariates if var not in to_discard]
final_selected_variables
Analysis of multicollinearity using VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
X = df[final_selected_variables]
X_with_const = add_constant(X)
vif_data = pd.DataFrame()
vif_data["variable"] = X_with_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
vif_data = vif_data[vif_data["variable"] != "const"]
print(vif_data)
Fit a linear regression model on the six selected variables after standardization (without splitting the data into training and test sets)
from sklearn.preprocessing import StandardScaler
from statsmodels.api import OLS, add_constant
import pandas as pd
# Variables
X = df[final_selected_variables]
y = df['R']
scaler = StandardScaler()
X_scaled_vars = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled_vars, columns=final_selected_variables)
X_scaled_df = add_constant(X_scaled_df)
model = OLS(y, X_scaled_df).fit()
print(model.summary())
