In this project, we developed a predictive model that estimates the medical insurance charges an individual might incur, based on various features such as age, sex, BMI, number of children, smoking status, and region. This is a classic regression problem, solved using Python’s robust machine learning libraries.
The dataset contains 1338 records with the following columns:
- age: Age of the primary beneficiary
- sex: Gender (male/female)
- bmi: Body mass index
- children: Number of dependents
- smoker: Smoking status (yes/no)
- region: Residential area in the US
- charges: Target variable, the individual medical costs billed by health insurance
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# load data
df = pd.read_csv("insurance.csv")
We inspected data types and null values and performed exploratory data analysis (EDA). We then encoded the categorical variables: sex and smoker were mapped to 0/1 (label encoding), and region was one-hot encoded:
#convert categorical variables
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
df = pd.get_dummies(df, columns=['region'], drop_first=True)
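To make the effect of these three lines concrete, here is a minimal sketch on a tiny hand-made frame. The four rows are invented for illustration and are not taken from insurance.csv. With drop_first=True, the alphabetically first region (northeast) becomes the baseline and gets no column of its own.

```python
import pandas as pd

# Toy rows invented for illustration; the real data comes from insurance.csv.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest"],
})

# Same encoding steps as in the project code above.
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})
toy["smoker"] = toy["smoker"].map({"yes": 1, "no": 0})
toy = pd.get_dummies(toy, columns=["region"], drop_first=True)

encoded_columns = list(toy.columns)
print(encoded_columns)
```

A northeast row is therefore represented by all three region_* columns being zero (or False, depending on the pandas version).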
We used seaborn and matplotlib to understand how features influence insurance charges.
# distribution of charges (sns.distplot is deprecated; histplot with kde=True replaces it)
sns.histplot(df['charges'], kde=True)
plt.show()
Key observations:
- Charges are right-skewed.
- Smokers have significantly higher charges.
- Higher BMI and age correlate positively with charges.
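These observations can also be checked numerically rather than by eye. The sketch below is one possible way to do that; the helper name summarize_charges and the toy values are assumptions made for illustration, and the numbers are not from insurance.csv.

```python
import pandas as pd

def summarize_charges(df: pd.DataFrame) -> dict:
    """Skewness of charges, mean charges per smoker group,
    and correlation of age and bmi with charges."""
    corr = df[["age", "bmi", "charges"]].corr()["charges"].drop("charges")
    return {
        "skew": float(df["charges"].skew()),  # > 0 means right-skewed
        "mean_by_smoker": df.groupby("smoker")["charges"].mean().to_dict(),
        "corr_with_charges": corr.to_dict(),
    }

# Invented toy data, already encoded (smoker: 1 = yes, 0 = no).
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "bmi": [22.0, 24.5, 27.0, 30.0, 33.5, 36.0],
    "smoker": [0, 0, 0, 0, 1, 1],
    "charges": [1800.0, 2500.0, 4200.0, 6100.0, 36000.0, 47000.0],
})

summary = summarize_charges(toy)
print(summary)
```

On the real data, the same call would be summarize_charges(df) after the encoding step.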
We split the dataset and trained multiple regression models.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop('charges', axis=1)
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
✅ Performance Metrics:
print("R² Score:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
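For readers unfamiliar with the two metrics: RMSE is the square root of the mean squared prediction error, so it is expressed in the same units as charges, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both by hand with NumPy on made-up numbers; the function names rmse and r2 are only illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]

print("RMSE:", rmse(y_true_toy, y_pred_toy))
print("R²:", r2(y_true_toy, y_pred_toy))
```

An R² close to 1 means the model explains most of the variance in charges, while the RMSE gives a typical error size in currency units.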
Before diving into model training, we explored the dataset visually to understand key drivers of medical insurance charges. Here’s what we discovered:
1. Categorical Feature Distribution: Pie Charts (sex, smoker, region)
- Used pie charts to represent the share of different categories.
Findings:
- Gender distribution is fairly balanced.
- Smokers are in the minority but are significant from a cost perspective.
- All four regions are reasonably represented, avoiding regional bias.
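A pie chart of this kind can be sketched as follows. This is an illustration under assumptions, not the project's actual plotting code: the helper names category_shares and plot_pies are hypothetical, the rows are invented, and the columns are taken in their raw (un-encoded) form.

```python
import matplotlib.pyplot as plt
import pandas as pd

def category_shares(df, column):
    """Fraction of rows falling into each category of `column`."""
    return df[column].value_counts(normalize=True).sort_index()

def plot_pies(df, columns):
    """Draw one pie chart per categorical column, side by side."""
    fig, axes = plt.subplots(1, len(columns),
                             figsize=(4 * len(columns), 4), squeeze=False)
    for ax, col in zip(axes[0], columns):
        shares = category_shares(df, col)
        ax.pie(shares.values, labels=list(shares.index), autopct="%1.1f%%")
        ax.set_title(col)
    return fig

# Invented rows for illustration; not taken from insurance.csv.
toy = pd.DataFrame({
    "sex": ["male", "female"] * 4,
    "smoker": ["yes", "no", "no", "no", "no", "no", "yes", "no"],
    "region": ["northeast", "northwest", "southeast", "southwest"] * 2,
})

smoker_shares = category_shares(toy, "smoker")
fig = plot_pies(toy, ["sex", "smoker", "region"])
```

In a notebook the figure renders automatically; in a script, call plt.show() afterwards.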
2. Average Charges by Category: Bar Charts (sex, children, smoker, region)
- Visualized how average charges vary across categorical variables.
Findings:
- Smokers pay significantly more, clearly the most impactful factor.
- Slight variations by region and number of children.
- Minimal charge difference between genders.
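The numbers behind such bar charts are simple group averages. The following sketch shows the computation on invented rows; mean_charges_by is a hypothetical helper name, and the values do not come from the dataset.

```python
import pandas as pd

def mean_charges_by(df, column):
    """Average charges per category of `column`, largest first."""
    return df.groupby(column)["charges"].mean().sort_values(ascending=False)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no", "no"],
    "sex": ["male", "female", "male", "female", "male"],
    "charges": [30000.0, 40000.0, 2000.0, 4000.0, 6000.0],
})

by_smoker = mean_charges_by(toy, "smoker")
print(by_smoker)
```

Calling .plot(kind="bar") on the returned Series draws the corresponding bar chart.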
3. Age vs. Charges and BMI vs. Charges: Scatter Plots with Smoker Hue (age, bmi vs. charges)
- Scatter plots colored by smoking status to reveal correlations.
Findings:
- Age has a strong positive correlation with insurance cost.
- Smokers with high BMI are often extreme outliers in terms of cost.
- Smoking amplifies risk in both age and BMI categories.
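Why color by smoking status at all? A pooled correlation can hide a relationship that is very clear within each group. The sketch below illustrates the effect on deliberately constructed toy data (perfectly linear within each group, which real data will not be); corr_by_smoker is a hypothetical helper name.

```python
import pandas as pd

def corr_by_smoker(df, feature):
    """Pearson correlation between `feature` and charges within each smoker group."""
    return {
        status: float(group[feature].corr(group["charges"]))
        for status, group in df.groupby("smoker")
    }

# Constructed toy data: same ages in both groups, smokers shifted far upward.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 20, 30, 40, 50],
    "smoker": ["no"] * 4 + ["yes"] * 4,
    "charges": [2000.0, 3000.0, 4000.0, 5000.0,
                20000.0, 22000.0, 24000.0, 26000.0],
})

pooled = float(toy["age"].corr(toy["charges"]))
per_group = corr_by_smoker(toy, "age")
print("pooled:", pooled)
print("per group:", per_group)
```

In this toy case the pooled correlation is much weaker than the within-group ones, because the gap between smokers and non-smokers dominates the overall spread of charges.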
4. Outlier Detection: Box Plots (age, bmi)
- Box plots to identify outliers in key numerical features.
Findings:
- BMI shows several outliers, likely representing obese individuals, which are important for risk profiling.
- Age is more uniformly distributed, with few anomalies.
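By default, a box plot marks a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. The same rule can be applied directly to list the flagged values. This is a minimal sketch with invented BMI values; iqr_outliers is an illustrative name.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented BMI values for illustration only.
bmi_toy = [22, 24, 26, 27, 28, 29, 30, 31, 33, 53]
flagged = iqr_outliers(bmi_toy)
print(flagged)
```

On the real data, iqr_outliers(df['bmi']) would return the BMI values that appear as individual points in the box plot.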
We also tested:
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
Results show XGBoost performed best with the highest R² and lowest error.
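A comparison like this can be organized as a loop over models that share the same train/test split. The sketch below is an assumption about how such a loop might look, not the project's actual code: compare_models is a hypothetical helper, the data is synthetic, and XGBoost is left out so the example needs only scikit-learn. Since xgboost's XGBRegressor follows the same fit/predict interface, it could be added to the models dictionary if the package is installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def compare_models(X, y, models, test_size=0.2, random_state=42):
    """Fit every model on the same split and collect R² and RMSE on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = {
            "r2": float(r2_score(y_test, pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        }
    return results

# Synthetic stand-in for the insurance data; the relationship is invented.
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 65, n)
bmi = rng.normal(30, 6, n)
smoker = rng.integers(0, 2, n)
X_toy = np.column_stack([age, bmi, smoker])
y_toy = 250.0 * age + 300.0 * bmi + 20000.0 * smoker + rng.normal(0, 2000, n)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = compare_models(X_toy, y_toy, models)
for name, scores in results.items():
    print(name, scores)
```

Using one shared split keeps the comparison fair: every model is scored on exactly the same held-out rows.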
- Smoker status and age are strong predictors of insurance charges.
- XGBoost provides the best performance among tried models.
- The model can be deployed as a REST API or used in a web app for practical applications.
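Whatever form deployment takes, each incoming request has to be encoded exactly as the training data was. The sketch below shows that step only. The encode_record helper is hypothetical, and the column order in FEATURE_COLUMNS is an assumption based on the usual column order of insurance.csv and the preprocessing shown earlier; it should be checked against X.columns before use.

```python
import pandas as pd

# Assumed training column order; verify against X.columns in the real project.
FEATURE_COLUMNS = [
    "age", "sex", "bmi", "children", "smoker",
    "region_northwest", "region_southeast", "region_southwest",
]

def encode_record(record: dict) -> pd.DataFrame:
    """Turn one raw record into a single-row frame matching the training features."""
    row = {
        "age": record["age"],
        "sex": {"male": 0, "female": 1}[record["sex"]],
        "bmi": record["bmi"],
        "children": record["children"],
        "smoker": {"yes": 1, "no": 0}[record["smoker"]],
    }
    # northeast is the dropped baseline, so it maps to all zeros.
    for region in ("northwest", "southeast", "southwest"):
        row[f"region_{region}"] = int(record["region"] == region)
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)

# Invented example record.
example = {"age": 40, "sex": "female", "bmi": 27.5,
           "children": 2, "smoker": "no", "region": "southeast"}
row = encode_record(example)
print(row)
```

The resulting row can then be passed to a fitted model's predict method, for example lr.predict(row).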
This project was a collaborative effort by:
- Keshav Laddha (23UCS616)
- Devansh Goyal (23UCS563)
- Divyansh Chhabra (23UCS572)
Each member contributed significantly:
- From data preprocessing to visualisation
- From model development to documentation
Overall, it was a great experience that combined both technical rigor and team coordination.
🔗 GitHub Link: https://github.com/devanshgoyal001/medical_insurance_predictor