🧠 Predicting Medical Insurance Prices Using Machine Learning in Python | by Keshav

In this project, we developed a predictive model that estimates the medical insurance charges an individual might incur, based on various features such as age, sex, BMI, number of children, smoking status, and region. This is a classic regression problem, solved using Python’s robust machine learning libraries.

The dataset contains 1338 records with the following columns:

age: Age of the primary beneficiary
sex: Gender (male/female)
bmi: Body mass index
children: Number of dependents
smoker: Smoking status (yes/no)
region: Residential area in the US
charges: Target variable – individual medical costs billed by health insurance

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# load data
df = pd.read_csv("insurance.csv")

We inspected data types and null values and performed exploratory data analysis (EDA). Then, we converted the categorical variables: the binary columns sex and smoker were mapped to 0/1 (Label Encoding), and region was expanded with One-Hot Encoding:

#convert categorical variables
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
df = pd.get_dummies(df, columns=['region'], drop_first=True)
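
One thing worth checking after this step: Series.map returns NaN for any value that is missing from the dictionary (a stray capital letter or trailing space, for example). Here is a minimal sketch on a few made-up rows (the values are hypothetical, not taken from insurance.csv) that repeats the encoding and counts unmapped entries:

```python
import pandas as pd

# Made-up rows with the same column layout as insurance.csv
sample = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "sex": ["female", "male", "male", "female"],
    "bmi": [25.0, 31.0, 28.5, 35.0],
    "children": [0, 1, 3, 2],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
    "charges": [15000.0, 4000.0, 7000.0, 40000.0],
})

encoded = sample.copy()
encoded["sex"] = encoded["sex"].map({"male": 0, "female": 1})
encoded["smoker"] = encoded["smoker"].map({"yes": 1, "no": 0})
# drop_first=True drops the alphabetically first region present in the data
encoded = pd.get_dummies(encoded, columns=["region"], drop_first=True)

# map() yields NaN for unmapped values, so a zero count means every row was encoded
unmapped = int(encoded[["sex", "smoker"]].isna().sum().sum())
region_cols = sorted(c for c in encoded.columns if c.startswith("region_"))
print(unmapped)
print(region_cols)
```

If the count is not zero, the affected rows should be cleaned before training, since most scikit-learn estimators reject NaN inputs.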

We used seaborn and matplotlib to understand how features influence insurance charges.

# distribution of charges (histplot replaces distplot, which is deprecated in recent seaborn releases)
sns.histplot(df['charges'], kde=True)

Key observations:

Charges are right-skewed.
Smokers have significantly higher charges.
Higher BMI and age correlate positively with charges.
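
The right skew can also be put into a number with pandas' skew(). The sketch below uses a synthetic log-normal sample as a stand-in for df['charges'] (an assumption, so the snippet runs without the CSV). The log transform is shown only for illustration; it is not a step from the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['charges']: a log-normal draw is right-skewed by construction
rng = np.random.default_rng(0)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000), name="charges")

raw_skew = charges.skew()             # > 0 indicates a long right tail
log_skew = np.log1p(charges).skew()   # skewness after a log transform

print(f"skew (raw): {raw_skew:.2f}")
print(f"skew (log1p): {log_skew:.2f}")
```

On the real data, the same two lines applied to df['charges'] would show how pronounced the tail is and whether a log-transformed target might be worth trying.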

We split the dataset and trained multiple regression models.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = df.drop('charges', axis=1)
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

✅ Performance Metrics:

print("R² Score:", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
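
For readers who want to see what these two metrics compute, here is a small worked sketch. The arrays are hypothetical values chosen for illustration, not results from this project. RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted charges, for illustration only
y_true = np.array([1000.0, 5000.0, 12000.0, 30000.0])
y_hat = np.array([1500.0, 4500.0, 13000.0, 27000.0])

residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))

ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

# The manual values should agree with scikit-learn's
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)
print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is in the same units as charges, which makes it easier to interpret than the raw mean squared error.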

Before diving into model training, we explored the dataset visually to understand key drivers of medical insurance charges. Here’s what we discovered:

1. Categorical Feature Distribution — Pie Charts
(sex, smoker, region)

Used pie charts to represent the share of different categories.

Findings:

Gender distribution is fairly balanced.
Smokers are in the minority but are significant from a cost perspective.
All four regions are reasonably represented, avoiding regional bias.
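
A rough sketch of how such pie charts can be drawn with matplotlib is shown below. The rows are made up for illustration, and the plot runs on the raw (un-encoded) category labels; the exact plotting code used in the project may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; in the project this would be the raw insurance DataFrame
raw = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "smoker": ["no", "no", "yes", "no", "no", "yes"],
    "region": ["northeast", "northwest", "southeast", "southwest", "southeast", "northeast"],
})

features = ["sex", "smoker", "region"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 4))
shares = {}
for ax, col in zip(axes, features):
    counts = raw[col].value_counts()
    shares[col] = counts / counts.sum()  # fraction of rows in each category
    ax.pie(counts.values, labels=counts.index.tolist(), autopct="%1.1f%%")
    ax.set_title(col)
fig.tight_layout()
```

In a notebook, calling plt.show() (or leaving out the Agg line) displays the three charts side by side.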

2. Average Charges by Category — Bar Charts
(sex, children, smoker, region)

Visualized how average charges vary across categorical variables.

Findings:

Smokers pay significantly more, clearly the most impactful factor.
Slight variations by region and number of children.
Minimal charge difference between genders.
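
The bar heights in these charts are simply group means. A minimal sketch with made-up numbers (not values from insurance.csv) shows the underlying computation:

```python
import pandas as pd

# Made-up rows for illustration; values are not from insurance.csv
raw = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "smoker": ["yes", "no", "yes", "no"],
    "charges": [30000.0, 4000.0, 26000.0, 6000.0],
})

# Mean charge per category: these are the heights of the bars in a bar chart
mean_by_smoker = raw.groupby("smoker")["charges"].mean()
mean_by_sex = raw.groupby("sex")["charges"].mean()
print(mean_by_smoker)
print(mean_by_sex)
```

Calling .plot(kind="bar") on either result draws the corresponding chart; the same groupby works for children and region.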

3. Age vs. Charges & BMI vs. Charges — Scatter Plots with Smoker Hue
(age, bmi vs charges)

Scatter plots colored by smoking status to reveal correlations.

Findings:

Age has a strong positive correlation with insurance cost.
Smokers with high BMI are often extreme outliers in terms of cost.
Smoking amplifies risk in both age and BMI categories.
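
Coloring by smoker status matters because pooling both groups can blur the age trend. The sketch below illustrates this on synthetic data built so that charges rise with age and jump for smokers (an assumption made for illustration; the coefficients are not estimates from the real dataset). It compares the pooled age/charges correlation with the correlation inside each smoker group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
# Synthetic data: charges grow with age and shift upward for smokers
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.integers(0, 2, size=n),
})
demo["charges"] = 250 * demo["age"] + 20000 * demo["smoker"] + rng.normal(0, 2000, size=n)

# Pearson correlation over all rows, then separately per smoker group
overall = demo["age"].corr(demo["charges"])
within = {
    flag: grp["age"].corr(grp["charges"])
    for flag, grp in demo.groupby("smoker")
}
print(f"overall: {overall:.2f}")
print({flag: round(value, 2) for flag, value in within.items()})
```

In this synthetic setup the within-group correlations come out higher than the pooled one, which is the pattern a hue-colored scatter plot makes visible at a glance.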

4. Outlier Detection — Box Plots
(age, bmi)

Box plots to identify outliers in key numerical features.

Findings:

BMI shows several high outliers, likely representing severely obese individuals, which matter for risk profiling.
Age is more uniformly distributed, with few anomalies.
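
Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied directly, as in this sketch on made-up BMI values (52.0 is planted as an obvious outlier):

```python
import pandas as pd

# Made-up BMI values; 52.0 is planted as an obvious high outlier
bmi = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 30.5, 32.0, 33.5, 52.0], name="bmi")

q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker limits of a standard box plot

outliers = bmi[(bmi < lower) | (bmi > upper)]
print(outliers.tolist())  # [52.0]
```

Applied to df['bmi'], this lists the exact rows that appear as dots beyond the whiskers.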

We also tested:

Random Forest Regressor
Gradient Boosting Regressor
XGBoost Regressor

Results show XGBoost performed best with the highest R² and lowest error.
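
A comparison like this can be organized as a loop over candidate models. The sketch below uses synthetic features and default hyperparameters, both of which are assumptions (the settings used in the project are not stated here), and it includes XGBoost only if the separate xgboost package is installed. The scores it prints say nothing about the insurance data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for X and y (not the insurance data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (3.0 * X_demo[:, 0]
          + 2.0 * X_demo[:, 1] * (X_demo[:, 2] > 0)
          + rng.normal(scale=0.5, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
}
try:
    # xgboost is a separate package; it is skipped here if not installed
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(random_state=42)
except ImportError:
    pass

# Fit every model on the same split and score it on the same held-out rows
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = {
        "r2": float(r2_score(y_te, pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
    }

# Print models from lowest to highest RMSE
for name, scores in sorted(results.items(), key=lambda kv: kv[1]["rmse"]):
    print(f"{name}: R2={scores['r2']:.3f}, RMSE={scores['rmse']:.3f}")
```

Using one shared train/test split keeps the comparison fair; cross-validation would give a more stable ranking on a dataset of this size.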

Smoker status and age are strong predictors of insurance charges.
XGBoost provides the best performance among tried models.
The model can be deployed as a REST API or used in a web app for practical applications.
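
Whatever framework serves the model, the request handler has to encode each incoming record exactly like the training data. The sketch below shows that step only. The helper name to_feature_row, the region names, and the tiny training set are all hypothetical; a real service would load the fitted model from disk instead of training one inline.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Column order after the encoding step (assumes northeast is the dropped region)
FEATURES = ["age", "sex", "bmi", "children", "smoker",
            "region_northwest", "region_southeast", "region_southwest"]

def to_feature_row(record: dict) -> pd.DataFrame:
    """Hypothetical helper: encode one raw record the same way as the training data."""
    row = {
        "age": record["age"],
        "sex": {"male": 0, "female": 1}[record["sex"]],
        "bmi": record["bmi"],
        "children": record["children"],
        "smoker": {"yes": 1, "no": 0}[record["smoker"]],
    }
    for region in ("northwest", "southeast", "southwest"):
        row[f"region_{region}"] = int(record["region"] == region)
    return pd.DataFrame([row], columns=FEATURES)

# Tiny made-up training set so the sketch runs end to end
train = pd.DataFrame([
    [25, 0, 24.0, 0, 0, 0, 0, 1, 3000.0],
    [40, 1, 30.0, 2, 1, 1, 0, 0, 25000.0],
    [55, 0, 28.0, 1, 0, 0, 1, 0, 11000.0],
    [33, 1, 35.0, 3, 1, 0, 0, 0, 38000.0],
], columns=FEATURES + ["charges"])
model = LinearRegression().fit(train[FEATURES], train["charges"])

# What a request body might look like (field names are an assumption)
request = {"age": 40, "sex": "female", "bmi": 30.0,
           "children": 2, "smoker": "yes", "region": "northwest"}
features = to_feature_row(request)
prediction = float(model.predict(features)[0])
print(features.iloc[0].to_dict())
print(prediction)
```

Keeping the column list in one place guards against a common deployment bug, where the serving code builds features in a different order than the training code.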

This project was a collaborative effort by:

Keshav Laddha (23UCS616)
Devansh Goyal (23UCS563)
Divyansh Chhabra (23UCS572)

Each member contributed significantly:

From data preprocessing to visualisation
From model development to documentation

Overall, it was a great experience that combined both technical rigor and team coordination.

🔗 GitHub Link: https://github.com/devanshgoyal001/medical_insurance_predictor

🧠 Predicting Medical Insurance Prices Using Machine Learning in Python | by Keshav | May, 2025
