Question 3: Can Life Expectancy be predicted from Income?
This question is used to demonstrate classification. Since the data set contains no categorical values, life expectancy is used to create two classes, countries with high and with low life expectancy, with the threshold at the 75th percentile (77.4 years). The task is to predict whether a country falls into the high or the low class. First, create descriptive statistics using the pandas describe method:
import pandas as pd

LIFE_EXP = "Life expectancy at birth, total (years) [SP.DYN.LE00.IN]"
df = pd.read_csv("worldbank_development.csv", sep=',')
df = df.dropna(how='any', axis=0)
# Drop rows where the value is the placeholder string ".." instead of a number
life_exp = df.loc[df[LIFE_EXP] != "..", LIFE_EXP]
# The column is read as text, so convert it to float before summarizing
life_exp.astype(float).describe()
Output:
count 265.000000
mean 72.175783
std 7.216460
min 50.596000
25% 66.924000
50% 72.647321
75% 77.449874
max 86.089000
Name: Life expectancy at birth, total (years) [SP.DYN.LE00.IN], dtype: float64
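The step from the 75th percentile to the two classes can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the values in life_exp are made up, and treating the boundary as strict (only values above the threshold count as high) is an assumption.

```python
import pandas as pd

# Illustrative values only, not taken from the World Bank data set.
life_exp = pd.Series([52.0, 61.5, 66.9, 70.2, 72.6, 75.0, 77.4, 81.3, 84.1])

# The 75th percentile of life expectancy acts as the class boundary.
threshold = life_exp.quantile(0.75)

# 1 = "high" life expectancy (strictly above the threshold), 0 = "low".
# Whether the boundary itself counts as high is an assumption here.
labels = (life_exp > threshold).astype(int)

print("threshold:", threshold)
print("class counts:", labels.value_counts().to_dict())
```

By construction roughly a quarter of the countries end up in the high class, so the two classes are not balanced.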
As a classifier I used DecisionTreeClassifier from scikit-learn. A scatter plot of the data shows that income and life expectancy are strongly correlated. This is especially true for low-income countries (below $30,000). For countries with a GNI above $50,000 the correlation is nearly zero, so a further increase in income has almost no effect on life expectancy for those countries.
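A minimal sketch of such a classifier is shown here. It runs on synthetic data that only imitates the shape described above (steep rise at low incomes, flattening at high incomes). The train/test split, max_depth=3 and the random seeds are assumptions for illustration and are not taken from the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): income in USD and a
# life expectancy that grows with the logarithm of income plus noise.
income = rng.uniform(500, 80_000, size=200)
life_exp = 35 + 10 * np.log10(income) + rng.normal(0, 2, size=200)

# Two classes: 1 = above the 75th percentile of life expectancy, 0 = below.
threshold = np.quantile(life_exp, 0.75)
y = (life_exp > threshold).astype(int)
X = income.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix

# Hold out a quarter of the samples; stratify keeps the class ratio equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree (max_depth is an illustrative choice) limits overfitting.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("test accuracy:", clf.score(X_test, y_test))
```

With a single feature the tree effectively learns one or more income cut-offs that separate the two classes.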
When the fit of the model is evaluated with a confusion matrix, the results are good to very good, although the scatter plot does not show two distinct clusters. To evaluate this further, metrics such as accuracy (correct predictions divided by total predictions), precision (the share of positive predictions that are actually positive) and recall (also known as sensitivity, the share of actual positives that are found) can be used. The F1 score combines precision and recall. Extra care has to be taken when one of the two groups is far more prevalent than the other. This can happen, for example, with rare diseases such as HIV. In this case labeling all results as negative may still seem very accurate, since this is correct in, for example, 99% of the cases. For this reason metrics such as precision and recall are needed.
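The effect of class imbalance on these metrics can be checked with a small worked example. The sketch below computes the four metrics directly from the confusion matrix counts in plain Python. The data (1 positive case among 100) are made up, and returning 0.0 when a denominator is zero is a convention chosen here, not the only possible one.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1; 0.0 is used when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: 1 positive case among 100, and a "model" that
# simply labels every case as negative.
y_true = [1] + [0] * 99
all_negative = [0] * 100

result = metrics(y_true, all_negative)
print(result)
```

Here the accuracy is 0.99 although not a single positive case is found: recall and F1 are both 0, which exposes the problem that accuracy alone hides.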