World Bank Data

In this notebook we will model world development, inspired by the presentations of Hans Rosling on 200 years that changed the world.

We will go through the steps of downloading data from an Application Programming Interface (API), plotting it, fitting a linear regression model (with non-linear terms) and then using the model to make future predictions.

First we use the World Bank API in order to download the data.

#Import libraries
#Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
#Dataframes (pandas) and math operations (numpy)
import pandas as pd
import numpy as np
#World bank data interface
import world_bank_data as wb
#Import machine learning tools (for linear regression)
import sklearn.linear_model as skl_lm
import itertools
import math

# Default plotting options.
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False

Now we download the data set, indexed with the country code.

The data is downloaded from the World Bank.

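# 'NY.GNP.PCAP.CD' is the World Bank code for GNI per capita (Atlas method, current US$);
# the notebook uses it as its income measure and refers to it as GDP per person.
gdp_percapita = wb.get_series('NY.GNP.PCAP.CD', id_or_value='id', simplify_index=True)
gdp_percapita
# 'SH.DYN.MORT' is the under-5 mortality rate (per 1,000 live births).
childmortality = wb.get_series('SH.DYN.MORT', id_or_value='id', simplify_index=True)
childmortality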
Country  Year
AFE      1960     NaN
         1961     NaN
         1962     NaN
         1963     NaN
         1964     NaN
                 ...
ZWE      2018    53.7
         2019    52.7
         2020    51.8
         2021    49.5
         2022     NaN
Name: SH.DYN.MORT, Length: 16758, dtype: float64
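
Both series are indexed by (Country, Year), so they can be lined up by label instead of by row position. The sketch below illustrates this on a small hand-made series standing in for the API result. The country codes AAA and BBB and all numbers are invented for illustration, and the index level names are assumed to match the output above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two downloaded series; the values are made up.
index = pd.MultiIndex.from_product(
    [["AAA", "BBB"], ["2000", "2001", "2002"]], names=["Country", "Year"]
)
toy_cm = pd.Series(
    [90.0, 88.0, 85.0, 40.0, np.nan, 37.0], index=index, name="SH.DYN.MORT"
)
toy_income = pd.Series(
    [500.0, 520.0, 560.0, 3000.0, 3100.0, 3300.0], index=index, name="NY.GNP.PCAP.CD"
)

# Align on the shared (Country, Year) index so each row holds both indicators.
toy = pd.concat([toy_cm, toy_income], axis=1)

# Select one country by its code; the result is indexed by Year only.
one_country = toy.loc["AAA"]
print(one_country)
```

Aligning by label means a missing or reordered row in one series cannot silently pair a value with the wrong country or year.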

Now we reorganise the data into a dataframe (like an Excel spreadsheet) with columns Year, Country, Child Mortality (under-5 deaths per 1,000 live births) (\(C(k)\)), log Gross Domestic Product per person (\(G(k)\)), and the change in these two variables over time \(k\), namely \(C(k+1)-C(k)\) and \(G(k+1)-G(k)\).

(This can take a minute or so to run.)

cm=childmortality.to_frame()
gdp=gdp_percapita.to_frame()

df = pd.DataFrame(columns=('Year', 'Country','Child Mortality','GDP','Diff CM','Diff GDP'))
countries = cm.index
this_country=''
this_cm=0
this_gdp=0
j=0

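for i,c in enumerate(countries):
    #Keep the previous row's values before overwriting them.
    prev_cm=this_cm
    prev_gdp=this_gdp
    prev_country=this_country

    #Update for this year. Row i is read from both frames by position,
    #which assumes the two series share the same (Country, Year) ordering.
    this_year=c[1]
    this_cm=cm.iloc[i]['SH.DYN.MORT']
    this_gdp=np.log(gdp.iloc[i]['NY.GNP.PCAP.CD'])
    this_country=c[0]

    if (prev_country==this_country):
        #Compute differences from the previous year and store the row,
        #skipping rows where either difference is missing.
        diff_cm=this_cm-prev_cm
        diff_gdp=this_gdp-prev_gdp
        if not(np.isnan(diff_cm) or np.isnan(diff_gdp)):
            df.loc[j] = [this_year,this_country,this_cm,this_gdp,diff_cm,diff_gdp]
            j=j+1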

df.head()
#df.to_csv('../data/CM_GDP.csv')
   Year Country  Child Mortality       GDP   Diff CM  Diff GDP
0  1991     AFE       164.161055  6.620620 -1.671507  0.052054
1  1992     AFE       162.695217  6.554938 -1.465838 -0.065682
2  1993     AFE       161.379202  6.544833 -1.316014 -0.010106
3  1994     AFE       161.983963  6.530047  0.604761 -0.014786
4  1995     AFE       156.580561  6.579397 -5.403402  0.049351


Uncommenting the df.to_csv line above saves the data set as CM_GDP.csv in a directory called data, one level above the working directory.
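
The row-by-row loop above can also be written with grouped operations. The following is a sketch on invented numbers, not the notebook's own method: the column names mirror those used above, while the helper name add_differences and the toy country codes are hypothetical. Differences are taken within each country, so the first year of one country never borrows the last year of the previous one.

```python
import numpy as np
import pandas as pd

def add_differences(frame):
    """Return rows with year-on-year differences computed within each country."""
    out = frame.sort_values(["Country", "Year"]).copy()
    # Work with log income, as in the loop above.
    out["GDP"] = np.log(out["GDP"])
    # diff() within each group gives value(this year) - value(previous year).
    out["Diff CM"] = out.groupby("Country")["Child Mortality"].diff()
    out["Diff GDP"] = out.groupby("Country")["GDP"].diff()
    # Drop rows where either difference is missing.
    return out.dropna(subset=["Diff CM", "Diff GDP"]).reset_index(drop=True)

# Toy input; the values are made up and are not World Bank data.
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Country": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Child Mortality": [90.0, 88.0, 85.0, 40.0, np.nan, 37.0],
    "GDP": [500.0, 520.0, 560.0, 3000.0, 3100.0, 3300.0],
})
result = add_differences(toy)
print(result)
```

Because the income column is logged before differencing, Diff GDP equals the log of the ratio between consecutive years, which for small changes is close to the relative growth rate.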

