Spurious correlations¶

Examples in this notebook are taken from https://www.tylervigen.com/spurious-correlations, which also has links to the original data sources.

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

The following tables record four series of numbers measured annually between 2012 and 2021.

In [2]:

headers = [
 "Bachelor's degrees awarded in Psychology",
 "The number of groundkeepers in Utah",
 "Number of babies called Georgia",
 "Fossil fuel use in Somalia (TWh)"
]

short_headers = ["Psychology","Groundkeepers","Georgia","Somalia"]

data = np.array([
[2012,2013,2014,2015,2016,2017,2018,2019,2020,2021],
[109099,114446,117312,117573,117447,116859,116436,116550,119989,126944],
[8070,9200,9900,9800,9640,9450,9510,9510,9990,11400],
[1067,1264,1333,1438,1417,1319,1421,1477,1611,1739],
[0.32,0.33,0.33,0.33,0.33,0.33,0.33,0.329,0.35,0.3724]
]).T

With a little massage we can convert them into a pandas dataframe.

In [3]:

df = pd.DataFrame(data[:,1:],columns=short_headers,index=data[:,0].astype(int))

df

Out[3]:

	Psychology	Groundkeepers	Georgia	Somalia
2012	109099.0	8070.0	1067.0	0.3200
2013	114446.0	9200.0	1264.0	0.3300
2014	117312.0	9900.0	1333.0	0.3300
2015	117573.0	9800.0	1438.0	0.3300
2016	117447.0	9640.0	1417.0	0.3300
2017	116859.0	9450.0	1319.0	0.3300
2018	116436.0	9510.0	1421.0	0.3300
2019	116550.0	9510.0	1477.0	0.3290
2020	119989.0	9990.0	1611.0	0.3500
2021	126944.0	11400.0	1739.0	0.3724

It turns out that these time series are strongly correlated: the correlation coefficients in the table below are all close to $1$. As explained at https://www.tylervigen.com/spurious-correlations, these correlations do not really mean anything. Nonetheless, that site will give you a fake AI-generated explanation for each correlation if you ask for one.

In [4]:

df.corr()

Out[4]:

	Psychology	Groundkeepers	Georgia	Somalia
Psychology	1.000000	0.989845	0.939223	0.911298
Groundkeepers	0.989845	1.000000	0.912990	0.875781
Georgia	0.939223	0.912990	1.000000	0.856114
Somalia	0.911298	0.875781	0.856114	1.000000

The next cell generates scatter plots to illustrate the above correlations. If the correlations were perfect, all the points would lie on a straight line running from the bottom left to the top right.

In [4]:

u = [[[0,1],[0,2],[0,3]],[[1,2],[1,3],[2,3]]]
fig, ax = plt.subplots(2,3,figsize=(12,8))
for p in range(2):
    for q in range(3):
        i,j = u[p][q]
        ax[p,q].set_xticks([])
        ax[p,q].set_yticks([])
        ax[p,q].scatter(df[short_headers[i]],df[short_headers[j]])
        ax[p,q].set_xlabel(short_headers[i])
        ax[p,q].set_ylabel(short_headers[j])

No description has been provided for this image

Our four series of numbers have very different scales: the numbers in the first column are close to $10^5$, but those in the last column are close to $0.3$, for example. To compare the numbers, it is best to normalise: we subtract the mean of each column and then divide by the standard deviation to get new columns in which the mean is zero and the standard deviation is one. We make a new dataframe called dfn with these normalised columns.

We then plot the normalised values against time, and observe that the columns are quite similar.

In [5]:

dfn = pd.DataFrame(index = df.index)
for h in short_headers:
    dfn[h + ' (normalised)'] = (df[h] - df[h].mean()) / df[h].std()

dfn.plot()

Out[5]:

<Axes: >

All the above curves look quite similar to cubic polynomials, so we now try to approximate them by such polynomials. The general form of a cubic is given by the function p(x, a0, a1, a2, a3) below. Our task is to find the best values for the parameters a0, a1, a2, a3. Here the obvious thing would be to let x be the number of the year, running from 2012 to 2021. However, if we did that then the values of $x^3$ would be rather large, which would lead to problems with overflow and inaccurate cancellation. We will therefore subtract 2010 so that $x$ runs from 2 to 19.

In [6]:

def p(x, a0, a1, a2, a3):
    return a0 + a1*x + a2*x**2 + a3*x**3

We now use the curve_fit() function from scipy.optimize to fit a cubic polynomial to each of the four time series, and we plot each series along with its cubic approximation, using a different colour for each series.

In [7]:

colours = ['r', 'g', 'b', 'y']
a = np.zeros((4, 4))
for i in range(4):
    ts0 = dfn.index.to_numpy().astype(float) - 2010
    ys0 = dfn.iloc[:,i].to_numpy()
    ts = np.linspace(2012, 2021, 100) - 2010
    a[i] = curve_fit(p, ts0, ys0, p0=[0,0,0,0])[0]
    ys = p(ts, *a[i])
    plt.plot(ts0, ys0, 'o', color = colours[i])
    plt.plot(ts, ys, color = colours[i])

In [ ]: