Spurious correlations¶
Examples in this notebook are taken from https://www.tylervigen.com/spurious-correlations, which also has links to the original data sources.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
The following tables record four series of numbers measured annually between 2012 and 2021.
headers = [
"Bachelor's degrees awarded in Psychology",
"The number of groundkeepers in Utah",
"Number of babies called Georgia",
"Fossil fuel use in Somalia (TWh)"
]
short_headers = ["Psychology","Groundkeepers","Georgia","Somalia"]
data = np.array([
[2012,2013,2014,2015,2016,2017,2018,2019,2020,2021],
[109099,114446,117312,117573,117447,116859,116436,116550,119989,126944],
[8070,9200,9900,9800,9640,9450,9510,9510,9990,11400],
[1067,1264,1333,1438,1417,1319,1421,1477,1611,1739],
[0.32,0.33,0.33,0.33,0.33,0.33,0.33,0.329,0.35,0.3724]
]).T
With a little massage we can convert them into a pandas dataframe.
df = pd.DataFrame(data[:,1:],columns=short_headers,index=data[:,0].astype(int))
df
| Psychology | Groundkeepers | Georgia | Somalia | |
|---|---|---|---|---|
| 2012 | 109099.0 | 8070.0 | 1067.0 | 0.3200 |
| 2013 | 114446.0 | 9200.0 | 1264.0 | 0.3300 |
| 2014 | 117312.0 | 9900.0 | 1333.0 | 0.3300 |
| 2015 | 117573.0 | 9800.0 | 1438.0 | 0.3300 |
| 2016 | 117447.0 | 9640.0 | 1417.0 | 0.3300 |
| 2017 | 116859.0 | 9450.0 | 1319.0 | 0.3300 |
| 2018 | 116436.0 | 9510.0 | 1421.0 | 0.3300 |
| 2019 | 116550.0 | 9510.0 | 1477.0 | 0.3290 |
| 2020 | 119989.0 | 9990.0 | 1611.0 | 0.3500 |
| 2021 | 126944.0 | 11400.0 | 1739.0 | 0.3724 |
It turns out that these time series are strongly correlated: the correlation coefficients in the table below are all close to $1$. As explained at https://www.tylervigen.com/spurious-correlations, these correlations do not really mean anything. Nonetheless, that site will give you a fake AI-generated explanation for each correlation if you ask for one.
df.corr()
| Psychology | Groundkeepers | Georgia | Somalia | |
|---|---|---|---|---|
| Psychology | 1.000000 | 0.989845 | 0.939223 | 0.911298 |
| Groundkeepers | 0.989845 | 1.000000 | 0.912990 | 0.875781 |
| Georgia | 0.939223 | 0.912990 | 1.000000 | 0.856114 |
| Somalia | 0.911298 | 0.875781 | 0.856114 | 1.000000 |
The next cell generates scatter plots to illustrate the above correlations. If the correlations were perfect, all the points would lie on a straight line running from the bottom left to the top right.
u = [[[0,1],[0,2],[0,3]],[[1,2],[1,3],[2,3]]]
fig, ax = plt.subplots(2,3,figsize=(12,8))
for p in range(2):
for q in range(3):
i,j = u[p][q]
ax[p,q].set_xticks([])
ax[p,q].set_yticks([])
ax[p,q].scatter(df[short_headers[i]],df[short_headers[j]])
ax[p,q].set_xlabel(short_headers[i])
ax[p,q].set_ylabel(short_headers[j])
Our four series of numbers have very different scales: the numbers in the first column are close to $10^5$, but those in the last column are close to $0.3$, for example. To compare the numbers, it is best to normalise: we subtract the mean of each column and then divide by the standard deviation to get new columns in which the mean is zero and the standard deviation is one. We make a new dataframe called dfn with these normalised columns.
We then plot the normalised values against time, and observe that the columns are quite similar.
dfn = pd.DataFrame(index = df.index)
for h in short_headers:
dfn[h + ' (normalised)'] = (df[h] - df[h].mean()) / df[h].std()
dfn.plot()
<Axes: >
All the above curves look quite similar to cubic polynomials, so we now try to approximate them by such polynomials. The general form of a cubic is given by the function p(x, a0, a1, a2, a3) below. Our task is to find the best values for the parameters a0, a1, a2, a3. Here the obvious thing would be to let x be the number of the year, running from 2012 to 2021. However, if we did that then the values of $x^3$ would be rather large, which would lead to problems with overflow and inaccurate cancellation. We will therefore subtract 2010 so that $x$ runs from 2 to 19.
def p(x, a0, a1, a2, a3):
return a0 + a1*x + a2*x**2 + a3*x**3
We now use the curve_fit() function from scipy.optimize to fit a cubic polynomial to each of the four time series, and we plot each series along with its cubic approximation, using a different colour for each series.
colours = ['r', 'g', 'b', 'y']
a = np.zeros((4, 4))
for i in range(4):
ts0 = dfn.index.to_numpy().astype(float) - 2010
ys0 = dfn.iloc[:,i].to_numpy()
ts = np.linspace(2012, 2021, 100) - 2010
a[i] = curve_fit(p, ts0, ys0, p0=[0,0,0,0])[0]
ys = p(ts, *a[i])
plt.plot(ts0, ys0, 'o', color = colours[i])
plt.plot(ts, ys, color = colours[i])