Favourite colours

In this notebook (which covers roughly the same ground as this video) we process a dataset of people's favourite colours. (We will need to use the US spelling of "color" in various places.) The data is in the file favourite_colours.csv, which was downloaded from https://www.kaggle.com/datasets/bsoyka3/favorite-colors?resource=download. This file starts with a header row:

Country,State or territory,Age in years,Gender,Favorite color

Each subsequent row contains information about one person who responded to a survey. The number of respondents is N=84. We will just be interested in the last entry in each row, which is a colour specified as a hex code, such as #a313d4. There are many places online where you can convert hex codes to colour names or vice-versa, such as https://www.color-hex.com/. For example, #a313d4 is a shade of violet. Most respondents have chosen a very specific colour that no other respondent chose. Our goal will be to divide the respondents into a small number of clusters with similar favourite colours, and make some plots to display the sizes of these clusters. We will use the pandas and sklearn packages for this. We will also use the colory package (from https://github.com/apoorvaeternity/colory) to look up names for colours (based on the names at https://xkcd.com/color/rgb/).

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d
from sklearn.cluster import KMeans
from colory.color import Color

# %matplotlib widget

We first import the data into a pandas dataframe using the function pd.read_csv().

In [5]:
data_dir = '../data/misc'
df = pd.read_csv(data_dir + "/favourite_colours.csv")
df
Out[5]:
Country State or territory Age in years Gender Favorite color
0 United States Texas 14 Male #0b389b
1 United States West Virginia 17 Male #000080
2 United States Georgia 17 Male #42f587
3 United States California 16 Female #ffb8e8
4 United States Texas 16 Female #6facde
... ... ... ... ... ...
79 United States Washington 29 Female #a911f5
80 United States Texas 19 Female #ffd061
81 Saudi Arabia NaN 14 NaN #ff8c00
82 United States Illinois 24 Other #ba260f
83 United States Texas 25 Male #5f14d9

84 rows × 5 columns

We next want to put the favourite colour information into a more digestible form. Colours are represented as triples (r,g,b), where r is the amount of red, g is the amount of green and b is the amount of blue. Each of these numbers lies between $0$ and $2^8-1=255$. The numbers are often represented in hexadecimal, i.e. base 16. In hexadecimal a represents 10, b represents 11, c represents 12, d represents 13, e represents 14 and f represents 15. Thus a3 represents $10\times 16+3=163$ and 13 represents $1\times 16+3=19$ and d4 represents $13\times 16+4=212$. This means that the hex code #a313d4 represents the triple (163,19,212) with a lot of blue, a bit less red, and only a little green, making violet.

In Python, we can convert a hexadecimal string s to an integer using int(s, 16). All of our colour codes c start with a # in position 0, so the amount of red is represented by the characters in positions 1 and 2, which form the string c[1:3].

In [10]:
df['fc_r'] = df['Favorite color'].apply(lambda c: int(c[1:3], 16))
df['fc_g'] = df['Favorite color'].apply(lambda c: int(c[3:5], 16))
df['fc_b'] = df['Favorite color'].apply(lambda c: int(c[5:7], 16))
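As a quick sanity check of the slicing used above, the following sketch converts a single hex code without pandas. The helper name hex_to_rgb is introduced here purely for illustration and is not part of the original notebook.

```python
def hex_to_rgb(code):
    """Convert a hex colour code such as '#a313d4' to an (r, g, b) triple."""
    # Positions 1-2, 3-4 and 5-6 hold the red, green and blue components.
    return tuple(int(code[i:i + 2], 16) for i in (1, 3, 5))

example = hex_to_rgb("#a313d4")
print(example)  # (163, 19, 212)
```

The same slices c[1:3], c[3:5] and c[5:7] appear in the three apply() calls above, one per column.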

We next want to divide the respondents into clusters with similar favourite colours. (In fact there is not really any clear clustering in this dataset, so the process is a bit artificial, but it will do as an example.) We will use the KMeans class from the package sklearn.cluster; this implements the $k$-means clustering algorithm, for which there are many explanations on the internet. We first set $n$ to be the number of clusters that we want; we will initially take $n=7$ but you can change that if you want. We next need to construct a numpy array of shape (N, 3) with one row for each respondent containing the corresponding (red, green, blue) values. (Here N is the number of respondents, which is $84$.) By combining the columns fc_r, fc_g and fc_b from our dataframe we get an array of shape (3, N) and then we append .T to take the transpose, which gives the required matrix colour_matrix of shape (N, 3). We then call kmeans = KMeans(n_clusters=n, n_init=10) to create a $k$-means fitter, and then call kmeans.fit(colour_matrix) to actually do the fitting and find the clusters. The argument n_init=10 tells the fitter to run the algorithm ten times from different starting points and keep the best result.

In [11]:
n = 7
colour_matrix = np.array([df['fc_r'], df['fc_g'], df['fc_b']]).T
kmeans = KMeans(n_clusters=n, n_init=10)
kmeans.fit(colour_matrix)
None
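For readers who would like to see the idea behind $k$-means without looking it up, here is a minimal sketch of the basic iteration (often called Lloyd's algorithm): assign each point to its nearest centre, then move each centre to the mean of its points. This is an illustration only and is not how sklearn implements KMeans, which chooses its starting centres more carefully. The function name lloyd_kmeans and the toy data are made up for this example.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Basic k-means iteration: assign points to nearest centre, then recompute centres."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct data points chosen at random.
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centre: shape (N, k).
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its points; leave it alone if its cluster is empty.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Made-up data: three reddish and three bluish (r, g, b) triples.
toy = np.array([[250, 10, 10], [240, 20, 5], [255, 0, 20],
                [10, 10, 250], [5, 20, 240], [20, 0, 255]])
toy_labels, toy_centres = lloyd_kmeans(toy, k=2)
print(toy_labels)
print(toy_centres)
```

Which group receives label 0 depends on the random starting centres, so only the grouping itself is meaningful, not the label numbers.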

The main result of calling kmeans.fit(colour_matrix) is to set kmeans.labels_. This becomes an array of length $N$, with entries in the range $0,\dotsc,n-1$; the $i$'th entry is $j$ if respondent $i$'s favourite colour is in cluster $j$. We make this array into a new column df['ct'] in our dataframe.

We next want to find a typical colour for each cluster. As a result of the fitting process, kmeans.cluster_centers_ becomes an array of shape (n, 3) in which row j is the (red, green, blue) triple for the centre of the jth cluster. The entries in this array are floating point numbers rather than integers. We can apply np.rint() to round them, but there is a slight issue: the result contains floating point numbers like 12.0 whose fractional part is zero, but those are technically different from actual integers. We append .astype(int) to convert them to integers. Now we have rows of the form [r,g,b], and we want to convert each row into a hex code. We first form the integer $d=16^4r+16^2g+b$, so the hexadecimal representation of $d$ is the representation of $r$ followed by the representation of $g$ followed by the representation of $b$ (each padded to two digits). We then use the format string f"#{d:06x}" to convert this into a hex code, consisting of the character # followed by a hexadecimal number padded with zeros if necessary to make the length equal to $6$.

In [14]:
df['ct'] = kmeans.labels_
cluster_cols = [f"#{d:06x}" for d in (np.rint(kmeans.cluster_centers_).astype(int) @ [16 ** 4,16 ** 2,1])]
cluster_cols
Out[14]:
['#2498ac', '#5a18a3', '#c1b1da', '#b80d14', '#298c30', '#100808', '#e8b417']
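To see the rounding and formatting steps on a single row, here is a small worked example. The floating point centre below is a hypothetical value chosen for illustration, not one read from the fitted model.

```python
import numpy as np

# Hypothetical cluster centre (floating point), chosen for illustration.
centre = np.array([36.2, 152.4, 171.6])

rgb = np.rint(centre).astype(int)       # rounds to [36, 152, 172]
d = int(rgb @ [16 ** 4, 16 ** 2, 1])    # 36*65536 + 152*256 + 172 = 2398380
hex_code = f"#{d:06x}"
print(hex_code)  # #2498ac

# The zero padding matters when the red component is small.
padded = f"#{int(np.array([0, 0, 128]) @ [16 ** 4, 16 ** 2, 1]):06x}"
print(padded)  # #000080
```

Without the 06 in the format string, the second colour would come out as #80, which is not a valid six-digit hex code.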

We next use the Color function (which we imported from colory.color) to find names for the cluster colours. (You can find the code for the colory package on GitHub. It is not too complicated, so you could take it as an exercise to understand it.) The names are based on the list at https://xkcd.com/color/rgb/. This is associated with the web comic XKCD, which everyone should read. We also use np.bincount() to make a list of the number of respondents in each cluster. We display the result as a bar chart and then as a pie chart.

In [15]:
cluster_names = [Color(c,"xkcd").name for c in cluster_cols]
cluster_counts = np.bincount(df['ct'])
plt.bar(range(n), cluster_counts, color = cluster_cols)
plt.xticks(range(n), cluster_names, rotation='vertical')
plt.show()
plt.pie(cluster_counts, labels=cluster_names, colors=cluster_cols)
plt.show()
None
[Figure: bar chart of the number of respondents in each cluster, with bars coloured and labelled by cluster colour]
[Figure: pie chart of the same cluster counts]
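The behaviour of np.bincount() is easiest to see on a tiny input. The labels below are made up; entry j of the result is the number of times j occurs. The minlength argument is an optional extra, not used in the cell above, which guarantees one entry per cluster even if the highest-numbered cluster happens to be empty.

```python
import numpy as np

# Made-up cluster labels for six respondents, with 4 clusters in total.
labels = np.array([0, 2, 2, 1, 0, 2])
counts = np.bincount(labels, minlength=4)
print(counts)  # [2 1 3 0]
```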

Finally, we make a condensed summary of the clusters for use elsewhere.

In [16]:
data = [[cluster_cols[i], cluster_names[i], cluster_counts[i]] for i in range(n)]
data
Out[16]:
[['#2498ac', 'Sea', np.int64(13)],
 ['#5a18a3', 'Indigo Blue', np.int64(17)],
 ['#c1b1da', 'Cloudy Blue', np.int64(17)],
 ['#b80d14', 'Scarlet', np.int64(9)],
 ['#298c30', 'Darkish Green', np.int64(11)],
 ['#100808', 'Almost Black', np.int64(10)],
 ['#e8b417', 'Squash', np.int64(7)]]
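The counts in this summary are numpy integers (np.int64) rather than plain Python integers. If the summary is to be saved in a format such as JSON (an assumption about what "use elsewhere" might involve), it helps to convert them with int() first, since the standard json module does not accept np.int64 values. A minimal sketch, using stand-in values copied from the first two rows of the output above:

```python
import json
import numpy as np

# Stand-in values copied from the first two rows of the output above.
cols = ['#2498ac', '#5a18a3']
names = ['Sea', 'Indigo Blue']
counts = np.array([13, 17])

# int() turns each np.int64 into a plain Python integer, which json can serialise.
summary = [[c, name, int(k)] for c, name, k in zip(cols, names, counts)]
text = json.dumps(summary)
print(text)
```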