MAS2008 Scientific Computing: Lab 8
Data analysis and curve fitting with Pandas

Instructions

The following notebooks and videos are relevant for this lab:


Favourite colours	View	Download
Mauna Loa carbon dioxide concentration	View	Download
Peak District hills	View	Download

To try the above notebooks yourself, you will need to download the following data files:

favourite_colours.csv

co2_daily_mlo.txt

peak_district_hills.csv

There is also an interactive demonstration explaining pandas selection methods.

Task 1: The Periodic Table

Download the file elements.csv and load it into a pandas DataFrame. There are online test question corresponding to some of the subtasks below.

Work out how many columns there are in the dataframe.
Each element has a type: lithium is an alkali metal, chlorine is a halogen and so on. Work out how to get a list of all the types, in alphabetical order, without repetitions. (You could search for "pandas unique" if necessary.)
Use the groupby() method to count how many elements there are of each type.
Make a smaller dataframe which just lists the names and symbols of elements that are gasses at standard temperature and pressure. (Use the Phase column to determine this.) Read the documentation for reset_index() and use it to tidy up the index of the new dataframe without introducing a useless extra column.
Extract the row for the artificial element Darmstadtium. You should see that many values are listed as NaN, because they have not been measured. Use the dropna() method to remove these rows.
Use the argmax() method to extract the row corresponding to the element with the highest boiling point.

Enter code as follows:


     type_colours = pd.DataFrame([
       ['Alkali Metal','purple'],
       ['Alkaline Earth Metal','blue'],
       ['Halogen','cyan'],
       ['Transition Metal','green'],
       ['Metal','yellow'],
       ['Metalloid','orange'],
       ['Nonmetal','red'],
       ['Noble Gas','grey'],
       ['Actinide','brown'],
       ['Transactinide','purple'],
       ['Lanthanide','pink']
      ],columns=['Type','Colour'])

      df = df.merge(type_colours,how='left',on='Type')

This will add a column to your dataframe which gives a colour for each type of element.

Use matplotlib to make a picture of the periodic table. For each element, you should use ax.text() to put the symbol in the correct place, as determined by the Display Row and Display Column columns in the dataframe, with the colour determined by the Colour column. When I did this, I found it necessary to draw a rectangular box around the outside, otherwise the positions got messed up for some reason that I have not understood.

Task 2: Whitby tides

Download the file whitby_tides.txt, which contains a record of the tide levels at Whitby (in North Yorkshire) every 15 minutes in 2023. We will want to convert this dataset into a pandas DataFrame, but this will require some auxiliary steps. If the above file has been saved in the same folder as this notebook, we can read all the lines in the file as follows:


    with open(f'whitby_tides.txt') as f:
      lines = f.readlines()

The first few lines of the file are headers, which you should ignore. The remaining lines are like this:


    10) 2023/01/01 02:15:00     3.681      0.391  
    11) 2023/01/01 02:30:00     3.498      0.388  
    12) 2023/01/01 02:45:00     3.323M     0.389M 
    13) 2023/01/01 03:00:00     3.154M     0.390M

For each line, you should extract the time and the tide level. You can either use line.split() or slices like line[10:25] to get parts of the string. You should represent the time as the number of hours since the start of the year. Given a suitable string s, the function pd.Timestamp(s).timestamp() will give you the number of seconds since the start of 1970, and you can work from there. The tide level is the first number after the time. Some of the tide levels are followed by an M, which you should ignore. If the tide level is given as -99, that means that the data is missing, so you should skip that line. You should also skip any lines that are empty. It will probably be simplest to start by creating a list, but when the list is complete you should convert it into a pandas dataframe with columns time and level.

Now plot the tide levels against time. If you do this for the whole dataset then the plot will be very crowded, so you should experiment with plotting a few days or weeks at a time. The time will be given in hours, but you could add vertical dotted lines every 24 hours to make the daily variation easier to see.

Next, we will try to fit a curve to the data. Start by copying the following code:


    tau = [
     12.4206012,  # Principal lunar semidiurnal
     12,          # Principal solar semidiurnal
     12.65834751, # Larger lunar elliptic semidiurnal
     23.93447213, # Lunar diurnal
     6.210300601, # Shallow water overtides of principal lunar
     25.81933871, # Lunar diurnal
     4.140200401, # Shallow water overtides of principal lunar
     8.177140247, # Shallow water terdiurnal
     6            # Shallow water overtides of principal solar
    ]

    omega = 2*np.pi/np.array(tau)

This defines an array $(\omega_0,\dotsc,\omega_8)$. Theory (which you can read about on Wikipedia) predicts that the tide level at time $t$ should be given by $$ p(t,a0,\dotsc,a18) = a0 + a1\cos(\omega_0 t) + a2\sin(\omega_0 t) + \dotsb + a17\cos(\omega_8 t) + a18\sin(\omega_8 t). $$ Enter this as a Python definition. One approach is to just write out the twenty function arguments and nineteen terms in the sum explicitly. Alternatively, you can start the definition with def p(t,*a):, then a will be a tuple and you can write a loop or use numpy array methods.

Now use scipy.optimize.curve_fit to fit the curve to the data. You can see the Mauna Loa notebook (linked at the top of this page) for an example of how to do this. Plot the actual tide level and the modelled tide level on the same graph, for $t$ from 1000 to 2000 hours.

Task 3: Spurious correlations

Consider the following data (obtained from Tyler Vigen's website):

  [
   [2012,2013,2014,2015,2016,2017,2018,2019,2020,2021],
   [109099,114446,117312,117573,117447,116859,116436,116550,119989,126944],
   [8070,9200,9900,9800,9640,9450,9510,9510,9990,11400],
   [1067,1264,1333,1438,1417,1319,1421,1477,1611,1739],
   [0.32,0.33,0.33,0.33,0.33,0.33,0.33,0.329,0.35,0.3724]
  ]

The first row is a list of years.
The second row is the number of bachelors degrees in psychology awarded in each year.
The third row is the number of people employed as groundkeepers in Utah.
The fourth row is the number of babies named Georgia born in each year.
The fifth row os the number of terawatt hours of fossil fuel energy used in Somalia in each year.

Create a pandas dataframe df from this data. The years should give the index, and the column names should be 'Psychology', 'Groundkeepers', 'Georgia' and 'Somalia'. Then use df.corr() to calculate the correlations between the columns. You should find that the correlations are all quite high, suggesting that the time variation of the columns is similar.

To visualise this, make a scatter plot of the psychology column against the groundkeepers column. You should see that the points lie very close to a straight line. For other pairs of columns the relationship is still strong but not as clear.

Now use df.plot() to plot all the columns against time. The plot will not be very informative, because the scales are so different: the number of bachelors degrees is in the hundreds of thousands, while the energy usage numbers are all less than one. To fix this, we can normalise each column by subtracting the mean and dividing by the standard deviation (which can be calculated by the mean() and std() methods). Create a new dataset dfn with the same headers but with normalised columns. Now dfn.plot() should show four graphs that are fairly similar.

Task 4: Fruit and vegetables

Download and unzip the file food.zip, which contains 48 images of fruit and vegetables. There are six different types (apples, bananas, carrots, oranges, plums and green peppers) with eight images of each type. We will try to classify the images by type, just using the overall colour. We will need the following imports:


    from PIL import Image, ImageDraw
    import pandas as pd
    from pathlib import Path
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

Now enter files = list(Path('fruit').glob('*.jpg')) to get a list of all the image files. (This assumes that your notebook is in a certain folder, and that the images are in a subfolder of that folder called fruit.) Now enter img = Image.open(files[0]) and display(img) to load and display the first image. Then enter A = np.asarray(img) to get an array of shape (500,500,3) in which A[i,j,0] is the amount of red at position (i,j), and A[i,j,1] is the amount of green, and A[i,j,2] is the amount of blue. All of these numbers are integers between 0 and 255.

It will be more convenient to work with a matrix a of shape (250000,3); use the reshape() method to convert A to this form. We then want to take the average of each column in a, giving a vector of length three containing the red, green and blue values for the average colour of the image. If we just do a.mean(), we will get a single number which is the average of all numbers in a, which is not what we want. Work out how to modify this to give the required result. Also, round the resulting numbers to the nearest integer, and use the astype() method to ensure that the result has integer data type.

Now make a loop to apply all the above steps to each of the 48 images, giving an array average_colour of shape (48, 3). Then enter


    plt.scatter(*np.divmod(range(48),8),color=list(map(tuple,average_colours/255)))

This will give an array of dots, showing the average colours of all the images. You will see that the result is not very satisfactory: the average includes all the white border around each image so the resulting colour is very pale. To fix this, go back to your loop and insert the line a = a[a.min(axis=1) < 240] (which will discard the white pixels) before taking the average, then regenerate the scatter plot.

Now enter the following lines to divide the images into six groups with similar colours:


    kmeans = KMeans(n_clusters=6)
    kmeans.fit(average_colours)
    kmeans.labels_

This will display an array kmeans.labels_ of 48 integers, each in the set $\{0,\dotsc,5\}$. You can display all the images with label zero as follows:

  
    for i in range(48):
      if kmeans.labels_[i] == 1:
          display(Image.open(files[i]))

You can display the other groups in the same way, or (better) work out how to make a single picture showing thumbnails of all the images arranged in groups. You should find that the method used here is moderately successful: the groups are roughly right but the oranges are not separated from the carrots and there are various other errors as well.

MAS2008 Scientific Computing: Lab 8 Data analysis and curve fitting with Pandas