MAS2008 Scientific Computing: Lab 8
Data analysis and curve fitting with Pandas


The following notebooks and videos are relevant for this lab:
Favourite colours
Mauna Loa carbon dioxide concentration
Peak District hills

To try the above notebooks yourself, you will need to download the following data files:
favourite_colours.csv co2_daily_mlo.txt peak_district_hills.csv

There is also an interactive demonstration explaining pandas selection methods.

Task 1: The Periodic Table

Download the file elements.csv and load it into a pandas DataFrame. There are online test questions corresponding to some of the subtasks below.

Task 2: Whitby tides

Download the file whitby_tides.txt, which contains a record of the tide levels at Whitby (in North Yorkshire) every 15 minutes in 2023. We will want to convert this dataset into a pandas DataFrame, but this will require some auxiliary steps. If the above file has been saved in the same folder as this notebook, we can read all the lines in the file as follows:

    with open('whitby_tides.txt') as f:
        lines = f.readlines()
The first few lines of the file are headers, which you should ignore. The remaining lines are like this:

    10) 2023/01/01 02:15:00     3.681      0.391  
    11) 2023/01/01 02:30:00     3.498      0.388  
    12) 2023/01/01 02:45:00     3.323M     0.389M 
    13) 2023/01/01 03:00:00     3.154M     0.390M 
For each line, you should extract the time and the tide level. You can either use line.split() or slices like line[10:25] to get parts of the string. You should represent the time as the number of hours since the start of the year: given a suitable string s, the function pd.Timestamp(s).timestamp() will give you the number of seconds since the start of 1970, and you can work from there. The tide level is the first number after the time. Some of the tide levels are followed by an M, which you should ignore. If the tide level is given as -99, that means that the data is missing, so you should skip that line; you should also skip any lines that are empty. It will probably be simplest to start by creating a list, but when the list is complete you should convert it into a pandas DataFrame with columns time and level.
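As a minimal sketch of these parsing steps (assuming the whitespace-separated layout shown above, that the records all belong to 2023, and that missing values start with -99):

```python
import pandas as pd

def parse_tide_lines(lines):
    # Build a list of (time, level) pairs, then convert to a DataFrame.
    t0 = pd.Timestamp('2023/01/01 00:00:00').timestamp()
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip empty or malformed lines
        # parts[0] is the record number; parts[1] and parts[2] are the date and time
        t = pd.Timestamp(parts[1] + ' ' + parts[2]).timestamp()
        hours = (t - t0) / 3600  # hours since the start of the year
        level = parts[3].rstrip('M')  # drop the trailing M flag if present
        if level.startswith('-99'):
            continue  # missing data
        records.append((hours, float(level)))
    return pd.DataFrame(records, columns=['time', 'level'])
```

The helper name parse_tide_lines is my own; adapt the details once you have inspected the file yourself.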

Now plot the tide levels against time. If you do this for the whole dataset then the plot will be very crowded, so you should experiment with plotting a few days or weeks at a time. The time will be given in hours, but you could add vertical dotted lines every 24 hours to make the daily variation easier to see.
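One possible way to organise this (a sketch, assuming a DataFrame df with columns time and level as above; the function name is my own choice):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; you can drop this line in a notebook
import matplotlib.pyplot as plt

def plot_days(df, start, stop):
    # Plot tide levels between hour `start` and hour `stop`, with a dotted
    # vertical line every 24 hours to mark the days.
    window = df[(df.time >= start) & (df.time < stop)]
    plt.plot(window.time, window.level)
    first_day = 24 * np.ceil(start / 24)
    for day in np.arange(first_day, stop, 24):
        plt.axvline(day, linestyle=':', color='grey')
    plt.xlabel('hours since the start of the year')
    plt.ylabel('tide level (m)')
```

For example, plot_days(df, 0, 168) shows the first week.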

Next, we will try to fit a curve to the data. Start by copying the following code:

    tau = [
     12.4206012,  # Principal lunar semidiurnal
     12,          # Principal solar semidiurnal
     12.65834751, # Larger lunar elliptic semidiurnal
     23.93447213, # Lunar diurnal
     6.210300601, # Shallow water overtides of principal lunar
     25.81933871, # Lunar diurnal
     4.140200401, # Shallow water overtides of principal lunar
     8.177140247, # Shallow water terdiurnal
     6,           # Shallow water overtides of principal solar
    ]

    omega = 2*np.pi/np.array(tau)
This defines an array $(\omega_0,\dotsc,\omega_8)$. Theory (which you can read about on Wikipedia) predicts that the tide level at time $t$ should be given by $$ p(t, a_0, \dotsc, a_{18}) = a_0 + a_1\cos(\omega_0 t) + a_2\sin(\omega_0 t) + \dotsb + a_{17}\cos(\omega_8 t) + a_{18}\sin(\omega_8 t). $$ Enter this as a Python definition. One approach is to just write out the twenty function arguments and nineteen terms in the sum explicitly. Alternatively, you can start the definition with def p(t,*a):, then a will be a tuple and you can write a loop or use numpy array methods.
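For example, using the def p(t,*a): approach, one possible definition is:

```python
import numpy as np

tau = [12.4206012, 12, 12.65834751, 23.93447213, 6.210300601,
       25.81933871, 4.140200401, 8.177140247, 6]
omega = 2*np.pi/np.array(tau)

def p(t, *a):
    # a is a tuple (a_0, ..., a_18): a constant term plus one cosine
    # coefficient and one sine coefficient for each of the nine frequencies.
    result = a[0]
    for k in range(9):
        result = result + a[2*k+1]*np.cos(omega[k]*t) + a[2*k+2]*np.sin(omega[k]*t)
    return result
```

Because the loop uses numpy functions, p works equally well when t is a single number or a whole array of times.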

Now use scipy.optimize.curve_fit to fit the curve to the data. You can see the Mauna Loa notebook (linked at the top of this page) for an example of how to do this. Plot the actual tide level and the modelled tide level on the same graph, for $t$ from 1000 to 2000 hours.
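The fitting step might look like this (a sketch with a hypothetical helper name; it assumes p is defined as above and that t and level are the arrays from your DataFrame):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_tide_model(p, t, level):
    # curve_fit cannot deduce the number of parameters from p(t, *a),
    # so we must supply a starting guess p0 with the right length (19).
    a0 = np.zeros(19)
    a0[0] = level.mean()  # start the constant term at the mean level
    a, _ = curve_fit(p, t, level, p0=a0)
    return a
```

Given the fitted coefficients a, you can then plot p(t, *a) alongside the observed levels for the window from 1000 to 2000 hours.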

Task 3: Spurious correlations

Consider the following data (obtained from Tyler Vigen's website):
Create a pandas DataFrame df from this data. The years should give the index, and the column names should be 'Psychology', 'Groundkeepers', 'Georgia' and 'Somalia'. Then use df.corr() to calculate the correlations between the columns. You should find that the correlations are all quite high, suggesting that the time variation of the columns is similar.

To visualise this, make a scatter plot of the psychology column against the groundkeepers column. You should see that the points lie very close to a straight line. For other pairs of columns the relationship is still strong but not as clear.

Now use df.plot() to plot all the columns against time. The plot will not be very informative, because the scales are so different: the number of bachelor's degrees is in the hundreds of thousands, while the energy usage numbers are all less than one. To fix this, we can normalise each column by subtracting the mean and dividing by the standard deviation (which can be calculated by the mean() and std() methods). Create a new DataFrame dfn with the same headers but with normalised columns. Now dfn.plot() should show four graphs that are fairly similar.
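The normalisation can be done for all columns at once, because pandas broadcasts arithmetic column by column. A toy example (with made-up numbers, not the real dataset):

```python
import pandas as pd

# Two columns on very different scales.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [100.0, 300.0, 500.0]})

# Subtract each column's mean and divide by its standard deviation.
dfn = (df - df.mean()) / df.std()
```

After this, every column of dfn has mean 0 and standard deviation 1, so the curves can sensibly share one axis.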

Task 4: Fruit and vegetables

Download and unzip the file, which contains 48 images of fruit and vegetables. There are six different types (apples, bananas, carrots, oranges, plums and green peppers) with eight images of each type. We will try to classify the images by type, just using the overall colour. We will need the following imports:

    from PIL import Image, ImageDraw
    import pandas as pd
    from pathlib import Path
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
Now enter files = list(Path('fruit').glob('*.jpg')) to get a list of all the image files. (This assumes that your notebook is in a certain folder, and that the images are in a subfolder of that folder called fruit.) Now enter img = Image.open(files[0]) and display(img) to load and display the first image. Then enter A = np.asarray(img) to get an array of shape (500,500,3) in which A[i,j,0] is the amount of red at position (i,j), and A[i,j,1] is the amount of green, and A[i,j,2] is the amount of blue. All of these numbers are integers between 0 and 255.

It will be more convenient to work with a matrix a of shape (250000,3); use the reshape() method to convert A to this form. We then want to take the average of each column in a, giving a vector of length three containing the red, green and blue values for the average colour of the image. If we just do a.mean(), we will get a single number which is the average of all numbers in a, which is not what we want. Work out how to modify this to give the required result. Also, round the resulting numbers to the nearest integer, and use the astype() method to ensure that the result has integer data type.
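For instance, on a small made-up array with four pixels and three channels:

```python
import numpy as np

# A toy stand-in for the reshaped pixel array a.
a = np.array([[10, 20, 30],
              [20, 40, 60],
              [30, 60, 90],
              [20, 40, 60]])

# axis=0 averages down each column, giving one number per colour channel;
# round() and astype(int) then give an integer RGB triple.
avg = a.mean(axis=0).round().astype(int)
print(avg)  # [20 40 60]
```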

Now make a loop to apply all the above steps to each of the 48 images, giving an array average_colour of shape (48, 3). Then make a scatter plot with one dot per image, passing average_colour (scaled into the range from 0 to 1) as the colour argument.

This will give an array of dots, showing the average colours of all the images. You will see that the result is not very satisfactory: the average includes all the white border around each image, so the resulting colour is very pale. To fix this, go back to your loop and insert the line a = a[a.min(axis=1) < 240] (which will discard the nearly white pixels) before taking the average, then regenerate the scatter plot.
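One possible way to draw the dots (a sketch; the grid layout and function name are my own choices, since the exact code is not specified above):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; you can drop this line in a notebook
import matplotlib.pyplot as plt

def colour_dots(average_colour):
    # Draw one dot per image in an 8-wide grid, coloured by its average colour.
    # Matplotlib expects colours as floats in [0, 1], so divide by 255.
    n = len(average_colour)
    x = np.arange(n) % 8
    y = np.arange(n) // 8
    plt.scatter(x, y, c=np.asarray(average_colour) / 255, s=200)
```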

Now enter the following lines to divide the images into six groups with similar colours:

    kmeans = KMeans(n_clusters=6)
    kmeans.fit(average_colour)
    print(kmeans.labels_)
This will display an array kmeans.labels_ of 48 integers, each in the set $\{0,\dotsc,5\}$. You can display all the images with label zero as follows:
    for i in range(48):
        if kmeans.labels_[i] == 0:
            display(Image.open(files[i]))
You can display the other groups in the same way, or (better) work out how to make a single picture showing thumbnails of all the images arranged in groups. You should find that the method used here is moderately successful: the groups are roughly right but the oranges are not separated from the carrots and there are various other errors as well.
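For the single-picture suggestion, here is one possible sketch (the helper name and layout are my own choices, not part of the task): it pastes a thumbnail of each image into one sheet, with one row per cluster.

```python
import numpy as np
from PIL import Image

def group_sheet(files, labels, thumb=100, n_clusters=6):
    # One row per cluster, wide enough for the largest group.
    counts = np.bincount(labels, minlength=n_clusters)
    sheet = Image.new('RGB', (thumb * counts.max(), thumb * n_clusters), 'white')
    position = [0] * n_clusters  # next free slot in each row
    for f, lab in zip(files, labels):
        img = Image.open(f).resize((thumb, thumb))
        sheet.paste(img, (thumb * position[lab], thumb * lab))
        position[lab] += 1
    return sheet
```

Calling display(group_sheet(files, kmeans.labels_)) should then show all 48 thumbnails arranged by group.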