Whitby tides¶
In this notebook we analyse data about tide levels in 2023 at Whitby in North Yorkshire. This data comes from the British Oceanographic Data Centre; you have to register there for an account if you want to download data yourself.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.signal import correlate
from scipy.fft import fft
We start by importing the data into a pandas dataframe. Lines 0 to 10 contain header information, so we skip them. Typical lines after that are like this
28) 2023/01/01 06:45:00 2.282 0.202
29) 2023/01/01 07:00:00 2.390M 0.196M
In columns 0 to 7 we have a line number, which we ignore. In columns 8 to 26 we have a date and time (with 15 minutes between successive measurements). It will be convenient to record the time as a number of hours after the beginning of the dataset. We use pd.Timestamp() to convert each date and time into a pandas object, then timestamp() to convert it into a number of seconds since the beginning of 1970. (This is the most common way that computers represent times as numbers, often referred to as a "Unix timestamp" or "Unix epoch".) We then subtract the timestamp of the beginning of the year and divide by 3600 (the number of seconds in an hour) to get a number of hours.
Next, in columns 28 to 36 we have a number representing the tide level. This is sometimes followed by the letter 'M' in column 37 but I have not found an explanation of what that means so we ignore it. Finally, in columns 38 to 47 there is another number called the residual, which we incorporate in our dataframe but do not use.
data_dir = '../data/misc'
data = []
i = 0
t0 = pd.Timestamp('2023-01-01 00:00:00').timestamp()
with open(f'{data_dir}/whitby_tides.txt') as f:
lines = f.readlines()
for line in lines[11:]:
if not line.strip():
continue # Skip any empty lines
try:
time = (pd.Timestamp(line[8:27]).timestamp() - t0) / 3600
except:
# Skip any lines that do not have a valid time
continue
level = float(line[28:37])
residual = float(line[38:48])
if level <= -99:
# Missing is indicated by a tide level of -99; skip these lines
continue
data.append([time, level, residual])
df = pd.DataFrame(data, columns=['time', 'level', 'residual'])
df
| time | level | residual | |
|---|---|---|---|
| 0 | 0.00 | 5.036 | 0.365 |
| 1 | 0.25 | 4.919 | 0.357 |
| 2 | 0.50 | 4.810 | 0.373 |
| 3 | 0.75 | 4.662 | 0.364 |
| 4 | 1.00 | 4.506 | 0.357 |
| ... | ... | ... | ... |
| 34937 | 8758.75 | 3.069 | 0.434 |
| 34938 | 8759.00 | 2.880 | 0.439 |
| 34939 | 8759.25 | 2.719 | 0.461 |
| 34940 | 8759.50 | 2.547 | 0.459 |
| 34941 | 8759.75 | 2.404 | 0.468 |
34942 rows × 3 columns
We now plot the tide level against time. If we plotted the whole year then we would have 35000 points and the plot would not be very clear. We therefore write a function which instead plots a specified range of days.
def plot_days(start_day, end_day, values = None, ax = None):
start_time = start_day * 24
end_time = end_day * 24
if values is None:
values = df['level']
if ax is None:
fig, ax = plt.subplots(figsize=(15, 6))
mask = (df['time'] >= start_time) & (df['time'] <= end_time)
ax.plot(df['time'][mask], values[mask])
for d in range(start_day, end_day + 1):
ax.axvline(x=d * 24, color='r', linestyle='--')
ax.set_xlabel('Time (hours)')
ax.set_ylabel('Tide level')
plot_days(100,110)
The theory of tides predicts that the tide level should be well-approximated by a constant plus a sum of nine oscillations with periods as given below. For each period $\tau_i$ we have the corresponding angular frequency $\omega_i=2\pi/\tau_i$, and any sinusoidal oscillation with period $p_i$ can be written as $A_i\sin(\omega_it+\phi_i)$ for some amplitude $A_i$ and phase $\phi_i$. Thus, the kinds of functions we are looking for have the form
$$ p(t,a_0,\dotsc,a_{18}) = a_0 + a_1\cos(\omega_0t) + a_2\sin(\omega_0t) + \dotsb + a_{17}\cos(\omega_8t) + a_{18}\sin(\omega_8t). $$
To define this in Python, we start with def p(t, *a):. This has the effect that all the arguments $a_0,\dotsc,a_{18}$ get combined into a tuple called a. The slice a[1::2] then gives the tuple $a_1,a_3,\dotsc,a_{17}$, which is the tuple of cosine amplitudes. Similarly, the slice a[2::2] gives the tuple $a_2,a_4,\dotsc,a_{18}$ of sine amplitudes. For a single time $t$, we could then write a[1::2] * np.cos(omega * t) to get the array $(a_1\cos(\omega_0t),\dotsc,a_{17}\cos(\omega_8t))$, then we could take the sum of this together with $a_0$ and the corresponding sine terms to get the value of the function $p(t,a_0,\dotsc,a_{18})$. However, this does not work correctly if $t$ is an array of times rather than a single time. To handle that possibility, we replace t by np.expand_dims(t,-1). This ensures that a[1::2] * np.cos(omega * t) will be a matrix, with one row for each entry in the list of times, and we use sum(axis=-1) to take the sum of each row separately.
tau = [
12.4206012, # Principal lunar semidiurnal
12, # Principal solar semidiurnal
12.65834751, # Larger lunar elliptic semidiurnal
23.93447213, # Lunar diurnal
6.210300601, # Shallow water overtides of principal lunar
25.81933871, # Lunar diurnal
4.140200401, # Shallow water overtides of principal lunar
8.177140247, # Shallow water terdiurnal
6 # Shallow water overtides of principal solar
]
omega = 2*np.pi/np.array(tau)
a0 = np.zeros(1 + 2 * len(omega))
def p_short(t, *a):
return (a[0] +
(a[1::2] * np.cos(omega * np.expand_dims(t,-1))).sum(axis=-1) +
(a[2::2] * np.sin(omega * np.expand_dims(t,-1))).sum(axis=-1))
def p_long(t, a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18):
return (a0 +
a1 * np.cos(omega[0] * t) + a2 * np.sin(omega[0] * t) +
a3 * np.cos(omega[1] * t) + a4 * np.sin(omega[1] * t) +
a5 * np.cos(omega[2] * t) + a6 * np.sin(omega[2] * t) +
a7 * np.cos(omega[3] * t) + a8 * np.sin(omega[3] * t) +
a9 * np.cos(omega[4] * t) + a10 * np.sin(omega[4] * t) +
a11 * np.cos(omega[5] * t) + a12 * np.sin(omega[5] * t) +
a13 * np.cos(omega[6] * t) + a14 * np.sin(omega[6] * t) +
a15 * np.cos(omega[7] * t) + a16 * np.sin(omega[7] * t) +
a17 * np.cos(omega[8] * t) + a18 * np.sin(omega[8] * t))
p = p_short
We now use the curve_fit() function from scipy.optimize to find values for $a_0,\dotsc,a_{18}$ which make the modelled tide levels $p(t_i,a_0,\dotsc,a_{18})$ as close as possible to the measured tide levels.
ts = df['time'].to_numpy()
ys = df['level'].to_numpy()
fit = curve_fit(p, ts, ys, p0 = a0, xtol=1e-11, maxfev=10000, full_output=True)
aa = fit[0]
print(f"Optimal constant: {aa[0]}")
print(f"Optimal amplitudes: {aa[1::2]}")
print(f"Optimal phases: {aa[2::2]}")
zs = p(ts, *aa)
Optimal constant: 3.5421270902565114 Optimal amplitudes: [ 1.19256096e+00 -4.73221830e-01 2.97558451e-01 -6.00008931e-02 -2.15307734e-02 1.02415961e-01 1.30216449e-03 -1.12941008e-03 -5.97073492e-03] Optimal phases: [-1.07817045 0.30676206 0.10903037 -0.12935731 0.01330247 -0.1286847 -0.01046758 0.01039158 0.0015952 ]
We now plot the modelled and measured levels for a range of times:
N0 = 8000
N1 = 12000
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(ts[N0:N1], ys[N0:N1], color='blue')
ax.plot(ts[N0:N1], zs[N0:N1], color='red')
[<matplotlib.lines.Line2D at 0x2a5119b9810>]
Instead of the function p defined above, it would be essentially equivalent to use the function
$$ p(t,a_0,\dotsc,a_{18}) = a_0 + a_1\sin(\omega_0t+a_2) + a_3\sin(\omega_1t+a_4) + \dotsb + a_{17}\sin(\omega_8t+a_{18}). $$
A Python version of this is given below.
def p_phase(t, *a):
c = a[0]
amplitude = a[1::2]
phase = a[2::2]
return (amplitude * np.sin(omega * np.expand_dims(t,-1) + phase)).sum(axis=-1) + c