Peak District hills

In [1]:
import numpy as np
import pandas as pd
import geopy.distance

We use pandas to import a CSV file containing details of hills in the Peak District.

(This is a subset of the much larger database of British and Irish hills which you can find at https://www.hills-database.co.uk/.)

In [2]:
data_dir = '../data/misc'
df = pd.read_csv(data_dir + '/peak_district_hills.csv')

Here are several ways to get a general picture of what is in the data frame.

In [9]:
display(df)
display(df.describe())
display(df.info())
print(f"Columns: {list(df.columns)}")
print(f"Shape: {df.shape}")
     Name                 Height  Drop   Latitude   Longitude  County
0    Kinder Scout         636.3   496.6  53.384808  -1.873910  Derbyshire
1    Bleaklow Head        633.0   128.0  53.461267  -1.858959  Derbyshire
2    Higher Shelf Stones  621.8   19.8   53.449816  -1.867610  Derbyshire
3    Black Hill           582.0   165.0  53.538013  -1.885306  Derbyshire/Kirklees
4    Shining Tor          559.0   235.7  53.260729  -2.009472  Cheshire East/Derbyshire
...  ...                  ...     ...    ...        ...        ...
108  Snailsden Pike End   477.2   28.5   53.527289  -1.807632  Barnsley
109  Sough Top            438.5   16.3   53.234919  -1.802053  Derbyshire
110  Stanedge Pole        444.1   17.1   53.356177  -1.630601  Derbyshire
111  Alphin Pike          469.6   0.9    53.522014  -1.996965  Tameside
112  Cheeks Hill          521.7   5.1    53.226655  -1.961680  Derbyshire

113 rows × 6 columns

       Height      Drop        Latitude    Longitude
count  113.000000  113.000000  113.000000  113.000000
mean   446.914159  63.879646   53.298625   -1.843522
std    84.105091   59.688096   0.131592    0.122374
min    235.000000  0.000000    53.047864   -2.144702
25%    380.000000  30.600000   53.203591   -1.945962
50%    450.000000  51.000000   53.291318   -1.841480
75%    514.000000  85.000000   53.380700   -1.751426
max    636.300000  496.600000  53.586204   -1.562219
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       113 non-null    object 
 1   Height     113 non-null    float64
 2   Drop       113 non-null    float64
 3   Latitude   113 non-null    float64
 4   Longitude  113 non-null    float64
 5   County     113 non-null    object 
dtypes: float64(4), object(2)
memory usage: 5.4+ KB
None
Columns: ['Name', 'Height', 'Drop', 'Latitude', 'Longitude', 'County']
Shape: (113, 6)

The plot below shows the heights of the hills in ascending order. We use df['Height'] to get the series of heights, and then sort_values() to sort them. However, every entry in the sorted series is still tagged with an index number giving its position in the original dataset. If we leave those index numbers in place, then the plot() method will use them as horizontal coordinates, so the heights will not appear in ascending order from left to right. We therefore use reset_index(drop=True) to replace the old index numbers with 0, 1, 2, ... before plotting.

In [4]:
df['Height'].sort_values().reset_index(drop=True).plot()
Out[4]:
<Axes: >
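
As a minimal sketch of what reset_index(drop=True) does, here is the same chain applied to a small made-up series (the three values are illustrative assumptions, not taken from the hills data):

```python
import pandas as pd

# A small illustrative series (made-up values, not from the hills data)
heights = pd.Series([500.0, 300.0, 400.0])

sorted_heights = heights.sort_values()
# The values are now ascending, but each keeps its original index number
print(sorted_heights.index.tolist())   # [1, 2, 0]

renumbered = sorted_heights.reset_index(drop=True)
# The old index numbers are discarded and replaced by 0, 1, 2
print(renumbered.index.tolist())       # [0, 1, 2]
print(renumbered.tolist())             # [300.0, 400.0, 500.0]
```

Without drop=True, reset_index() would keep the old index numbers as an extra column rather than discarding them.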

We next draw a histogram of the heights of the hills.

In [5]:
df['Height'].hist(bins=np.arange(200,700,50), legend=True)
Out[5]:
<Axes: >

Here we display a more complex histogram showing both the height and the drop. (The drop is the amount of height that you lose when descending from one hill to the highest saddle that separates it from an adjacent hill.)

In [6]:
df[['Height','Drop']].plot.hist()
Out[6]:
<Axes: ylabel='Frequency'>

Here is a list of all the county names that appear in the County column. We first use df.County.unique() (or alternatively df['County'].unique()) to get a numpy array containing all the values in the County column, without repetitions. This contains many strings like "Derbyshire/Kirklees". We use s.split('/') to split each such string at the '/' character. This will produce some more repetitions, but we do this inside a comprehension {w for s in ...} with curly brackets, which produces a set rather than a list and so removes the repetitions. We then wrap this in sorted(list(...)) so that the end result is a sorted list.

In [7]:
county_words = sorted(list({w for s in df.County.unique() for w in s.split('/')}))
county_words
Out[7]:
['Barnsley',
 'Cheshire East',
 'Derbyshire',
 'Kirklees',
 'Oldham',
 'Sheffield',
 'Staffordshire',
 'Stockport',
 'Tameside']
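
The difference between the list and set comprehensions can be seen on a small example. This sketch uses three county strings that appear in the table above; the variable names are chosen just for illustration.

```python
# Three county strings of the kind found in the County column
counties = ["Derbyshire", "Derbyshire/Kirklees", "Cheshire East/Derbyshire"]

# Square brackets give a list, which keeps the repetitions
as_list = [w for s in counties for w in s.split('/')]
print(as_list)
# ['Derbyshire', 'Derbyshire', 'Kirklees', 'Cheshire East', 'Derbyshire']

# Curly brackets give a set, so each name is kept only once
as_set = {w for s in counties for w in s.split('/')}
print(sorted(as_set))
# ['Cheshire East', 'Derbyshire', 'Kirklees']
```

Note that sorted() accepts any iterable and always returns a list, so the result is the same with or without the inner list(...) call.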

We now add a new column to the data frame containing the distances (in km) from the hills to the Hicks Building. Note that positions are recorded in the data frame as latitude and longitude, and it is quite complicated to calculate distances from latitudes and longitudes, especially if you want to be accurate and take account of the fact that the earth is not precisely spherical. Fortunately the function geopy.distance.distance() will do the work for us.

In [30]:
hicks_position = 53.38085, -1.48636
def hicks_distance(row):
    return geopy.distance.distance(hicks_position, (row["Latitude"], row["Longitude"])).km
df["Distance"] = df.apply(hicks_distance, axis=1)
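
To give a feel for what such a calculation involves, here is a minimal sketch of the simpler haversine formula, which treats the earth as a perfect sphere. This is not the method used by geopy.distance.distance(), which works with an ellipsoidal model. The function name haversine_km and the mean radius of 6371.0088 km are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean radius of a spherical earth

def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

hicks_position = 53.38085, -1.48636
kinder_scout = 53.384808, -1.873910   # coordinates of Kinder Scout from the data frame
print(haversine_km(hicks_position, kinder_scout))
```

Because of the spherical approximation, the result should be close to the ellipsoidal distance for the same two points but not identical to it.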

We now want to tidy up the names of the hills. Some of them are like "Howden Edge [High Stones]" with extra text in square or round brackets, or after a dash. We remove all such extra text, and then remove any spaces from the beginning and end. Note that when applying string methods to the Name column, we have to call it df["Name"].str and not just df["Name"].

In [33]:
df["Name"] = df["Name"].str.replace(r"\([A-Za-z' ]*\)","",regex=True)
df["Name"] = df["Name"].str.replace(r"\[[A-Za-z' ]*\]","",regex=True)
df["Name"] = df["Name"].str.replace(r"- [A-Za-z' ]*","",regex=True)
df["Name"] = df["Name"].str.strip()
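
The same substitutions can be tried on individual strings with the standard re module. In this sketch tidy_name is a hypothetical helper, and the second and third sample names are made up to exercise the round bracket and dash patterns.

```python
import re

def tidy_name(name):
    # The same three patterns as above, applied to a single string
    name = re.sub(r"\([A-Za-z' ]*\)", "", name)   # text in round brackets
    name = re.sub(r"\[[A-Za-z' ]*\]", "", name)   # text in square brackets
    name = re.sub(r"- [A-Za-z' ]*", "", name)     # a dash and the text after it
    return name.strip()                           # remove leftover spaces at the ends

print(tidy_name("Howden Edge [High Stones]"))   # Howden Edge
print(tidy_name("Example Hill (Old Name)"))     # Example Hill  (made-up name)
print(tidy_name("Example Moor - North Top"))    # Example Moor  (made-up name)
```

The character class [A-Za-z' ] only matches letters, apostrophes and spaces, so bracketed text containing digits or other punctuation would be left in place.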

We next add a new column containing the type of each feature, which is just the last word in the name. Very often this is "Hill", but sometimes it is "Tor" or "Moor" or something else. By default the new column will appear as the last column in the dataframe, but the last line below rearranges the columns so that the type is the second column, directly after the name.

In [35]:
def last_word(s):
    return s.split(' ')[-1]

df["Type"] = df["Name"].map(last_word)
df = df[["Name","Type","Height","Drop","Latitude","Longitude","Distance","County"]]

Here is the full list of types. We have added some extra code to lay it out with ten words per line, so that we do not need to scroll horizontally or vertically.

In [37]:
types = list(df["Type"].unique())
print("\n".join([' '.join(types[10*i:10*(i+1)]) for i in range(len(types)//10+1)]))
Scout Head Stones Hill Tor Gun Cloud Moor Knoll Edge
Seat Ridge Top Naze Shutlingsloe Roaches Nab Neb Churn Pike
Bolehill Rods Moss Low End Cop Cliff Famine Bank Wheeldon
Revidge Rocks Folly Lad Pole
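
An alternative way to split a list into lines of ten is to step through it with range(0, len(...), 10). The sketch below uses a made-up list of 25 placeholder words instead of the real list of types.

```python
# Made-up list of 25 placeholder words standing in for the list of types
words = [f"w{i}" for i in range(25)]

# Step through the list ten items at a time; each slice becomes one line
lines = [' '.join(words[i:i + 10]) for i in range(0, len(words), 10)]
print("\n".join(lines))
```

This prints three lines containing ten, ten and five words. Unlike range(len(types)//10+1), this form does not produce an empty final line when the length of the list is an exact multiple of ten.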

We now make a new data frame of types, together with the number of features of each type and their average height. The groupby method produces an object of class DataFrameGroupBy, which behaves like a collection of dataframes, each containing all the hills of one type. The agg() method then aggregates information from each of these smaller dataframes, producing a new dataframe with one row for each type. The argument count=('Name', 'size') means that the new dataframe should have a column called count, and the entries in that column should be the sizes of the corresponding groups of names. For example, there are nine hills of type "Moor", so the new dataset will have a row with Type="Moor" and count=9. Similarly, the argument height=('Height', 'mean') indicates that the new dataframe should have a column called height, and the entries in that column should be the average height of the hills of the relevant type.

There are various types for which there are only one or two hills. The code dft[dft["count"] > 2] gives a smaller dataframe in which those types are excluded, and we print the remaining types in descending order of their average heights.

In [41]:
dft = df.groupby("Type") \
        .agg(count=('Name', 'size'), height=('Height', 'mean')) \
        .sort_values(by='height', ascending=False) \
        .reset_index()

dft[dft["count"] > 2]
Out[41]:
    Type   count  height
5   Head   4      543.000000
8   Tor    5      514.160000
12  Moss   3      498.300000
16  Edge   12     469.825000
20  Low    8      431.375000
21  Hill   33     427.248485
25  Pike   3      404.200000
26  Moor   9      403.622222
30  Cloud  3      346.666667
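
The named aggregation syntax is easier to follow on a tiny example. This sketch uses a made-up data frame of five features (the names and heights are illustrative assumptions) with the same column names as the real one.

```python
import pandas as pd

# Made-up miniature data frame with the same column names as above
toy = pd.DataFrame({
    "Name":   ["A Hill", "B Hill", "C Tor", "D Tor", "E Moor"],
    "Type":   ["Hill", "Hill", "Tor", "Tor", "Moor"],
    "Height": [400.0, 500.0, 550.0, 450.0, 420.0],
})

summary = toy.groupby("Type").agg(
    count=("Name", "size"),      # number of rows in each group
    height=("Height", "mean"),   # mean of the Height column in each group
)
print(summary)
```

Here summary is indexed by Type, with count 2 and mean height 450.0 for "Hill", count 1 and 420.0 for "Moor", and count 2 and 500.0 for "Tor". Calling reset_index() afterwards, as in the cell above, turns Type back into an ordinary column.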

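The next three lines give almost the same result. Both df[0:5] and df.iloc[0:5] give the first five rows, whereas df.loc[0:5] gives six rows, because a slice by label with df.loc[] includes its end point. The results are otherwise the same because the index column at the left hand end of the dataframe just contains the numbers 0 to 112 in order, so labels and row numbers coincide. Sometimes we will deal with dataframes in which the index column is different, containing names or labels of some kind. In that context, we would use df.loc[] with labels and df.iloc[] with row numbers.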

In [14]:
display(df[0:5])
display(df.loc[0:5])
display(df.iloc[0:5])
   Name                 Type    Height  Drop   Latitude   Longitude  Distance   County
0  Kinder Scout         Scout   636.3   496.6  53.384808  -1.873910  25.792033  Derbyshire
1  Bleaklow Head        Head    633.0   128.0  53.461267  -1.858959  26.338430  Derbyshire
2  Higher Shelf Stones  Stones  621.8   19.8   53.449816  -1.867610  26.486259  Derbyshire
3  Black Hill           Hill    582.0   165.0  53.538013  -1.885306  31.751218  Derbyshire/Kirklees
4  Shining Tor          Tor     559.0   235.7  53.260729  -2.009472  37.334829  Cheshire East/Derbyshire

   Name                 Type    Height  Drop   Latitude   Longitude  Distance   County
0  Kinder Scout         Scout   636.3   496.6  53.384808  -1.873910  25.792033  Derbyshire
1  Bleaklow Head        Head    633.0   128.0  53.461267  -1.858959  26.338430  Derbyshire
2  Higher Shelf Stones  Stones  621.8   19.8   53.449816  -1.867610  26.486259  Derbyshire
3  Black Hill           Hill    582.0   165.0  53.538013  -1.885306  31.751218  Derbyshire/Kirklees
4  Shining Tor          Tor     559.0   235.7  53.260729  -2.009472  37.334829  Cheshire East/Derbyshire
5  Win Hill             Hill    463.4   144.7  53.362373  -1.720964  15.749905  Derbyshire

   Name                 Type    Height  Drop   Latitude   Longitude  Distance   County
0  Kinder Scout         Scout   636.3   496.6  53.384808  -1.873910  25.792033  Derbyshire
1  Bleaklow Head        Head    633.0   128.0  53.461267  -1.858959  26.338430  Derbyshire
2  Higher Shelf Stones  Stones  621.8   19.8   53.449816  -1.867610  26.486259  Derbyshire
3  Black Hill           Hill    582.0   165.0  53.538013  -1.885306  31.751218  Derbyshire/Kirklees
4  Shining Tor          Tor     559.0   235.7  53.260729  -2.009472  37.334829  Cheshire East/Derbyshire
In [15]:
# Some of the columns from a single row
display(df.loc[3,['Latitude','Longitude']])
Latitude     53.538013
Longitude    -1.885306
Name: 3, dtype: object
In [16]:
# All the columns from the top row
df.loc[0]
Out[16]:
Name         Kinder Scout
Type                Scout
Height              636.3
Drop                496.6
Latitude        53.384808
Longitude        -1.87391
Distance        25.792033
County         Derbyshire
Name: 0, dtype: object
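
The difference between loc and iloc becomes visible once the index holds labels rather than the default row numbers. This is a minimal sketch using the heights of three hills from the table above, re-indexed by name; the variable small is introduced just for illustration.

```python
import pandas as pd

# Heights of three hills from the data frame above, indexed by name
small = pd.DataFrame(
    {"Height": [636.3, 633.0, 621.8]},
    index=["Kinder Scout", "Bleaklow Head", "Higher Shelf Stones"],
)

# loc selects by label, and a label slice includes its end point
by_label = small.loc["Kinder Scout":"Bleaklow Head"]
print(by_label)        # two rows

# iloc selects by position, and a position slice excludes its end point
by_position = small.iloc[0:1]
print(by_position)     # one row
```

With this index, small.loc[0] would raise a KeyError because 0 is not one of the labels, whereas small.iloc[0] still returns the first row.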

When referring to a single entry in the dataframe, it is more efficient to use df.at or df.iat instead of df.loc or df.iloc.

In [17]:
print(df.at[3,"Latitude"])
print(df.iat[3,4])
53.538013
53.538013

Count the number of hills where we have to descend by more than 100m before reaching a saddle.

In [18]:
len(df[df["Drop"] > 100])
Out[18]:
20
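
The expression df["Drop"] > 100 is itself a series of True/False values, which is what makes this kind of filtering work. The sketch below illustrates this using the Drop values of the first five hills in the table above.

```python
import pandas as pd

# Drop values of the first five hills listed above
drop = pd.Series([496.6, 128.0, 19.8, 165.0, 235.7])

mask = drop > 100          # boolean series: True where the condition holds
print(mask.tolist())       # [True, True, False, True, True]
print(len(drop[mask]))     # 4, counting the rows that survive the filter
print(int(mask.sum()))     # 4, since each True counts as 1
```

Summing the mask gives the same count as len() without building the filtered series.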

Show the list of hills where the county name contains the string "Sheffield".

In [19]:
df[df["County"].str.contains("Sheffield")]
Out[19]:
     Name              Type    Height  Drop   Latitude   Longitude  Distance   County
11   Howden Edge       Edge    550.0   62.0   53.445603  -1.718567  17.039469  Sheffield
18   Back Tor          Tor     538.0   67.0   53.415411  -1.704127  14.987408  Derbyshire/Sheffield
19   Horse Stone Naze  Naze    527.0   32.0   53.474428  -1.763184  21.143873  Sheffield
28   High Neb          Neb     458.0   107.0  53.364684  -1.659110  11.637730  Derbyshire/Sheffield
39   Hoar Stones       Stones  514.0   7.0    53.481348  -1.762573  21.497691  Barnsley/Sheffield
44   Outer Edge        Edge    541.0   23.0   53.469227  -1.734726  19.218125  Sheffield
97   Margery Hill      Hill    546.0   19.0   53.457724  -1.716664  17.539911  Sheffield
106  Higger Tor        Tor     434.8   22.0   53.333791  -1.615132  10.046918  Sheffield
107  Lost Lad          Lad     519.0   10.0   53.417404  -1.710523  15.455555  Derbyshire/Sheffield

Plot the positions of the hills. The sizes of the discs are determined by the heights of the hills, and the colours are determined by the drop.

In [20]:
df.assign(s=lambda x: 0.2*x.Height).plot.scatter(x='Longitude', y='Latitude', c='Drop', s='s', colormap='viridis')
Out[20]:
<Axes: xlabel='Longitude', ylabel='Latitude'>