Peak District hills

In [1]:

import numpy as np
import pandas as pd
import geopy.distance

We use pandas to import a CSV file containing details of hills in the Peak District.

(This is a subset of the much larger database of British and Irish hills which you can find at https://www.hills-database.co.uk/.)

In [2]:

data_dir = '../data/misc'
df = pd.read_csv(data_dir + '/peak_district_hills.csv')

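Here are several ways to get a general picture of what is in the data frame. Note that df.info() prints its summary directly and returns None, which is why wrapping it in display() produces a stray None after the summary in the output below.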

In [9]:

display(df)
display(df.describe())
display(df.info())
print(f"Columns: {list(df.columns)}")
print(f"Shape: {df.shape}")

	Name	Height	Drop	Latitude	Longitude	County
0	Kinder Scout	636.3	496.6	53.384808	-1.873910	Derbyshire
1	Bleaklow Head	633.0	128.0	53.461267	-1.858959	Derbyshire
2	Higher Shelf Stones	621.8	19.8	53.449816	-1.867610	Derbyshire
3	Black Hill	582.0	165.0	53.538013	-1.885306	Derbyshire/Kirklees
4	Shining Tor	559.0	235.7	53.260729	-2.009472	Cheshire East/Derbyshire
...	...	...	...	...	...	...
108	Snailsden Pike End	477.2	28.5	53.527289	-1.807632	Barnsley
109	Sough Top	438.5	16.3	53.234919	-1.802053	Derbyshire
110	Stanedge Pole	444.1	17.1	53.356177	-1.630601	Derbyshire
111	Alphin Pike	469.6	0.9	53.522014	-1.996965	Tameside
112	Cheeks Hill	521.7	5.1	53.226655	-1.961680	Derbyshire

113 rows × 6 columns

	Height	Drop	Latitude	Longitude
count	113.000000	113.000000	113.000000	113.000000
mean	446.914159	63.879646	53.298625	-1.843522
std	84.105091	59.688096	0.131592	0.122374
min	235.000000	0.000000	53.047864	-2.144702
25%	380.000000	30.600000	53.203591	-1.945962
50%	450.000000	51.000000	53.291318	-1.841480
75%	514.000000	85.000000	53.380700	-1.751426
max	636.300000	496.600000	53.586204	-1.562219

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Name       113 non-null    object 
 1   Height     113 non-null    float64
 2   Drop       113 non-null    float64
 3   Latitude   113 non-null    float64
 4   Longitude  113 non-null    float64
 5   County     113 non-null    object 
dtypes: float64(4), object(2)
memory usage: 5.4+ KB

None

Columns: ['Name', 'Height', 'Drop', 'Latitude', 'Longitude', 'County']
Shape: (113, 6)

The plot below shows the height of the various hills in ascending order. We use df['Height'] to get the series of heights, and then sort_values() to sort them in order. However, this still leaves us with a series in which every entry is tagged with an index number giving its position in the original dataset. If we leave that index in place, the plotting method uses it as the horizontal coordinate, so each height is drawn at its original position rather than in sorted order. We therefore use reset_index(drop=True) to replace the old index numbers with 0, 1, 2, ... before plotting.

In [4]:

df['Height'].sort_values().reset_index(drop=True).plot()

Out[4]:

<Axes: >
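The effect of sort_values() and reset_index(drop=True) on the index can be checked on a tiny series. This is only a sketch: the three heights are taken from the table above, and the names heights, sorted_heights and renumbered are ours.

```python
import pandas as pd

# Three heights from the table above, in their original row order.
heights = pd.Series([636.3, 582.0, 633.0], name="Height")

sorted_heights = heights.sort_values()
# The values are now ascending, but each one keeps its original index label.
print(list(sorted_heights.index))

renumbered = sorted_heights.reset_index(drop=True)
# drop=True discards the old labels and renumbers the entries from 0.
print(list(renumbered.index))
```

Without drop=True, reset_index() would keep the old labels as an extra column instead of discarding them.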


We next draw a histogram of the heights of the hills.

In [5]:

df['Height'].hist(bins=np.arange(200,700,50), legend=True)

Out[5]:

<Axes: >
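To see the numbers behind a histogram like this without drawing it, one option is np.histogram with the same bin edges. The sketch below uses only the ten heights visible in the table above, so the counts describe that small sample and not the full dataset.

```python
import numpy as np

# Heights of the first five and last five hills shown in the table above.
sample_heights = np.array([636.3, 633.0, 621.8, 582.0, 559.0,
                           477.2, 438.5, 444.1, 469.6, 521.7])

# np.arange excludes the stop value, so the edges run 200, 250, ..., 650.
edges = np.arange(200, 700, 50)
counts, _ = np.histogram(sample_heights, bins=edges)

# Print each bin as "lower-upper: count".
for lower, upper, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lower}-{upper}: {n}")
```

Since the last edge is 650, any height above 650 would fall outside every bin; here the maximum height is 636.3, so nothing is lost.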

Here we display a more complex histogram showing both the height and the drop. (The drop is the amount of height that you lose when descending from one hill to the highest saddle that separates it from an adjacent hill.)

In [6]:

df[['Height','Drop']].plot.hist()

Out[6]:

<Axes: ylabel='Frequency'>

Here is a list of all the county names that appear in the County column. We first use df.County.unique() (or alternatively df['County'].unique()) to get a numpy array containing all the values in the County column, without repetitions. This contains many strings like "Derbyshire/Kirklees". We use s.split('/') to split each such string at the '/' character. This will produce some more repetitions, but we do this inside a comprehension {w for s in ...} with curly brackets, which produces a set rather than a list and so removes the repetitions. We then wrap this in sorted(list(...)) so that the end result is a sorted list.

In [7]:

county_words = sorted(list({w for s in df.County.unique() for w in s.split('/')}))
county_words

Out[7]:

['Barnsley',
 'Cheshire East',
 'Derbyshire',
 'Kirklees',
 'Oldham',
 'Sheffield',
 'Staffordshire',
 'Stockport',
 'Tameside']
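The difference between the list and set forms of the comprehension is easy to see on a few values. This is a small sketch using four county strings from the table above; the variable names are ours.

```python
# A few values of the kind found in the County column.
counties = ["Derbyshire", "Derbyshire/Kirklees", "Cheshire East/Derbyshire", "Barnsley"]

# Square brackets give a list, which keeps the repeated "Derbyshire".
as_list = [w for s in counties for w in s.split("/")]

# Curly brackets give a set, which keeps each name only once.
as_set = {w for s in counties for w in s.split("/")}

print(as_list)
print(sorted(as_set))
```

Note that sorted() accepts any iterable and always returns a list, so the inner list(...) is optional.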

We now add a new column to the data frame containing the distances (in km) from the hills to the Hicks Building. Note that positions are recorded in the data frame as latitude and longitude, and it is quite complicated to calculate distances from latitudes and longitudes, especially if you want to be accurate and take account of the fact that the earth is not precisely spherical. Fortunately the function geopy.distance.distance() will do the work for us.

In [30]:

hicks_position = 53.38085, -1.48636
def hicks_distance(row):
    return geopy.distance.distance(hicks_position, (row["Latitude"], row["Longitude"])).km
df["Distance"] = df.apply(hicks_distance, axis=1)
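As a rough cross-check on these distances, we can use the haversine formula, which treats the earth as a perfect sphere. This is a different and cruder method than the geopy calculation above; the function haversine_km and the choice of mean radius are our own assumptions for illustration.

```python
import math

# Mean earth radius in km. Treating the earth as a sphere is the key assumption here.
EARTH_RADIUS_KM = 6371.0088

def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

hicks_position = (53.38085, -1.48636)
kinder_scout = (53.384808, -1.873910)   # from the first row of the table

approx = haversine_km(hicks_position, kinder_scout)
print(round(approx, 2))
```

The result should be close to, but not identical with, the value in the Distance column for Kinder Scout, because geopy works with an ellipsoidal model of the earth.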

We now want to tidy up the names of the hills. Some of them are like "Howden Edge [High Stones]" with extra text in square or round brackets, or after a dash. We remove all such extra text, and then remove any spaces from the beginning and end. Note that when applying string methods to the Name column, we have to call it df["Name"].str and not just df["Name"].

In [33]:

df["Name"] = df["Name"].str.replace(r"\([A-Za-z' ]*\)","",regex=True)
df["Name"] = df["Name"].str.replace(r"\[[A-Za-z' ]*\]","",regex=True)
df["Name"] = df["Name"].str.replace(r"- [A-Za-z' ]*","",regex=True)
df["Name"] = df["Name"].str.strip()
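The same four steps can be tried on a small series to see what each pattern removes. In this sketch the first name is the example quoted above; the other two names are made up purely to exercise the round-bracket and dash patterns.

```python
import pandas as pd

names = pd.Series([
    "Howden Edge [High Stones]",   # example quoted in the text
    "Example Hill (North Top)",    # hypothetical name
    "Example Moor - East Top",     # hypothetical name
])

# String methods on a Series live under the .str accessor.
cleaned = names.str.replace(r"\([A-Za-z' ]*\)", "", regex=True)   # text in round brackets
cleaned = cleaned.str.replace(r"\[[A-Za-z' ]*\]", "", regex=True)  # text in square brackets
cleaned = cleaned.str.replace(r"- [A-Za-z' ]*", "", regex=True)    # a dash and what follows
cleaned = cleaned.str.strip()                                      # leading/trailing spaces

print(list(cleaned))
```

The character class [A-Za-z' ] only matches letters, apostrophes and spaces, so bracketed text containing digits or other punctuation would be left untouched.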

We next add a new column containing the type of each feature, which is just the last word in the name. Very often this is "Hill", but sometimes it is "Tor" or "Moor" or something else. By default the new column will appear as the last column in the dataframe, but the final line of the cell below rearranges the columns so that the type is the second column, directly after the name.

In [35]:

def last_word(s):
    return s.split(' ')[-1]

df["Type"] = df["Name"].map(last_word)
df = df[["Name","Type","Height","Drop","Latitude","Longitude","Distance","County"]]
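Here is the same pattern on a three-row dataframe, as a sketch. The names and heights come from the table above; the variable sample is ours.

```python
import pandas as pd

def last_word(s):
    # For a one-word name the last word is the whole name.
    return s.split(' ')[-1]

sample = pd.DataFrame({
    "Name": ["Kinder Scout", "Higher Shelf Stones", "Black Hill"],
    "Height": [636.3, 621.8, 582.0],
})

# A newly assigned column is appended at the right hand end.
sample["Type"] = sample["Name"].map(last_word)
print(list(sample.columns))

# Indexing with a list of column names returns the columns in that order.
sample = sample[["Name", "Type", "Height"]]
print(list(sample.columns))
```

A name consisting of a single word, such as Shutlingsloe, is its own type under this rule.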

Here is the full list of types. We have added some extra code to lay it out with ten words per line, so that we do not need to scroll horizontally or vertically.

In [37]:

types = list(df["Type"].unique())
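# Step through the list ten items at a time; this avoids an empty final line
# when the number of types is an exact multiple of ten.
print("\n".join(' '.join(types[i:i+10]) for i in range(0, len(types), 10)))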

Scout Head Stones Hill Tor Gun Cloud Moor Knoll Edge
Seat Ridge Top Naze Shutlingsloe Roaches Nab Neb Churn Pike
Bolehill Rods Moss Low End Cop Cliff Famine Bank Wheeldon
Revidge Rocks Folly Lad Pole
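The layout step is just a matter of cutting a list into chunks of ten. A minimal sketch follows, with a helper lines_of that we have named ourselves and a list of placeholder words.

```python
def lines_of(words, per_line=10):
    """Join words into lines containing at most per_line words each."""
    return [" ".join(words[i:i + per_line]) for i in range(0, len(words), per_line)]

# Twenty placeholder words: w0, w1, ..., w19.
words = [f"w{k}" for k in range(20)]
lines = lines_of(words)
print("\n".join(lines))
```

Stepping the start index with range(0, len(words), per_line) produces exactly as many lines as are needed, including when the number of words is an exact multiple of per_line.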

We now make a new data frame of types, together with the number of features of each type and their average height. The groupby method produces an object of class DataFrameGroupBy (defined in pandas.core.groupby), which is essentially a list of dataframes, each containing all the hills of one type. The agg() method then aggregates information from each of these smaller dataframes, producing a new dataframe with one row for each type. The argument count=('Name', 'size') means that the new dataframe should have a column called count, and the entries in that column should be the sizes of the corresponding groups of names. For example, there are nine hills of type "Moor", so the new dataframe will have a row with Type="Moor" and count=9. Similarly, the argument height=('Height', 'mean') indicates that the new dataframe should have a column called height, and the entries in that column should be the average height of the hills of the relevant type.

There are various types for which there are only one or two hills. The code dft[dft["count"] > 2] gives a smaller dataframe in which those types are excluded, and we print the remaining types in descending order of their average heights.

In [41]:

dft = df.groupby("Type") \
        .agg(count=('Name', 'size'), height=('Height', 'mean')) \
        .sort_values(by='height', ascending=False) \
        .reset_index()

dft[dft["count"] > 2]

Out[41]:

	Type	count	height
5	Head	4	543.000000
8	Tor	5	514.160000
12	Moss	3	498.300000
16	Edge	12	469.825000
20	Low	8	431.375000
21	Hill	33	427.248485
25	Pike	3	404.200000
26	Moor	9	403.622222
30	Cloud	3	346.666667
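To make the named-aggregation syntax concrete, here is the same chain applied to five hills taken from the tables in this notebook. It is only a sketch on a tiny sample, so the averages differ from those for the full dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Black Hill", "Win Hill", "Shining Tor", "Back Tor", "Higger Tor"],
    "Type": ["Hill", "Hill", "Tor", "Tor", "Tor"],
    "Height": [582.0, 463.4, 559.0, 538.0, 434.8],
})

summary = (
    sample.groupby("Type")
    # Each keyword names an output column: (input column, aggregation).
    .agg(count=("Name", "size"), height=("Height", "mean"))
    .sort_values(by="height", ascending=False)
    # Turn the group labels back into an ordinary column called Type.
    .reset_index()
)
print(summary)

# Keep only the types with more than two hills.
print(summary[summary["count"] > 2])
```

After reset_index() the row numbers follow the sorted order; filtering afterwards keeps those numbers, so the index of a filtered table can have gaps.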

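The next three lines give the first rows of the dataframe. Plain slicing df[0:5] and df.iloc[0:5] work by position and exclude the endpoint, so they return rows 0 to 4. By contrast, df.loc[0:5] works by label and includes both endpoints, so it returns six rows, 0 to 5. Positions and labels coincide here only because the index column at the left hand end of the dataframe just contains the numbers 0 to 112 in order. Sometimes we will deal with dataframes in which the index column is different, containing names or labels of some kind. In that context, we would use df.loc[] with labels and df.iloc[] with row numbers.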

In [14]:

display(df[0:5])
display(df.loc[0:5])
display(df.iloc[0:5])

	Name	Type	Height	Drop	Latitude	Longitude	Distance	County
0	Kinder Scout	Scout	636.3	496.6	53.384808	-1.873910	25.792033	Derbyshire
1	Bleaklow Head	Head	633.0	128.0	53.461267	-1.858959	26.338430	Derbyshire
2	Higher Shelf Stones	Stones	621.8	19.8	53.449816	-1.867610	26.486259	Derbyshire
3	Black Hill	Hill	582.0	165.0	53.538013	-1.885306	31.751218	Derbyshire/Kirklees
4	Shining Tor	Tor	559.0	235.7	53.260729	-2.009472	37.334829	Cheshire East/Derbyshire

	Name	Type	Height	Drop	Latitude	Longitude	Distance	County
0	Kinder Scout	Scout	636.3	496.6	53.384808	-1.873910	25.792033	Derbyshire
1	Bleaklow Head	Head	633.0	128.0	53.461267	-1.858959	26.338430	Derbyshire
2	Higher Shelf Stones	Stones	621.8	19.8	53.449816	-1.867610	26.486259	Derbyshire
3	Black Hill	Hill	582.0	165.0	53.538013	-1.885306	31.751218	Derbyshire/Kirklees
4	Shining Tor	Tor	559.0	235.7	53.260729	-2.009472	37.334829	Cheshire East/Derbyshire
5	Win Hill	Hill	463.4	144.7	53.362373	-1.720964	15.749905	Derbyshire

	Name	Type	Height	Drop	Latitude	Longitude	Distance	County
0	Kinder Scout	Scout	636.3	496.6	53.384808	-1.873910	25.792033	Derbyshire
1	Bleaklow Head	Head	633.0	128.0	53.461267	-1.858959	26.338430	Derbyshire
2	Higher Shelf Stones	Stones	621.8	19.8	53.449816	-1.867610	26.486259	Derbyshire
3	Black Hill	Hill	582.0	165.0	53.538013	-1.885306	31.751218	Derbyshire/Kirklees
4	Shining Tor	Tor	559.0	235.7	53.260729	-2.009472	37.334829	Cheshire East/Derbyshire
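The distinction between labels and positions becomes clearer when the index is not just a run of integers. In the sketch below we index a four-row dataframe by hill name; the heights are from the table above and the variable names are ours.

```python
import pandas as pd

sample = pd.DataFrame(
    {"Height": [636.3, 633.0, 621.8, 582.0]},
    index=["Kinder Scout", "Bleaklow Head", "Higher Shelf Stones", "Black Hill"],
)

# .loc slices by label and includes both endpoints.
by_label = sample.loc["Kinder Scout":"Higher Shelf Stones"]

# .iloc slices by position and excludes the stop position, like a Python list.
by_position = sample.iloc[0:2]

print(len(by_label), len(by_position))
```

With this index, sample.loc[0:2] would not make sense, because 0 and 2 are positions and not labels.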

In [15]:

# Some of the columns from a single row
display(df.loc[3,['Latitude','Longitude']])

Latitude     53.538013
Longitude    -1.885306
Name: 3, dtype: object

In [16]:

# All the columns from the top row
df.loc[0]

Out[16]:

Name         Kinder Scout
Type                Scout
Height              636.3
Drop                496.6
Latitude        53.384808
Longitude        -1.87391
Distance        25.792033
County         Derbyshire
Name: 0, dtype: object

When referring to a single entry in the dataframe, it is more efficient to use df.at or df.iat instead of df.loc or df.iloc.

In [17]:

print(df.at[3,"Latitude"])
print(df.iat[3,4])

53.538013
53.538013

Count the number of hills where we have to descend by more than 100m before reaching a saddle.

In [18]:

len(df[df["Drop"] > 100])

Out[18]:

Show the list of hills where the county name contains the string "Sheffield".

In [19]:

df[df["County"].str.contains("Sheffield")]

Out[19]:

	Name	Type	Height	Drop	Latitude	Longitude	Distance	County
11	Howden Edge	Edge	550.0	62.0	53.445603	-1.718567	17.039469	Sheffield
18	Back Tor	Tor	538.0	67.0	53.415411	-1.704127	14.987408	Derbyshire/Sheffield
19	Horse Stone Naze	Naze	527.0	32.0	53.474428	-1.763184	21.143873	Sheffield
28	High Neb	Neb	458.0	107.0	53.364684	-1.659110	11.637730	Derbyshire/Sheffield
39	Hoar Stones	Stones	514.0	7.0	53.481348	-1.762573	21.497691	Barnsley/Sheffield
44	Outer Edge	Edge	541.0	23.0	53.469227	-1.734726	19.218125	Sheffield
97	Margery Hill	Hill	546.0	19.0	53.457724	-1.716664	17.539911	Sheffield
106	Higger Tor	Tor	434.8	22.0	53.333791	-1.615132	10.046918	Sheffield
107	Lost Lad	Lad	519.0	10.0	53.417404	-1.710523	15.455555	Derbyshire/Sheffield
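Both of the last two cells rely on boolean masks: a comparison or a string test on a column gives a series of True/False values, and indexing with that series keeps the True rows. A small sketch with four hills from the tables above (the variable names are ours):

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Kinder Scout", "Higher Shelf Stones", "Back Tor", "High Neb"],
    "Drop": [496.6, 19.8, 67.0, 107.0],
    "County": ["Derbyshire", "Derbyshire", "Derbyshire/Sheffield", "Derbyshire/Sheffield"],
})

# A comparison on a column gives one True/False value per row.
big_drop = sample["Drop"] > 100

# True counts as 1, so summing the mask counts the matching rows.
print(int(big_drop.sum()))

# regex=False treats the pattern as a plain substring.
in_sheffield = sample["County"].str.contains("Sheffield", regex=False)

# Masks combine with & (and), | (or) and ~ (not).
both = sample[big_drop & in_sheffield]
print(list(both["Name"]))
```

Summing the mask gives the same count as len(sample[big_drop]) without building the filtered dataframe.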

Plot the positions of the hills. The sizes of the discs are determined by the heights of the hills, and the colours are determined by the drop.

In [20]:

df.assign(s=lambda x: 0.2*x.Height).plot.scatter(x='Longitude', y='Latitude', c='Drop', s='s', colormap='viridis')

Out[20]:

<Axes: xlabel='Longitude', ylabel='Latitude'>
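The assign() call above creates the marker-size column only for the purposes of the plot. The following sketch, on two rows from the table, shows that assign() returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

sample = pd.DataFrame({"Name": ["Kinder Scout", "Black Hill"],
                       "Height": [636.3, 582.0]})

# The lambda receives the dataframe and returns the values for the new column s.
with_size = sample.assign(s=lambda x: 0.2 * x.Height)

print(list(sample.columns))      # the original has no s column
print(list(with_size.columns))   # the copy has s appended
```

This keeps helper columns that are only needed for one plot out of df itself.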
