Peak district hills
import numpy as np
import pandas as pd
import geopy.distance
We use pandas to import a CSV file containing details of hills in the Peak District.
(This is a subset of the much larger database of British and Irish hills which you can find at https://www.hills-database.co.uk/.)
data_dir = '../data/misc'
df = pd.read_csv(data_dir + '/peak_district_hills.csv')
Here are several ways to get a general picture of what is in the data frame.
display(df)
display(df.describe())
display(df.info())  # df.info() prints its summary itself and returns None, so display() also shows "None"
print(f"Columns: {list(df.columns)}")
print(f"Shape: {df.shape}")
| | Name | Height | Drop | Latitude | Longitude | County |
|---|---|---|---|---|---|---|
| 0 | Kinder Scout | 636.3 | 496.6 | 53.384808 | -1.873910 | Derbyshire |
| 1 | Bleaklow Head | 633.0 | 128.0 | 53.461267 | -1.858959 | Derbyshire |
| 2 | Higher Shelf Stones | 621.8 | 19.8 | 53.449816 | -1.867610 | Derbyshire |
| 3 | Black Hill | 582.0 | 165.0 | 53.538013 | -1.885306 | Derbyshire/Kirklees |
| 4 | Shining Tor | 559.0 | 235.7 | 53.260729 | -2.009472 | Cheshire East/Derbyshire |
| ... | ... | ... | ... | ... | ... | ... |
| 108 | Snailsden Pike End | 477.2 | 28.5 | 53.527289 | -1.807632 | Barnsley |
| 109 | Sough Top | 438.5 | 16.3 | 53.234919 | -1.802053 | Derbyshire |
| 110 | Stanedge Pole | 444.1 | 17.1 | 53.356177 | -1.630601 | Derbyshire |
| 111 | Alphin Pike | 469.6 | 0.9 | 53.522014 | -1.996965 | Tameside |
| 112 | Cheeks Hill | 521.7 | 5.1 | 53.226655 | -1.961680 | Derbyshire |
113 rows × 6 columns
| | Height | Drop | Latitude | Longitude |
|---|---|---|---|---|
| count | 113.000000 | 113.000000 | 113.000000 | 113.000000 |
| mean | 446.914159 | 63.879646 | 53.298625 | -1.843522 |
| std | 84.105091 | 59.688096 | 0.131592 | 0.122374 |
| min | 235.000000 | 0.000000 | 53.047864 | -2.144702 |
| 25% | 380.000000 | 30.600000 | 53.203591 | -1.945962 |
| 50% | 450.000000 | 51.000000 | 53.291318 | -1.841480 |
| 75% | 514.000000 | 85.000000 | 53.380700 | -1.751426 |
| max | 636.300000 | 496.600000 | 53.586204 | -1.562219 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Name       113 non-null    object
 1   Height     113 non-null    float64
 2   Drop       113 non-null    float64
 3   Latitude   113 non-null    float64
 4   Longitude  113 non-null    float64
 5   County     113 non-null    object
dtypes: float64(4), object(2)
memory usage: 5.4+ KB
None
Columns: ['Name', 'Height', 'Drop', 'Latitude', 'Longitude', 'County']
Shape: (113, 6)
The plot below shows the height of the various hills in ascending order. We use df['Height'] to get the series of heights, and then sort_values() to sort them in order. However, this still leaves us with a series in which every entry is tagged with an index number giving its position in the original dataset. If we leave that index number in place, then the plotting methods will plot the heights in the order given by that index number. We therefore use reset_index(drop=True) to get rid of the index numbers before plotting.
df['Height'].sort_values().reset_index(drop=True).plot()
<Axes: >
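The effect of the index can be seen on a tiny series. This is only a sketch: the four heights are copied from the table above in a shuffled order, and the variable names are invented for the illustration.

```python
import pandas as pd

# Four heights from the table above, deliberately out of order.
heights = pd.Series([559.0, 636.3, 582.0, 621.8])

sorted_heights = heights.sort_values()
# The values are now ascending, but each keeps its original index label.
print(list(sorted_heights.index))

renumbered = sorted_heights.reset_index(drop=True)
# The index is renumbered from 0, so it now agrees with the sorted order.
print(list(renumbered.index))
```

With drop=True the old index is discarded; without it, reset_index() would keep the old index as an extra column.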
We next draw a histogram of the heights of the hills.
df['Height'].hist(bins=np.arange(200,700,50), legend=True)
<Axes: >
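To see what the bins argument means, we can pass the same edges to np.histogram and look at the counts directly. This is a sketch on five values only (the minimum, quartiles and maximum from the describe() output above), not on the full dataset.

```python
import numpy as np

# Minimum, quartiles and maximum of Height, from the describe() output above.
heights = np.array([235.0, 380.0, 450.0, 514.0, 636.3])

# Bin edges 200, 250, ..., 650; the stop value 700 itself is not included.
edges = np.arange(200, 700, 50)
counts, _ = np.histogram(heights, bins=edges)

print(edges)
print(counts)  # one count per bin, so there is one fewer count than edges
```

The last edge is 650, which is enough here because the tallest hill is 636.3m. A value above the last edge would be left out of the counts.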
Here we display a more complex histogram showing both the height and the drop. (The drop is the amount of height that you lose when descending from one hill to the highest saddle that separates it from an adjacent hill.)
df[['Height','Drop']].plot.hist()
<Axes: ylabel='Frequency'>
Here is a list of all the words that appear in the County column. We first use df.County.unique() (or alternatively df['County'].unique()) to get a numpy array containing all the values in the County column, without repetitions. This contains many strings like "Derbyshire/Kirklees". We use s.split('/') to split each such string at the '/' character. This will produce some more repetitions, but we do this inside a comprehension {w for s in ...} with curly brackets, which produces a set rather than a list and so removes the repetitions. We then wrap this in sorted(list(...)) so that the end result is a sorted list.
county_words = sorted(list({w for s in df.County.unique() for w in s.split('/')}))
county_words
['Barnsley', 'Cheshire East', 'Derbyshire', 'Kirklees', 'Oldham', 'Sheffield', 'Staffordshire', 'Stockport', 'Tameside']
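The difference between the list and set versions of the comprehension can be checked on a few sample strings. The values below are in the style of the County column; the names as_list and as_set are just for this illustration.

```python
# Sample values in the style of the County column.
counties = ["Derbyshire", "Derbyshire/Kirklees", "Cheshire East/Derbyshire", "Sheffield"]

# Square brackets give a list, which keeps every repetition.
as_list = [w for s in counties for w in s.split('/')]

# Curly brackets give a set, which keeps each word once.
as_set = {w for s in counties for w in s.split('/')}

print(as_list)
print(sorted(as_set))  # sorted() accepts any iterable and returns a list
```

Since sorted() already returns a list, sorted(as_set) gives the same result as sorted(list(as_set)).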
We now add a new column to the data frame containing the distances (in km) from the hills to the Hicks Building. Note that positions are recorded in the data frame as latitude and longitude, and it is quite complicated to calculate distances from latitudes and longitudes, especially if you want to be accurate and take account of the fact that the earth is not precisely spherical. Fortunately the function geopy.distance.distance() will do the work for us.
hicks_position = 53.38085, -1.48636
def hicks_distance(row):
    return geopy.distance.distance(hicks_position, (row["Latitude"], row["Longitude"])).km
df["Distance"] = df.apply(hicks_distance, axis=1)
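For comparison, here is a sketch of the simpler haversine formula, which treats the earth as a perfect sphere. This is not what geopy.distance.distance() does (it works with an ellipsoidal model), and the function name haversine_km and the radius constant are assumptions made for this illustration. Over distances like these the two should agree to within about one percent.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean radius; assumes a spherical earth

def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

hicks_position = 53.38085, -1.48636
kinder_scout = 53.384808, -1.873910  # latitude and longitude from the first row of the table

approx = haversine_km(hicks_position, kinder_scout)
print(round(approx, 1))
```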
We now want to tidy up the names of the hills. Some of them are like "Howden Edge [High Stones]" with extra text in square or round brackets, or after a dash. We remove all such extra text, and then remove any spaces from the beginning and end. Note that when applying string methods to the Name column, we have to call it df["Name"].str and not just df["Name"].
df["Name"] = df["Name"].str.replace(r"\([A-Za-z' ]*\)","",regex=True)
df["Name"] = df["Name"].str.replace(r"\[[A-Za-z' ]*\]","",regex=True)
df["Name"] = df["Name"].str.replace(r"- [A-Za-z' ]*","",regex=True)
df["Name"] = df["Name"].str.strip()
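The same four steps can be tried on a small series. In this sketch the first name is the one quoted above; the other two are made up to exercise the round-bracket and dash patterns.

```python
import pandas as pd

# The first name is quoted in the text; the other two are invented examples.
names = pd.Series([
    "Howden Edge [High Stones]",
    "Example Hill (North Top)",
    "Example Moor - Far End",
])

names = names.str.replace(r"\([A-Za-z' ]*\)", "", regex=True)  # text in round brackets
names = names.str.replace(r"\[[A-Za-z' ]*\]", "", regex=True)  # text in square brackets
names = names.str.replace(r"- [A-Za-z' ]*", "", regex=True)    # a dash and the words after it
names = names.str.strip()                                      # leftover spaces at the ends

print(list(names))
```

The character class [A-Za-z' ] only allows letters, apostrophes and spaces, so a bracketed note containing digits or other punctuation would be left in place by these patterns.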
We next add a new column containing the type of each feature, which is just the last word in the name. Very often this is "Hill", but sometimes it is "Tor" or "Moor" or something else. By default the new column will appear as the last column in the dataframe, but the last line below rearranges the columns so that the type is the second column, directly after the name.
def last_word(s):
    return s.split(' ')[-1]
df["Type"] = df["Name"].map(last_word)
df = df[["Name","Type","Height","Drop","Latitude","Longitude","Distance","County"]]
Here is the full list of types. We have added some extra code to lay it out with ten words per line, so that we do not need to scroll horizontally or vertically.
types = list(df["Type"].unique())
print("\n".join([' '.join(types[10*i:10*(i+1)]) for i in range(len(types)//10+1)]))
Scout Head Stones Hill Tor Gun Cloud Moor Knoll Edge
Seat Ridge Top Naze Shutlingsloe Roaches Nab Neb Churn Pike
Bolehill Rods Moss Low End Cop Cliff Famine Bank Wheeldon
Revidge Rocks Folly Lad Pole
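The layout code can also be written by stepping through the list ten places at a time. The helper lines_of below is a hypothetical name for this sketch, and the words are placeholders.

```python
def lines_of(words, per_line=10):
    """Join words into lines of at most per_line words each."""
    return [' '.join(words[i:i + per_line]) for i in range(0, len(words), per_line)]

words = [f"w{k}" for k in range(25)]  # 25 placeholder words
lines = lines_of(words)
print("\n".join(lines))
```

One difference from the version above: when the number of words is an exact multiple of ten, range(len(types)//10+1) produces one extra empty line at the end, whereas stepping with range(0, len(words), per_line) does not.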
We now make a new data frame of types, together with the number of features of each type and their average height. The groupby method produces an object of class DataFrameGroupBy, which behaves much like a list of dataframes, each containing all the hills of one type. The agg() method then aggregates information from each of these smaller dataframes, producing a new dataframe with one row for each type. The argument count=('Name', 'size') means that the new dataframe should have a column called count, and the entries in that column should be the sizes of the corresponding groups of names. For example, there are nine hills of type "Moor", so the new dataframe will have a row with Type="Moor" and count=9. Similarly, the argument height=('Height', 'mean') indicates that the new dataframe should have a column called height, and the entries in that column should be the average of the Height column over the hills of the relevant type.
There are various types for which there are only one or two hills. The code dft[dft["count"] > 2] gives a smaller dataframe in which those types are excluded, and we print the remaining types in descending order of their average heights.
dft = df.groupby("Type") \
.agg(count=('Name', 'size'), height=('Height', 'mean')) \
.sort_values(by='height', ascending=False) \
.reset_index()
dft[dft["count"] > 2]
| | Type | count | height |
|---|---|---|---|
| 5 | Head | 4 | 543.000000 |
| 8 | Tor | 5 | 514.160000 |
| 12 | Moss | 3 | 498.300000 |
| 16 | Edge | 12 | 469.825000 |
| 20 | Low | 8 | 431.375000 |
| 21 | Hill | 33 | 427.248485 |
| 25 | Pike | 3 | 404.200000 |
| 26 | Moor | 9 | 403.622222 |
| 30 | Cloud | 3 | 346.666667 |
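Here is the same chain of methods applied to five rows copied from the tables in this notebook, which makes it easy to check the result by hand. The names hills and summary are invented for this sketch.

```python
import pandas as pd

# Five rows copied from the tables in this notebook.
hills = pd.DataFrame({
    "Name": ["Kinder Scout", "Black Hill", "Win Hill", "Shining Tor", "Back Tor"],
    "Type": ["Scout", "Hill", "Hill", "Tor", "Tor"],
    "Height": [636.3, 582.0, 463.4, 559.0, 538.0],
})

summary = (
    hills.groupby("Type")
    # Each keyword names an output column; the tuple is (input column, aggregation).
    .agg(count=("Name", "size"), height=("Height", "mean"))
    .sort_values(by="height", ascending=False)
    .reset_index()
)

print(summary)
print(summary[summary["count"] > 1])  # drop the types that occur only once
```

There are two hills of type "Tor" with heights 559.0 and 538.0, so that row has count 2 and height 548.5.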
The first and third of the next three lines give the first five rows of the dataframe, whereas df.loc[0:5] gives six rows: a slice passed to df.loc[] refers to labels and includes both endpoints. Labels and row numbers coincide here because the index column at the left hand end of the dataframe just contains the numbers 0 to 112 in order. Sometimes we will deal with dataframes in which the index column is different, containing names or labels of some kind. In that context, we would use df.loc[] with labels and df.iloc[] with row numbers.
display(df[0:5])
display(df.loc[0:5])
display(df.iloc[0:5])
| | Name | Type | Height | Drop | Latitude | Longitude | Distance | County |
|---|---|---|---|---|---|---|---|---|
| 0 | Kinder Scout | Scout | 636.3 | 496.6 | 53.384808 | -1.873910 | 25.792033 | Derbyshire |
| 1 | Bleaklow Head | Head | 633.0 | 128.0 | 53.461267 | -1.858959 | 26.338430 | Derbyshire |
| 2 | Higher Shelf Stones | Stones | 621.8 | 19.8 | 53.449816 | -1.867610 | 26.486259 | Derbyshire |
| 3 | Black Hill | Hill | 582.0 | 165.0 | 53.538013 | -1.885306 | 31.751218 | Derbyshire/Kirklees |
| 4 | Shining Tor | Tor | 559.0 | 235.7 | 53.260729 | -2.009472 | 37.334829 | Cheshire East/Derbyshire |
| | Name | Type | Height | Drop | Latitude | Longitude | Distance | County |
|---|---|---|---|---|---|---|---|---|
| 0 | Kinder Scout | Scout | 636.3 | 496.6 | 53.384808 | -1.873910 | 25.792033 | Derbyshire |
| 1 | Bleaklow Head | Head | 633.0 | 128.0 | 53.461267 | -1.858959 | 26.338430 | Derbyshire |
| 2 | Higher Shelf Stones | Stones | 621.8 | 19.8 | 53.449816 | -1.867610 | 26.486259 | Derbyshire |
| 3 | Black Hill | Hill | 582.0 | 165.0 | 53.538013 | -1.885306 | 31.751218 | Derbyshire/Kirklees |
| 4 | Shining Tor | Tor | 559.0 | 235.7 | 53.260729 | -2.009472 | 37.334829 | Cheshire East/Derbyshire |
| 5 | Win Hill | Hill | 463.4 | 144.7 | 53.362373 | -1.720964 | 15.749905 | Derbyshire |
| | Name | Type | Height | Drop | Latitude | Longitude | Distance | County |
|---|---|---|---|---|---|---|---|---|
| 0 | Kinder Scout | Scout | 636.3 | 496.6 | 53.384808 | -1.873910 | 25.792033 | Derbyshire |
| 1 | Bleaklow Head | Head | 633.0 | 128.0 | 53.461267 | -1.858959 | 26.338430 | Derbyshire |
| 2 | Higher Shelf Stones | Stones | 621.8 | 19.8 | 53.449816 | -1.867610 | 26.486259 | Derbyshire |
| 3 | Black Hill | Hill | 582.0 | 165.0 | 53.538013 | -1.885306 | 31.751218 | Derbyshire/Kirklees |
| 4 | Shining Tor | Tor | 559.0 | 235.7 | 53.260729 | -2.009472 | 37.334829 | Cheshire East/Derbyshire |
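The distinction between labels and row numbers is easier to see when the index is not numeric. This sketch uses the five heights from the table above, indexed by hill name; the variable names are invented for the illustration.

```python
import pandas as pd

# Heights from the table above, indexed by hill name instead of by number.
heights = pd.Series(
    [636.3, 633.0, 621.8, 582.0, 559.0],
    index=["Kinder Scout", "Bleaklow Head", "Higher Shelf Stones", "Black Hill", "Shining Tor"],
)

by_label = heights.loc["Kinder Scout":"Higher Shelf Stones"]  # both endpoints are included
by_position = heights.iloc[0:2]                               # the stop position is excluded

print(len(by_label), len(by_position))
print(heights.loc["Black Hill"], heights.iloc[3])  # the same entry, by label and by position
```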
# Some of the columns from a single row
display(df.loc[3,['Latitude','Longitude']])
Latitude     53.538013
Longitude    -1.885306
Name: 3, dtype: object
# All the columns from the top row
df.loc[0]
Name         Kinder Scout
Type                Scout
Height              636.3
Drop                496.6
Latitude        53.384808
Longitude        -1.87391
Distance        25.792033
County         Derbyshire
Name: 0, dtype: object
When referring to a single entry in the dataframe, it is more efficient to use df.at or df.iat instead of df.loc or df.iloc.
print(df.at[3,"Latitude"])
print(df.iat[3,4])
53.538013
53.538013
Count the number of hills where we have to descend by more than 100m before reaching a saddle.
len(df[df["Drop"] > 100])
20
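The expression inside the square brackets is a series of True/False values, one for each row. As a sketch, using the Drop values of the first five hills in the table, we can count the True entries directly instead of filtering and then taking the length.

```python
import pandas as pd

# Drop values for the first five hills in the table above.
drops = pd.Series([496.6, 128.0, 19.8, 165.0, 235.7])

mask = drops > 100       # boolean Series, True where the drop exceeds 100m
print(mask.sum())        # True counts as 1 when summing
print(len(drops[mask]))  # the same count, obtained by filtering first
```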
Show the list of hills where the county name contains the string "Sheffield".
df[df["County"].str.contains("Sheffield")]
| | Name | Type | Height | Drop | Latitude | Longitude | Distance | County |
|---|---|---|---|---|---|---|---|---|
| 11 | Howden Edge | Edge | 550.0 | 62.0 | 53.445603 | -1.718567 | 17.039469 | Sheffield |
| 18 | Back Tor | Tor | 538.0 | 67.0 | 53.415411 | -1.704127 | 14.987408 | Derbyshire/Sheffield |
| 19 | Horse Stone Naze | Naze | 527.0 | 32.0 | 53.474428 | -1.763184 | 21.143873 | Sheffield |
| 28 | High Neb | Neb | 458.0 | 107.0 | 53.364684 | -1.659110 | 11.637730 | Derbyshire/Sheffield |
| 39 | Hoar Stones | Stones | 514.0 | 7.0 | 53.481348 | -1.762573 | 21.497691 | Barnsley/Sheffield |
| 44 | Outer Edge | Edge | 541.0 | 23.0 | 53.469227 | -1.734726 | 19.218125 | Sheffield |
| 97 | Margery Hill | Hill | 546.0 | 19.0 | 53.457724 | -1.716664 | 17.539911 | Sheffield |
| 106 | Higger Tor | Tor | 434.8 | 22.0 | 53.333791 | -1.615132 | 10.046918 | Sheffield |
| 107 | Lost Lad | Lad | 519.0 | 10.0 | 53.417404 | -1.710523 | 15.455555 | Derbyshire/Sheffield |
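By default str.contains() treats its argument as a regular expression. The sketch below shows two optional arguments that are sometimes useful: regex=False for a plain substring search, and na=False to treat missing values as non-matches. The missing value is added only for illustration; the info() output above shows that the County column in our dataset has none.

```python
import pandas as pd

# Sample values in the style of the County column, plus one missing entry.
counties = pd.Series(["Sheffield", "Derbyshire/Sheffield", "Derbyshire", None])

# regex=False: match the literal substring; na=False: missing values count as False.
mask = counties.str.contains("Sheffield", regex=False, na=False)

print(list(mask))
print(list(counties[mask]))
```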
Plot the positions of the hills. The sizes of the discs are determined by the heights of the hills, and the colours are determined by the drop.
df.assign(s=lambda x: 0.2*x.Height).plot.scatter(x='Longitude', y='Latitude', c='Drop', s='s', colormap='viridis')
<Axes: xlabel='Longitude', ylabel='Latitude'>