Plotting butterflies
import requests
from PIL import Image, ImageDraw, ImageFont
import random
The point of this task is to fetch and process data and images from a source on the internet. We use the requests library to fetch data about butterflies (originally supplied by the Smithsonian Institution) from the Hugging Face dataset repository. An overview of the dataset can be seen at https://huggingface.co/datasets/ceyda/smithsonian_butterflies. We will retrieve only the first 100 records.
From the example data listed at the above URL, you will see that each record has an image URL and also an entry called taxonomy, which is a string like "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae, Catocalinae". This indicates that the relevant specimen is a member of the kingdom Animalia, the phylum Arthropoda, the subphylum Hexapoda and so on, down to the subfamily Catocalinae. The groups Animalia, Arthropoda and Hexapoda are examples of taxa (the plural of taxon).
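To make the structure of the taxonomy string concrete, the following sketch splits the example string and pairs each entry with a rank name. The RANKS tuple and the label_taxa helper are assumptions made for illustration: the dataset supplies only the comma-separated string, and the rank names are inferred from this one example.

```python
# Hypothetical rank names, inferred from the example string; not part of the dataset.
RANKS = ("kingdom", "phylum", "subphylum", "class", "order", "family", "subfamily")

def label_taxa(taxonomy):
    """Split a taxonomy string and pair each taxon with a rank name."""
    taxa = [t.strip() for t in taxonomy.split(",")]
    # zip stops at the shorter sequence, so shorter strings are handled too
    return dict(zip(RANKS, taxa))

example = "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae, Catocalinae"
labels = label_taxa(example)
print(labels["family"])     # Noctuidae
print(labels["subfamily"])  # Catocalinae
```

Records whose taxonomy string has a different number of entries would need a different mapping, so this labelling should be treated as a reading aid rather than a property of the data.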
The code below constructs a list called data and a dict called data_by_taxon. Each entry in the list data is a dict containing one record from the dataset, with information about one image. Each value in the dict data_by_taxon is a list of the same kind, but restricted to a particular taxon; the special key 'All' maps to the full list. For example, data_by_taxon['Catocalinae'] is a list of records describing images of moths in the subfamily Catocalinae.
We use the requests library to fetch the data; this is documented at requests.readthedocs.io.
def get_data():
    """Downloads the first 100 records of the Smithsonian Butterflies dataset and groups them by taxon"""
    base_url = 'https://datasets-server.huggingface.co/rows'
    dataset_id = 'ceyda/smithsonian_butterflies'
    url = f'{base_url}?dataset={dataset_id}&config=default&split=train&offset=0&length=100'
    # The method requests.get(url) returns a requests.Response object
    response = requests.get(url, timeout=30)
    # Stop with an HTTPError if the server reports a failure
    response.raise_for_status()
    # The method .json() converts the response body to a Python dictionary
    dataset = response.json()
    # The dataset contains various metadata that is not useful for us.
    # We are only interested in the data itself, which is stored in the 'rows' key.
    data = [x['row'] for x in dataset['rows']]
    # We now split the data by taxon
    data_by_taxon = {'All': data}
    # Count of records per family (kept for inspection; it is not returned)
    families = {}
    for d in data:
        # d['taxonomy'] is a string containing a list of taxa separated by commas
        # We use the split() method to convert it to a list of taxa
        taxa = d['taxonomy'].split(', ')
        # The family is the sixth entry, if the string is long enough to have one
        if len(taxa) > 5:
            family = taxa[5]
            families[family] = families.get(family, 0) + 1
        for taxon in taxa:
            if taxon not in data_by_taxon:
                # If the taxon is not already in the dictionary, we add it
                data_by_taxon[taxon] = []
            data_by_taxon[taxon].append(d)
    return data_by_taxon
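The grouping step can be tried without network access. The sketch below applies the same idea to two made-up records; the group_by_taxon helper and the record contents are hypothetical, and collections.defaultdict is used here in place of the explicit membership test.

```python
from collections import defaultdict

def group_by_taxon(records):
    """Group records under every taxon named in their 'taxonomy' string."""
    grouped = defaultdict(list)
    grouped["All"] = list(records)
    for record in records:
        for taxon in record["taxonomy"].split(", "):
            # A missing key is created automatically as an empty list
            grouped[taxon].append(record)
    return dict(grouped)

# Made-up records, for illustration only
toy_records = [
    {"name": "specimen A",
     "taxonomy": "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae, Catocalinae"},
    {"name": "specimen B",
     "taxonomy": "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Pieridae, Pierinae"},
]
toy_groups = group_by_taxon(toy_records)
print(len(toy_groups["Lepidoptera"]))                  # 2
print([r["name"] for r in toy_groups["Catocalinae"]])  # ['specimen A']
```

Each record appears once under every taxon in its string, so a record is shared between the lists for its kingdom, its family, and so on, rather than copied.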
We now define a function to display a randomly chosen image. If we specify a taxon as an argument to the function, then the image will be chosen randomly from the members of that taxon.
def show_random_image(taxon='All'):
    """
    Returns a random image of the given taxon, with the name of the image superimposed.

    Parameters
    ----------
    taxon : str
        The taxon of the butterfly to display.
        If 'All', a random image of any taxon is chosen.

    Returns
    -------
    PIL.Image.Image or None
        The captioned image, or None if there are no images for the taxon.
    """
    global data_by_taxon
    # If we have not already downloaded the data, we do it now
    if 'data_by_taxon' not in globals():
        data_by_taxon = get_data()
    if taxon not in data_by_taxon:
        print(f'No images for taxon {taxon}')
        return None
    data = data_by_taxon[taxon]
    # data is now a list of dicts, each dict representing one image
    # We choose one of these dicts at random
    random_image = random.choice(data)
    # We now download the associated image using the requests and PIL libraries
    image_url = random_image['image_url']
    image0 = Image.open(requests.get(image_url, stream=True, timeout=30).raw)
    # We now shrink the image to a width of at most 500 pixels (keeping the aspect
    # ratio). Note that thumbnail() never enlarges an image that is already smaller.
    image0.thumbnail((500, 10000))
    # We now want to add a white bar 100 pixels high at the bottom of the image, so
    # that we can add some text there. To do this, we create a new blank image of
    # the required size, and paste the original image at the top of it.
    w, h = image0.size
    image = Image.new('RGB', (w, h + 100), color=(255, 255, 255))
    # In image processing it is standard to use a vertical coordinate system where
    # the top of the image counts as y=0 and the rest of the image has y>0.
    # Thus, to paste the original image at the top of the new image, we use the
    # coordinates (0, 0).
    image.paste(image0, (0, 0))
    # We now create an object called draw that we can use to draw on the image
    draw = ImageDraw.Draw(image)
    # We use this to add the name of the butterfly to the image
    try:
        font = ImageFont.truetype('arial.ttf', 24)
    except OSError:
        # Arial is not installed on every system; fall back to Pillow's built-in font
        font = ImageFont.load_default()
    draw.text((10, h + 30), random_image['name'], fill=(0, 0, 0), font=font)
    # Finally, we return the image. In a notebook it can be shown with display(image).
    return image
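The image-composition steps can also be tried offline. In the sketch below a plain coloured rectangle stands in for a downloaded photograph; the add_caption_bar helper and its parameter names are hypothetical, and Pillow's built-in default font is used so that no font file is needed.

```python
from PIL import Image, ImageDraw, ImageFont

def add_caption_bar(picture, caption, width=500, bar_height=100):
    """Shrink picture to at most `width` pixels wide and add a white caption bar below."""
    picture = picture.copy()            # thumbnail() works in place, so keep the input intact
    picture.thumbnail((width, 10000))   # preserves the aspect ratio; never enlarges
    w, h = picture.size
    canvas = Image.new("RGB", (w, h + bar_height), color=(255, 255, 255))
    canvas.paste(picture, (0, 0))       # (0, 0) is the top-left corner
    draw = ImageDraw.Draw(canvas)
    draw.text((10, h + 30), caption, fill=(0, 0, 0), font=ImageFont.load_default())
    return canvas

# A plain blue rectangle stands in for a downloaded photograph
placeholder = Image.new("RGB", (1000, 600), color=(30, 60, 200))
captioned = add_caption_bar(placeholder, "Example caption")
print(captioned.size)  # (500, 400)
```

The 1000 by 600 placeholder is shrunk to 500 by 300, and the 100-pixel bar brings the total height to 400.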
We now fetch the data and print the list of all taxa.
data_by_taxon = get_data()
taxa = sorted(data_by_taxon)
# Print the taxa ten per line
for i in range(0, len(taxa), 10):
    print(', '.join(taxa[i:i+10]))
Acrolophidae, All, Animalia, Arctiidae, Arthropoda, Catocalinae, Charaxinae, Coliadinae, Cossidae, Danainae
Geometridae, Heliconiinae, Hesperiidae, Hesperiinae, Hexapoda, Hymenoptera, Insecta, Lepidoptera, Limenitidinae, Lycaenidae
Lycaeninae, Macroglossinae, Noctuidae, Noctuinae, Nymphalidae, Nymphalinae, Papilionidae, Papilioninae, Pieridae, Pierinae
Polyommatinae, Pyrginae, Saturniidae, Satyrinae, Sphingidae, Sphinginae, Theclinae, Vespidae, Vespinae
Finally, we display a random image.
img = show_random_image('All')
img
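Because show_random_image relies on random.choice, repeated calls will generally show different images. If a reproducible choice is wanted, one option is a seeded generator. The sketch below demonstrates the idea on a short list of taxon names taken from the output above; the pick helper is hypothetical and is not part of the code in this notebook.

```python
import random

taxa_sample = ["Catocalinae", "Danainae", "Pierinae", "Satyrinae"]

def pick(items, seed):
    """Return a pseudo-random item that is reproducible for a given seed."""
    rng = random.Random(seed)  # private generator; leaves the global random state alone
    return rng.choice(items)

first = pick(taxa_sample, seed=42)
second = pick(taxa_sample, seed=42)
print(first == second)  # True
```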