Plotting butterflies

In [1]:
import requests
from PIL import Image, ImageDraw, ImageFont
import random

The point of this task is to fetch and process data and images from a source on the internet. We use the requests library to fetch data about butterflies (originally supplied by the Smithsonian Institution) from the Hugging Face repository. An overview of the dataset can be seen at https://huggingface.co/datasets/ceyda/smithsonian_butterflies. We will retrieve only the first 100 records.

From the example data listed at the above URL, you will see that each record has an image URL and also an entry called taxonomy, which is a string like "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae, Catocalinae". This indicates that the specimen in question is a member of the kingdom Animalia, the phylum Arthropoda, the subphylum Hexapoda and so on. The groups Animalia, Arthropoda and Hexapoda are examples of taxa (the plural of taxon).
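As a small illustration of how such a taxonomy string can be unpacked, the sketch below splits the example string and pairs each entry with a rank name. The list RANKS and the helper parse_taxonomy are assumptions made for this illustration: the dataset stores only the comma-separated names, not the ranks, and other records may be shorter or follow a different pattern.

```python
# Assumed rank names for the seven entries of the example string; the dataset
# itself stores only the names, not the ranks.
RANKS = ['kingdom', 'phylum', 'subphylum', 'class', 'order', 'family', 'subfamily']

def parse_taxonomy(taxonomy):
    """Split a taxonomy string into a dict mapping rank name to taxon (hypothetical helper)."""
    taxa = taxonomy.split(', ')
    # zip stops at the shorter sequence, so a shorter string simply
    # omits the lower ranks
    return dict(zip(RANKS, taxa))

example = 'Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae, Catocalinae'
parsed = parse_taxonomy(example)
print(parsed['family'])     # Noctuidae
print(parsed['subfamily'])  # Catocalinae
```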

The code below constructs a list called data and a dict called data_by_taxon. Each entry in the list data is a dict containing one record from the dataset, with information about one image. Each value in the dict data_by_taxon is a list of the same kind, but restricted to a particular taxon; the special key 'All' maps to the full list. For example, data_by_taxon['Catocalinae'] is a list of records describing images of moths in the subfamily Catocalinae.

The requests library, which performs the download, is documented at https://requests.readthedocs.io.
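The request URL carries its parameters (dataset, config, split, offset, length) in the query string. The sketch below, which uses only the standard library, shows one way such a URL might be assembled so that offset and length can be varied. The helper name rows_url is hypothetical, and whether the server accepts other values of offset and length is not checked here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://datasets-server.huggingface.co/rows'

def rows_url(offset=0, length=100, dataset_id='ceyda/smithsonian_butterflies'):
    """Build the URL for one page of rows (hypothetical helper)."""
    params = {
        'dataset': dataset_id,
        'config': 'default',
        'split': 'train',
        'offset': offset,
        'length': length,
    }
    # urlencode escapes special characters, so the '/' in the dataset id
    # is written as '%2F' in the resulting URL
    return f'{BASE_URL}?{urlencode(params)}'

url = rows_url()
print(url)
# Decode the query string again to check the parameters
query = parse_qs(urlsplit(url).query)
print(query['dataset'], query['offset'], query['length'])
```

With requests, a dictionary like params can also be passed directly, as in requests.get(BASE_URL, params=params), which performs the same encoding.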

In [2]:
def get_data():
    """Downloads the data from the Smithsonian Butterflies dataset"""
    global dataset
    base_url = 'https://datasets-server.huggingface.co/rows'
    dataset_id = 'ceyda/smithsonian_butterflies'
    url = f'{base_url}?dataset={dataset_id}&config=default&split=train&offset=0&length=100'
    # requests.get(url) returns a requests.Response object; raise_for_status()
    # raises an exception if the server reported an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # The method .json() parses the JSON response body into a Python dictionary
    dataset = response.json()
    # The dataset contains various metadata that is not useful for us.
    # We are only interested in the data itself, which is stored in the 'rows' key.
    data = [x['row'] for x in dataset['rows']]
    # We now split the data by taxon
    data_by_taxon = {'All' : data}
    families = {}
    for d in data:
        # d['taxonomy'] is a string containing a list of taxa separated by commas
        # We use the split() method to convert it to a list of taxa
        taxa = d['taxonomy'].split(', ')
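        # Tally the records per family (the sixth entry, index 5).  The tally
        # is local to this function and is not part of the return value.
        if len(taxa) > 5:
            family = taxa[5]
            families[family] = families.get(family, 0) + 1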
        for taxon in taxa:
            if taxon not in data_by_taxon:
                # If the taxon is not already in the dictionary, we add it
                data_by_taxon[taxon] = []
            data_by_taxon[taxon].append(d) 
    return data_by_taxon
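The grouping step can be tried without any download. The sketch below applies the same idea to three made-up records (the names specimen A, B and C are invented, and real records contain more fields), using collections.defaultdict for the grouping and collections.Counter for the per-family tally. It is an illustrative variant under those assumptions, not a replacement for get_data.

```python
from collections import Counter, defaultdict

# Made-up records for illustration only
PREFIX = 'Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera'
records = [
    {'name': 'specimen A', 'taxonomy': f'{PREFIX}, Noctuidae, Catocalinae'},
    {'name': 'specimen B', 'taxonomy': f'{PREFIX}, Noctuidae, Noctuinae'},
    {'name': 'specimen C', 'taxonomy': f'{PREFIX}, Pieridae, Pierinae'},
]

def group_by_taxon(records):
    """Map each taxon name to the list of records that mention it."""
    groups = defaultdict(list)
    groups['All'] = list(records)
    for record in records:
        for taxon in record['taxonomy'].split(', '):
            # defaultdict creates the empty list on first access
            groups[taxon].append(record)
    return dict(groups)

groups = group_by_taxon(records)
# The family is the sixth entry (index 5) of each taxonomy list
family_counts = Counter(r['taxonomy'].split(', ')[5] for r in records)

print(len(groups['Lepidoptera']))  # 3
print(len(groups['Noctuidae']))    # 2
print(family_counts['Pieridae'])   # 1
```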

We now define a function to display a randomly chosen image. If we specify a taxon as an argument to the function, then the image will be chosen randomly from the members of that taxon.

In [5]:
def show_random_image(taxon = 'All'):
    """
    Returns a random image of the given taxon, with the name of the image
    added in a white bar underneath.
    
    Parameters
    ----------
    taxon : str
        The taxon of the butterfly to display. 
        If 'All', the image is chosen from the whole dataset.
    
    Returns
    -------
    PIL.Image.Image or None
        The annotated image, or None if there are no images for the taxon.
    """
    global data_by_taxon
    # If we have not already downloaded the data, we do it now
    if 'data_by_taxon' not in globals():
        data_by_taxon = get_data()
    if taxon not in data_by_taxon:
        print(f'No images for taxon {taxon}')
        return
    data = data_by_taxon[taxon]
    # data is now a list of dicts, each dict representing one image
    # We choose one of these dicts at random
    random_image = random.choice(data)
    # We now download the associated image using the requests and PIL libraries
    image_url = random_image['image_url']
    image0 = Image.open(requests.get(image_url, stream=True).raw)
    # We now shrink the image in place to a width of at most 500 pixels (keeping
    # the aspect ratio).  Note that thumbnail() never enlarges an image.
    image0.thumbnail((500, 10000))
    # We now want to add a white bar 100 pixels high at the bottom of the image, so
    # that we can add some text there.  To do this, we create a new blank image of
    # the required size, and paste the original image at the top of it.
    w, h = image0.size
    image = Image.new('RGB', (w, h + 100), color=(255, 255, 255))
    # In image processing it is standard to use a vertical coordinate system where
    # the top of the image counts as y=0 and the rest of the image has y>0.
    # Thus, to paste the original image at the top of the new image, we use the
    # coordinates (0, 0).
    image.paste(image0, (0, 0))
    # We now create an object called draw that we can use to draw on the image
    draw = ImageDraw.Draw(image)
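    # We use this to add the name of the butterfly to the image.  The font file
    # arial.ttf is not available on every system, so we fall back to PIL's
    # built-in default font if it cannot be loaded.
    try:
        font = ImageFont.truetype('arial.ttf', 24)
    except OSError:
        font = ImageFont.load_default()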
    draw.text((10, h + 30), random_image['name'], fill=(0, 0, 0), font=font)
    # Finally, we return the image.  In a notebook, an image that is the last
    # expression in a cell is displayed automatically.
    return image
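The coordinate arithmetic in show_random_image can be checked separately from the image handling. The sketch below computes the scaled image size, the size of the enlarged canvas and the text position for a given original size. The helper layout is hypothetical, it assumes that the height limit of 10000 pixels never applies, and Pillow's own rounding in thumbnail() may differ by a pixel in some cases.

```python
def layout(width, height, max_width=500, bar_height=100, text_offset=(10, 30)):
    """Return (scaled size, canvas size, text position) (hypothetical helper)."""
    # Only shrink, never enlarge, as thumbnail() does
    scale = min(1.0, max_width / width)
    w = max(1, round(width * scale))
    h = max(1, round(height * scale))
    # The canvas has room for the image plus a bar underneath
    canvas_size = (w, h + bar_height)
    # y grows downwards, so the text sits below the image, inside the bar
    text_position = (text_offset[0], h + text_offset[1])
    return (w, h), canvas_size, text_position

print(layout(1000, 800))
# ((500, 400), (500, 500), (10, 430))
print(layout(300, 200))
# ((300, 200), (300, 300), (10, 230))
```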

We now fetch the data and print the list of all taxa.

In [7]:
data_by_taxon = get_data()
taxa = sorted(data_by_taxon)
n = len(taxa)
for i in range(0,n,10):
    print(', '.join(taxa[i:i+10]))
Acrolophidae, All, Animalia, Arctiidae, Arthropoda, Catocalinae, Charaxinae, Coliadinae, Cossidae, Danainae
Geometridae, Heliconiinae, Hesperiidae, Hesperiinae, Hexapoda, Hymenoptera, Insecta, Lepidoptera, Limenitidinae, Lycaenidae
Lycaeninae, Macroglossinae, Noctuidae, Noctuinae, Nymphalidae, Nymphalinae, Papilionidae, Papilioninae, Pieridae, Pierinae
Polyommatinae, Pyrginae, Saturniidae, Satyrinae, Sphingidae, Sphinginae, Theclinae, Vespidae, Vespinae

Finally, we display a random image.

In [9]:
img = show_random_image('All')
img
Out[9]:
[Output image: a randomly chosen specimen photograph with its name printed in a white bar underneath]