MAS2008 Scientific Computing: Lab 6
The Python ecosystem

Because there is an assignment due this week, there is only one question (corresponding to Task 1) on the online test.

Instructions

The following notebooks and videos are relevant for this lab:


Map of China
List of links
List notebooks
Read a Microsoft Word file

For the map of China notebook, you will also need the file china_provinces.json. For the Microsoft Word notebook, you will need a sample Word file. Here are two small files that you can use for testing: mouse.docx and abc.docx. In both cases, you will need to adjust the notebook slightly to reflect where you have saved the files.

Task 1: Spreadsheets

Write a function create_spreadsheet(filename, strings). As an example, calling


    create_spreadsheet(
        'cheese.xlsx',
        ['Cheddar', 'Wensleydale', 'Stilton']
    )

should create and save a spreadsheet file called cheese.xlsx looking like this:

	A	B
1	String	Length
2	Cheddar	7
3	Wensleydale	11
4	Stilton	7
5	Total	25

You can assume that filename will be a string ending with .xlsx and strings will be a list of strings.
The top row should contain the headers "String" and "Length".
Then there should be a row for each entry in the list strings, consisting of the string itself and its length.
The final row should contain the word "Total" and the total length of all the strings. You should not calculate the total in Python, but instead use a formula in the spreadsheet. In the example above, the formula in cell B5 would be =SUM(B2:B4).
You should use the openpyxl library to do all this. On your own PC or laptop, if you installed the packages in the file requirements.txt as explained here, then you will already have this library installed. On the university PCs you may need to install it as follows: at the top of the VS Code window click on "Terminal" then "New Terminal" and type pip install openpyxl.
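As a rough sketch of the layout described above, the following plain-Python helper builds the rows that the finished sheet should contain, including the formula text for the final cell. The name sheet_rows is made up for this illustration and is not part of openpyxl, and the sketch assumes that strings is non-empty. In openpyxl, a string beginning with = that is written to a cell is stored as a formula, and a worksheet's append method adds one row at a time, so rows like these could be passed to it one by one.

```python
def sheet_rows(strings):
    """Return the rows of the sheet as lists: header, data rows, total row.

    Hypothetical helper for illustration; assumes strings is non-empty.
    """
    rows = [["String", "Length"]]
    for s in strings:
        rows.append([s, len(s)])
    # Data occupies spreadsheet rows 2 .. len(strings) + 1 (row 1 is the header).
    last_data_row = len(strings) + 1
    rows.append(["Total", f"=SUM(B2:B{last_data_row})"])
    return rows


rows = sheet_rows(["Cheddar", "Wensleydale", "Stilton"])
for row in rows:
    print(row)
```

Note that the total is left as formula text rather than computed in Python, as the task requires.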

Your code will probably start like this:


    import openpyxl

    def create_spreadsheet(filename, strings):
        wb = openpyxl.Workbook()
        # ... etc ...

You can read the documentation or ask Google Gemini for help with the details.

You should enter your code in the online test on Moodle.

Task 2: Image manipulation

Download and save the image python_facing_left.jpg. Create a mirror image and save it as python_facing_right.jpg. Display both images in your Jupyter notebook.

You will need to find a suitable Python library to perform this task. Options include Pillow and scikit-image (imported as skimage), among others. You could read the documentation or try asking Google Gemini for instructions. I found that I had to rephrase the request a few times before I got a useful answer.
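To make precise what "mirror image" means here: each row of pixels is reversed left to right, while the order of the rows is unchanged. The sketch below shows this on a tiny grid of numbers standing in for pixels; mirror_rows is a made-up name for illustration only. Pillow provides this operation for real images as ImageOps.mirror, which is one possible route for this task.

```python
def mirror_rows(pixels):
    """Return a left-to-right mirrored copy of a grid given as a list of rows."""
    return [list(reversed(row)) for row in pixels]


# A 2x3 "image": the numbers stand in for pixel values.
image = [
    [1, 2, 3],
    [4, 5, 6],
]
mirrored = mirror_rows(image)
print(mirrored)
```

Mirroring twice gives back the original grid, which is a handy sanity check for whichever library you choose.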

Task 3: Butterflies

At https://huggingface.co/datasets/ceyda/smithsonian_butterflies you will find information about a dataset of about 9000 images of butterflies. The page shows a box with vertical and horizontal scroll bars; you will need to scroll all the way to the right to see the actual images. (Hugging Face also hosts a very large collection of other datasets of all kinds.)

Now visit the following URL:

https://datasets-server.huggingface.co/rows?dataset=ceyda/smithsonian_butterflies&config=default&split=train&offset=0&length=5

This will show you information about the first 5 images in the dataset, in JSON format, which is a standard and convenient format for data that is intended to be parsed by software rather than read by humans. Note that this data contains URLs for the individual images, but does not contain the images themselves.
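The part of the URL after the ? is a list of name=value pairs separated by &; offset and length select which rows are returned. As a small sketch, the same URL can be assembled from its parts with the standard library, which makes it easy to change length later. The function name rows_url is an assumption made for this example.

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co/rows"


def rows_url(offset=0, length=5):
    """Build the rows URL for the butterfly dataset (hypothetical helper)."""
    params = {
        "dataset": "ceyda/smithsonian_butterflies",
        "config": "default",
        "split": "train",
        "offset": offset,
        "length": length,
    }
    # safe="/" keeps the slash in the dataset name unescaped, as in the URL above.
    return BASE + "?" + urlencode(params, safe="/")


print(rows_url())
```

Calling rows_url(length=100) gives the same URL with length=5 changed to length=100.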

We can import the first 100 rows into Python and analyse them as follows.

Import the requests library to allow us to fetch data from the web.
Let url be the URL as above, but with length=5 changed to length=100.
Enter dataset=requests.get(url).json() then data=dataset['rows'] to fetch the data.
Work out how to extract the URL of the image of the butterfly in row 42. Entering type(data) will tell you that data is a list. Entering type(data[42]) will tell you that data[42] is a dict. Entering data[42].keys() will tell you the keys of the dict. The first key is row_idx, so you can enter data[42]['row_idx'], but that just gives you the number 42 again, which is not useful; so you need to try the other keys.
After you have found the image URL and called it image_url, you can display the image in your notebook by entering from PIL import Image then img=Image.open(requests.get(image_url, stream=True).raw) then img.
Now tidy things up: put the code to download the data into a function get_data() which returns data, and define a function show_butterfly(data,k) which downloads and displays the image of the butterfly in row k. Define another function show_random_butterfly(data) that displays a randomly chosen butterfly.
Butterflies (like all organisms) can be divided into families. Each row of the dataset contains an entry called taxonomy, which is a single string containing words separated by commas. You can split the string at the commas and then strip the spaces to get a list of words. The family is the word in position 5 (counting from zero as usual). For example, the very first butterfly in the dataset is in the family "Noctuidae". Write a function family(data,k) which returns the family of the butterfly in row k.
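The split-and-strip step can be sketched on its own, separately from the dataset. The taxonomy string below is a made-up example in the format described, not copied from the dataset, and family_from_taxonomy is a hypothetical helper name. A full family(data,k) would still need to look up the taxonomy entry for row k first.

```python
def family_from_taxonomy(taxonomy):
    """Return the word in position 5 of a comma-separated taxonomy string."""
    words = [word.strip() for word in taxonomy.split(",")]
    return words[5]


# Illustrative string only; real entries in the dataset may differ.
sample = "Animalia, Arthropoda, Hexapoda, Insecta, Lepidoptera, Noctuidae"
print(family_from_taxonomy(sample))
```

Stripping after splitting matters because each word after the first is preceded by a space.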
