Analysing COVID-19 data
This assignment asks you to create a number of general functions to process a JSON file. You will also need
to create some tests and commit your work as you progress in a git repository.
The test file provided is not real and should work without difficulty; however, real files will be provided that
include different levels of difficulty to the exercise (e.g., missing data, different number of elements on some
Your job consists of modifying/creating the functions so that the code works with the “simple” provided file,
while also being general enough to process other files that share the same schema but vary in data quality.
You’ll also need to write some specific tests and demonstrate your ability to use git version control and
The exercise will be semi-automatically marked, so it is very important that your solution adheres to the
correct file and folder name convention and structure, as defined in the rubric below. An otherwise valid
solution which doesn’t work with our marking tool will not be given credit.
First, we set out the problem we are solving. Next, we specify in detail the target for your solution. Finally,
to assist you in creating a good solution, we state the marks scheme we will use.
1 Our epidemiologist friend and his data problem
Jim is an epidemiologist at WHO. Lately he has had a lot of data to analyse from different regions
around the world. However, he never expected he would have to learn that much about computers. Thankfully, his good friend Carmen is amazing at writing scripts with Python and is afraid of no JSON file (however
big or complex they are!). There’s an additional difficulty though: WHO computers are very locked down!
As the data they handle contains personal information, the computers have limitations on what software
can be run on them and what data can be transferred to and from them. This means that Carmen will
have to help Jim write some scripts in plain Python (i.e., we don’t have access to NumPy or Pandas!), with
matplotlib being the only exception allowed. “Not to worry”, Carmen thinks, “if there’s anywhere I’m stuck
my friends at MPHY0021 can help me to solve these problems”.
Jim has been able to generate a fake sample file that he can get out of the WHO computers and send to
Carmen, so she can generate the functions that Jim can then load from a Jupyter notebook.
Inspecting the sample file, Carmen finds that it contains geographic and demographic information about
a particular region in the world, information about some age binning, and daily evolution data for many
measurements such as the number of people that have been hospitalised, tested or deceased. It also contains
information about the weather of that day and different actions taken by the government in that region.
Based on Jim’s needs, Carmen has created the skeleton of some functions and a notebook showing how Jim is
expected to use them.
2 Your mission!
You are required to modify the provided Python file (process_covid.py) so that Jim’s notebook
(Jim.ipynb) works as expected. You will also need to add some tests (within test_process_covid.py)
using pytest and save it all in a git repository.
2.1 Reading JSON files
The first problem we need to solve is how to load the JSON files Jim is working with. Your goal is to create
a function that loads the content of these files into a Python object. You are free to change the structure of
the data after loading it if that makes any of the later steps easier. The function needs to be general enough to
read files with different names and in different locations in the file system.
Make sure to check that the input file follows a valid schema (i.e., it has the same structure as the sample
file), and raise an error if it doesn’t.
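As a starting point, the loading and checking steps could be sketched as follows. This is only a minimal illustration in plain Python: the function name load_covid_data and the top-level keys in REQUIRED_KEYS are assumptions made for the example, not names taken from the sample file, so replace them with whatever the provided skeleton and sample actually use. A complete solution would also check the nested structure, not just the top level.

```python
import json
from pathlib import Path

# ASSUMPTION: hypothetical top-level keys, used only for illustration.
# Replace them with the keys found in the provided sample file.
REQUIRED_KEYS = {"metadata", "region", "evolution"}


def load_covid_data(filepath):
    """Load a JSON file and check its top-level structure.

    `filepath` can be a string or a Path, so the function works with any
    file name and any location in the file system.
    """
    path = Path(filepath)
    with path.open(encoding="utf-8") as json_file:
        # json.load raises json.JSONDecodeError (a ValueError) on invalid JSON.
        data = json.load(json_file)

    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"{path}: missing required keys {sorted(missing)}")

    return data
```

Raising ValueError in both cases means a caller (or a test) only has to handle one exception type for "this file does not look like the sample".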
2.2 Evolution of confirmed cases by age groups in terms of the total population
Jim needs to compare how the virus is spreading across the population for different age groups (i.e., the total
number of confirmed cases). To show the impact on these population groups it’s better to calculate these
values as a percentage of each population group.
There are two problems though. The first one is that not all regions are providing the confirmed cases
broken down into age groups. When this happens, we need to signal it properly so Jim understands what the
problem is. The second difficulty is that the age ranges provided to indicate the population distribution and
the number of cases confirmed are not always the same. To overcome this problem, we will rebin the data
from one dataset into the other, but only when it’s possible.
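To make the rebinning idea concrete, here is one possible sketch in plain Python. It assumes, purely for illustration, that age bins are labelled with strings such as "0-24" and that an open-ended last bin is written as "75-"; the real files may use a different format. The helper names parse_bin and rebin are hypothetical. The sketch only merges finer bins into coarser ones, and raises an error when the bin edges do not line up, which is the "only when it’s possible" case.

```python
import math


def parse_bin(label):
    """Turn '25-49' into (25, 50) and '75-' into (75, inf), i.e. [low, high)."""
    low, _, high = label.partition("-")
    return int(low), (int(high) + 1 if high else math.inf)


def rebin(source, target_labels):
    """Sum `source` ({bin label: count}) into the coarser bins `target_labels`.

    Raises ValueError if a target bin cannot be built exactly from source bins.
    """
    pieces = sorted((parse_bin(label), count) for label, count in source.items())
    result = {}
    for label in target_labels:
        low, high = parse_bin(label)
        inside = [(b, c) for b, c in pieces if low <= b[0] and b[1] <= high]
        # The source bins inside must tile [low, high) with no gap or overlap.
        edge = low
        for (piece_low, piece_high), _ in inside:
            if piece_low != edge:
                edge = None
                break
            edge = piece_high
        if edge != high:
            raise ValueError(f"cannot rebin into {label!r}: bin edges do not match")
        result[label] = sum(count for _, count in inside)
    return result


# Made-up numbers: population in four bins, confirmed cases in two.
population = {"0-24": 500, "25-49": 400, "50-74": 300, "75-": 100}
cases = {"0-49": 45, "50-": 20}

rebinned = rebin(population, list(cases))  # {'0-49': 900, '50-': 400}
percent = {label: 100 * cases[label] / rebinned[label] for label in cases}
# percent == {'0-49': 5.0, '50-': 5.0}
```

Asking for bins such as "0-29" and "30-" with the population above would raise ValueError, because no combination of the source bins ends at age 29.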