
FIT1045 Algorithms and programming in Python, S2-2020
Programming Assignment

Overview
In this assignment we create a Python module to perform some basic data science tasks. While the instructions
contain some mathematics, the main focus is on implementing the corresponding algorithms and finding a
good decomposition into subproblems and functions that solve these subproblems. In particular, we want to
implement a method that is able to estimate relationships in observational data and make predictions based
on these relationships. For example, consider a dataset that measures the average life expectancy for various
countries together with other factors that potentially have an influence on the life expectancy, such as the
average body mass index (BMI), the level of schooling in that country, or the gross domestic product (GDP)
(see Table 1 for an example). Can we predict the life expectancy of Rwanda? And can we predict how the life
expectancy of Liberia would change if it improved its schooling to the level of Austria?
Country   Adult Mortality   Infant Deaths   BMI    GDP         Schooling   Life expectancy
Austria   10.77             4.32            25.4   4.55e+11    15.7        81.1
Georgia   18.23             15.17           27.1   1.76e+10    13.5        74.5
Liberia   29.95             74.52           24.0   3.264e+09   9.8         61.1
Mexico    30.73             17.29           28.1   1.22e+12    12.9        76.6
Rwanda    35.31             64.04           22.1   9.509e+09   10.8        ?

Table 1: Example dataset on Life Expectancy.
To answer these questions, we have to find the relationships between the target variable (in this example, the life expectancy) and the explanatory variables ("Adult Mortality", "Infant Deaths", etc.; see Figure 1). For this we want to develop a method that finds linear relationships between the explanatory variables and the target variable, and use it to understand those relationships and make predictions on the target variable. This method is called linear regression. The main idea of linear regression is to use data to infer a prediction function that 'explains' a target variable through linear effects of one or more explanatory variables. Finding the relationship between one particular explanatory variable and the target variable (e.g., the relationship between schooling and life expectancy) is called a univariate regression. Finding the joint relationship of all explanatory variables with the target variable is called a multivariate regression.
[Figure 1: Visualization of the relationship between explanatory variables and the target variable. (a) Adult mortality vs. life expectancy. (b) Schooling vs. life expectancy.]
In this assignment, you will develop functions that perform univariate regression (Part 1) and multivariate
regression (Part 2) and use them to answer questions about a dataset on life expectancy, similar to the example
above.
To help you visually check and understand your implementation, a module for plotting data and linear
prediction functions is provided.
Part 1: Univariate Regression (10%, due in Week 6)
The first task is to model a linear relationship between one explanatory variable and the target variable.
The data in this assignment is always represented as a table containing m rows and n columns, together with a list
of length m containing the corresponding target values. The table is represented as a list of m elements,
each being a list of n values, one for each explanatory variable.
An example dataset with one explanatory variable x and a target variable y would be
>>> x = [1,2,3]
>>> y = [1,4,6]
and an example dataset with two explanatory variables x^(1) and x^(2) would be
>>> data = [[1,1],[2,1],[1,3]]
>>> y = [3,4,7]
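In this second example, the table has m = 3 rows (one per observation) and n = 2 columns (one per explanatory variable), and the target list y again has length m = 3.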
Task A: Optimal Slope (2 Marks)
Let us now start by modelling the linear relationship between one explanatory variable x and the target variable
y based on the data of m observations (x1, y1), . . . , (xm, ym). To start out simple, we model the relation of x and y as

y = ax , (1)

i.e., a straight line through the origin. For this model we want to find an optimal slope parameter a that fits
the given data as well as possible. A central concept to solve this problem is the residual vector, defined as
r = (y1 − ax1, y2 − ax2, . . . , ym − axm) ,
i.e., the m-component vector that contains for each data point the difference of the target value and the
corresponding predicted value. Intuitively, for a good fit, all components of the residual vector should have a
small magnitude (being all 0 for a perfect fit). A popular approach is to determine the optimal parameter value
a as the one that minimises the sum of squared components of the residual vector, i.e., the quantity
r · r = r1² + r2² + · · · + rm² ,
which we also refer to as the sum of squared residuals. With some math (that is outside the scope of this
unit) we can show that for the slope parameter a that minimises the sum of squared residuals it holds that the
explanatory data is orthogonal to the residual vector, i.e.,
x · r = 0 . (2)
Here, · refers to the dot product of two vectors, which in this case is
x · r = x1r1 + x2r2 + · · · + xmrm .
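As an aside, such a dot product translates directly into Python; the helper below is a small sketch of our own (the name dot is not part of the required interface):

>>> def dot(u, v):
...     return sum(ui * vi for ui, vi in zip(u, v))
...
>>> dot([1, 2, 3], [4, 5, 6])
32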
Plugging the definition of r into Equation 2 yields
(x · x) a = x · y (3)
from which we can easily find the optimal slope a.
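For example, for x = (0, 1, 2) and y = (1, 1.2, 2) we get x · y = 0·1 + 1·1.2 + 2·2 = 5.2 and x · x = 0·0 + 1·1 + 2·2 = 5, so Equation 3 gives a = 5.2/5 = 1.04.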
Instructions
Based on Equation 3, write a function slope(x, y) that computes the optimal slope for the simple regression
model in Equation 1.
Input: Two lists of numbers (x and y) of equal length representing data of an explanatory and a target variable.
Output: The optimal least squares slope (a) with respect to the given data.
For example:
def slope(x, y):
    """
    Computes the slope of the least squares regression line
    (without intercept) for explaining y through x.

    >>> slope([0, 1, 2], [0, 2, 4])
    2.0
    >>> slope([0, 2, 4], [0, 1, 2])
    0.5
    >>> slope([0, 1, 2], [1, 1, 2])
    1.0
    >>> slope([0, 1, 2], [1, 1.2, 2])
    1.04
    """
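For illustration, one possible body for this function, directly following Equation 3, is the sketch below (it assumes x is not the all-zero vector, so that x · x > 0):

def slope(x, y):
    # Equation 3: (x . x) a = x . y, so a = (x . y) / (x . x).
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)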
If you want to visualize the data and predictor, you can use the plotting functions provided in plotting.py.
For example
>>> import matplotlib.pyplot as plt
>>> from plotting import linear, plot2DData
>>> X = [0, 1, 2]
>>> Y = [1, 1, 2]
>>> a = slope(X, Y)
>>> b = 0
>>> linear(a, b, x_min=0, x_max=2)
>>> plot2DData(X, Y)
>>> plt.show()
will produce the plot given in Figure 2.

[Figure 2: Example plot generated with the functions provided in plotting.py.]
Task B: Optimal Slope and Intercept (3 Marks)
[Figure 3: Linear regression with an intercept (y = ax + b) and without (y = ax) for the three points (0, 1), (1, 1.2), and (2, 2).]
Consider the example regression problem in Figure 3. Even with an optimal choice of a, the regression model y = ax of Equation 1 does not fit the data well. In particular, the point (0, 1) is problematic because the model given in Equation 1 is forced to run through the origin and thus has no degree of freedom to fit a data point with x = 0. To improve this, we have to consider regression lines that are not forced to run through the origin, by extending the model equation to

y = ax + b (4)

where in addition to the slope a we also have an additive term b called the intercept. Now we have two parameters a and b to optimise, but it turns out that a simple modification of Equation 3 lets us solve this more complicated problem.
Instead of requiring the residual to be orthogonal to the explanatory data as is, we now require orthogonality to the centred version of the data. That is, x̄ · r = 0, where x̄ denotes the centred data vector

x̄ = (x1 − µ, x2 − µ, . . . , xm − µ)

with µ = (x1 + x2 + · · · + xm)/m being the mean value of the explanatory data. Again, we can rewrite the above condition as a simple linear equation involving two dot products

(x̄ · x̄) a = x̄ · y (5)
which again allows us to easily determine the slope a. However, now we also need to find the optimal intercept b
corresponding to this value of a. For this, we can simply take the average target value minus the average fitted
value when using only the slope (in other words, the average residual):
b = ((y1 − ax1) + · · · + (ym − axm))/m . (6)
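For example, for x = (0, 1, 2) and y = (1, 1, 2) we have µ = 1 and x̄ = (−1, 0, 1), so Equation 5 gives a = (x̄ · y)/(x̄ · x̄) = 1/2, and Equation 6 gives b = ((1 − 0) + (1 − 0.5) + (2 − 1))/3 = 2.5/3 ≈ 0.8333, matching the example below.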
Instructions
Using Equations (5) and (6) above, write a function line(x, y) that computes the optimal slope and intercept
for the regression model in Equation 4.
Input: Two lists of numbers (x and y) of equal length representing data of an explanatory and a target variable.
Output: A tuple (a,b) where a is the optimal least squares slope and b the optimal intercept with respect to
the given data.
For example:
>>> line([0, 1, 2], [1, 1, 2])
(0.5, 0.8333333333333333)
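Analogously, one possible implementation following Equations (5) and (6) is the sketch below (it assumes that not all values in x are identical, so that the centred data is not the all-zero vector):

def line(x, y):
    # Centre the explanatory data around its mean.
    mu = sum(x) / len(x)
    xc = [xi - mu for xi in x]
    # Optimal slope with respect to the centred data (Equation 5).
    a = sum(xci * yi for xci, yi in zip(xc, y)) / sum(xci * xci for xci in xc)
    # Optimal intercept: the average residual when using only the slope (Equation 6).
    b = sum(yi - a * xi for xi, yi in zip(x, y)) / len(y)
    return (a, b)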