User-centric Systems for Data Science Assignment 2
In this assignment you will extend the operator library you built for Assignment 1 to support transparent
provenance tracking. The resulting library will enable users to retrieve various types of
provenance-related information for individual tuples, such as lineage, Where- and How-provenance. The
last assignment task focuses on the concept of data responsibility, which we will discuss in Lecture 6.
For this and future assignments, you will need to have Python 3.7+, Pytest, and Ray installed on your
machine (cf. Section 10 “Resources” for more information).
You must follow the code skeleton provided in the Gitlab repository. Inline comments will help you
identify the parts of the code you need to fill in. Keep in mind that the assignment does not require writing
much code: the logic of each data operator can be implemented in fewer than 20 lines of code. Always keep your
code simple and well documented.
We will be using a mix of real and synthetic data. Real data include movie ratings from a large Netflix
dataset whereas friendship relationships between users are synthetic. The input data are available in the
Gitlab repository. Make sure you understand the data format first (cf. Section 1 “Data schema”). You
might also want to create a toy dataset of the same format to test your code easily.
1. Data schema
The data we will use for this assignment consist of two CSV files: Friends and Ratings. The former
contains friendship relationships as tuples of the form UserID1 UserID2, denoting two users who are
friends. A user can have one or more friends and the friendship relationship is symmetric: if A is a
friend of B, then B is also a friend of A and both tuples (A B and B A) are present in the file. Ratings
contains user ratings as tuples of the form UserID MovieID Rating. For example, tuple 12 3 4
means that “the user with ID 12 gave 4 stars to the movie with ID 3”.
Hint #1: You can use Python’s CSV reader to parse input files.
Hint #2: Consider encapsulating your Python tuples in ATuple objects (see code skeleton).
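The two hints above can be combined as in the following minimal sketch. The ATuple class shown here is a hypothetical stand-in for the one in the code skeleton, and the function name read_tuples, the space delimiter, and the conversion of every field to an integer are assumptions made for illustration; adjust them to the actual files and skeleton.

```python
import csv
import io
from typing import Iterable, List, Tuple


class ATuple:
    """Hypothetical minimal stand-in for the skeleton's ATuple class."""

    def __init__(self, values: Tuple, operator=None):
        self.tuple = values        # the raw Python tuple
        self.operator = operator   # operator that produced this tuple, if any

    def __repr__(self) -> str:
        return f"ATuple({self.tuple})"


def read_tuples(lines: Iterable[str], delimiter: str = " ") -> List[ATuple]:
    """Parse delimited text lines into ATuples (assumes integer fields)."""
    reader = csv.reader(lines, delimiter=delimiter)
    # csv.reader yields an empty list for blank lines; skip those.
    return [ATuple(tuple(int(field) for field in row)) for row in reader if row]


# Toy data in the format described above; friendships appear in both directions.
friends = read_tuples(io.StringIO("1 2\n2 1\n1 3\n3 1\n"))
ratings = read_tuples(io.StringIO("2 10 4\n3 10 2\n3 11 5\n"))
```

To read the real inputs, pass an open file object instead of io.StringIO, for example open(path, newline=""), where path is whatever location the files have in your checkout.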
2. TASK I: Implement backward tracing (credits: 40/100)
The first task is to extend the operators you built in Assignment 1 with support for backward tracing. For
each operator, you will have to implement a new method (in Python 3 syntax):
lineage(tuples: List[ATuple]) -> List[List[ATuple]]
that returns the lineage of the given list of tuples.
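One possible way to support this method is to have each operator remember, for every output tuple, the input tuple(s) it was derived from, and to resolve lineage recursively through its child. The sketch below illustrates the idea for a filter. The classes Scan and Select, their attributes, and the simplified ATuple are hypothetical and do not reflect the skeleton's actual interfaces.

```python
from typing import Callable, Dict, List, Tuple


class ATuple:
    """Hypothetical simplified tuple wrapper."""

    def __init__(self, values: Tuple, operator=None):
        self.tuple = values
        self.operator = operator


class Scan:
    """Leaf operator: the tuples it emits are the query's input tuples."""

    def __init__(self, rows: List[Tuple]):
        self.output = [ATuple(r, operator=self) for r in rows]

    def lineage(self, tuples: List[ATuple]) -> List[List[ATuple]]:
        # The lineage of an input tuple is the tuple itself.
        return [[t] for t in tuples]


class Select:
    """Filter operator that records which input produced each output."""

    def __init__(self, child: Scan, predicate: Callable[[ATuple], bool]):
        self.child = child
        self.predicate = predicate
        self.source: Dict[ATuple, ATuple] = {}  # output tuple -> input tuple

    def run(self) -> List[ATuple]:
        out = []
        for t in self.child.output:
            if self.predicate(t):
                o = ATuple(t.tuple, operator=self)
                self.source[o] = t  # keyed by object identity
                out.append(o)
        return out

    def lineage(self, tuples: List[ATuple]) -> List[List[ATuple]]:
        # Map each output back to its input, then let the child continue.
        return self.child.lineage([self.source[t] for t in tuples])


scan = Scan([(2, 10, 4), (3, 10, 2), (3, 11, 5)])
select = Select(scan, lambda t: t.tuple[1] == 10)
result = select.lineage(select.run())
```

The return value holds one inner list per requested tuple, which is why the signature returns a list of lists. Operators such as joins or aggregations would store several inputs per output instead of one.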
As discussed in Lecture 2, the lineage of an output tuple t with respect to a query q(D) is the
collection of input tuples that contributed to having the tuple t in the output of the query. Let
recommendation be the output (movie id) of the second query from Assignment 1:
SELECT MID
FROM ( SELECT R.MID, AVG(R.Rating) as score
       FROM Friends as F, Ratings as R
       WHERE F.UID2 = R.UID
             AND F.UID1 = 'A'
       GROUP BY R.MID
       ORDER BY score DESC
       LIMIT 1 )
To successfully complete this task, you must implement a new method for ATuple:
lineage() -> List[ATuple]
so that you can retrieve the lineage of any recommendation as follows:
lineage = recommendation.lineage()
Calling recommendation.lineage() should internally call:
operator.lineage(tuples=[recommendation])
where operator is a handle to the operator that produced the tuple recommendation (i.e. the root
operator of the query tree).
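The delegation from a tuple to its producing operator can be sketched as follows. This is an illustration under stated assumptions, not the skeleton's code: the classes Scan and AvgByKey, the operator attribute on ATuple, and the choice to take the first element of the operator's result are all hypothetical. The example also omits the join with Friends for brevity, so the lineage contains Ratings tuples only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ATuple:
    """Hypothetical simplified tuple wrapper with a lineage() method."""

    def __init__(self, values: Tuple, operator=None):
        self.tuple = values
        self.operator = operator  # handle to the operator that produced it

    def lineage(self) -> List["ATuple"]:
        # Ask the producing operator; it returns one list per requested tuple.
        return self.operator.lineage([self])[0]


class Scan:
    """Leaf operator: its tuples are the query inputs."""

    def __init__(self, rows: List[Tuple]):
        self.output = [ATuple(r, operator=self) for r in rows]

    def lineage(self, tuples: List[ATuple]) -> List[List[ATuple]]:
        return [[t] for t in tuples]


class AvgByKey:
    """Groups child tuples by field `key` and averages field `value`."""

    def __init__(self, child: Scan, key: int, value: int):
        self.child, self.key, self.value = child, key, value
        self.groups: Dict[ATuple, List[ATuple]] = {}  # output -> group members

    def run(self) -> List[ATuple]:
        by_key = defaultdict(list)
        for t in self.child.output:
            by_key[t.tuple[self.key]].append(t)
        out = []
        for k, members in by_key.items():
            avg = sum(m.tuple[self.value] for m in members) / len(members)
            o = ATuple((k, avg), operator=self)
            self.groups[o] = members
            out.append(o)
        return out

    def lineage(self, tuples: List[ATuple]) -> List[List[ATuple]]:
        result = []
        for t in tuples:
            contributing: List[ATuple] = []
            # Flatten the child's lineage of every member of the group.
            for per_input in self.child.lineage(self.groups[t]):
                contributing.extend(per_input)
            result.append(contributing)
        return result


ratings = Scan([(2, 10, 5), (3, 10, 4), (3, 11, 3)])  # UserID, MovieID, Rating
scores = AvgByKey(ratings, key=1, value=2).run()
recommendation = max(scores, key=lambda t: t.tuple[1])
print([t.tuple for t in recommendation.lineage()])  # [(2, 10, 5), (3, 10, 4)]
```

Movie 10 has the highest average rating (4.5), so its lineage consists of the two ratings that were averaged. In the full query, the Friends tuples that matched those ratings in the join would be part of the lineage as well.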