In this assignment, you will tackle a real-world data integration and data quality
challenge. You will answer a series of questions that demonstrate your understanding
of these topics, as well as your problem-solving and implementation skills.
Tips & Suggestions:
1. It is highly recommended that you complete Prac 3 before working on the implementation
section of this assignment (Part 4). The assignment is independent of Prac 3, but the
Prac 3 solutions will help you considerably with the coding part.
2. Each dataset used in this assignment contains thousands of records, which makes it
impractical to check them manually, record by record. It is therefore recommended that you
use a convenient viewer or text editor (e.g. Microsoft Excel, Notepad++ or Sublime Text on
Windows) to view and search the contents of the CSV files. Make full use of the search
function (usually CTRL+F) to look for specific values, tuples or characters. Take care not
to change the data unintentionally while viewing or searching, as this may affect your
assignment results.
3. The programming language is not limited to Java or Python. Choose the one you are most
comfortable with and stick to it until you finish the assignment. Your code must contain
basic comments so that tutors can understand its structure and the
purpose of each snippet.
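
For tip 2, the search can also be scripted if you prefer that to a text editor. The
following is only a minimal sketch in Python, not a required approach: the function name
`find_rows` and the file name `books.csv` are hypothetical and not part of the assignment
material, and UTF-8 encoding is an assumption you may need to adjust. The file is opened
for reading only, so the data cannot be changed by accident.

```python
import csv


def find_rows(csv_path, needle, encoding="utf-8"):
    """Return (line_number, row) pairs where any field contains `needle`.

    The file is opened in read mode only, so its contents are never modified.
    """
    matches = []
    with open(csv_path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        for row in reader:
            # Case-sensitive substring match against every field of the record.
            if any(needle in field for field in row):
                # reader.line_num is the physical line on which the record ends.
                matches.append((reader.line_num, row))
    return matches


# Hypothetical usage (replace the file name with one of your dataset files):
# for line_no, row in find_rows("books.csv", "Tolkien"):
#     print(line_no, row)
```

The match is case-sensitive and reports every record containing the search text, much like
CTRL+F would, which makes it easy to check a suspicious value across a whole file.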

Please format your document neatly, with a consistent font, font size and spacing. We
suggest that your answers follow the structure below (there is no need to repeat the
questions unless necessary, and the choice of font and spacing is up to you):

Part 1.
Question 1: Your answers…
Question 2: Your answers…
Part 2.

WARNING: Please complete this assignment individually. Any type of answer-sharing among
classmates is not acceptable and, once identified, will be penalized.

Preliminary: Dataset Description

In this assignment, we have four datasets containing book information from four different sources.
The data schemas are listed below:




