INFS3200 Data Pre-processing Assignment
Preliminary: Dataset Description
In this assignment, we have four datasets about book information from four different sources.
The data schemas are listed below:
Part 1: Data Schema Questions [6 marks]
Read the above schemas carefully and understand the meaning of each attribute. If you don’t
know the meaning of a certain attribute, check the data under it or Google its meaning (especially
for some abbreviations, like ISBN). Answer the following questions based on your
Question 1: Given that all four book datasets were stored in different relational databases,
(1) [1 marks] Can attribute authors be the primary key for Book1? Why?
(2) [1 marks] Given a query “Find the top 99 books which are sold the most, return their
ranks (sorted in ascending order), titles, PaperbackPrice, HardcoverPrice, EbookPrice,
AudiobookPrice of those books.”, which schema is capable of answering such query?
Write down the corresponding SQL query on that schema.
Question 2: Given that Book2 is stored in a distributed database A, and two queries that are
most frequently asked on A are:
• Find all books whose publisher name is “XXX” (or among multiple publishers),
return their book titles and author info.
• Find all books that are published in a given year, return their book IDs, languages
and number of pages.
Answer the following questions:
(1) [2 marks] If the goal of A is to handle each query by a dedicated local site, which
fragmentation strategy should be used to fragment Book2 table? If only two fragments
are generated, write their schemas (if vertically fragmented) or predicates (if horizontally
fragmented), respectively. (Note: there are lots of valid fragmentation solutions, just
provide one of them.)
(2) [2 marks] Assuming that we horizontally fragmented the table into three fragments
based on the following predicate:
Fragment 1: 1 ≤ publication_day ≤ 10
Fragment 2: 11 ≤ publication_day ≤ 20
Fragment 3: 21 ≤ publication_day ≤ 31
If we want to insert a new record into Book2, please explain the insert process in plain
English (you can use an example to demonstrate the process).
Part 2: Data Warehouse Design [7 marks]
In this part, we aim to design a data warehouse on the book sales system. Specifically, we
obtained the data from the given datasets and create a table which contains the total sales on each
publisher, each day and each language. An example table is shown as follows:
Day Publisher Language Sales
07/15/1984 AAAI Press English 11
05/05/1990 Springer International Publishing English 23
06/04/1995 Springer London English 15
12/11/2000 IEEE Computer Society Press English 30
04/03/2004 AAAI Press Spanish 2
05/01/2008 Springer International Publishing Spanish 13
11/19/2012 Springer London Spanish 5
08/06/2014 IEEE Computer Society Press Spanish 22
Question 3: Given the above example, answer the following questions:
(1) [1 marks] For each of the above four columns, identify if that column is a dimension
column or a fact column.
Question 4: Now we want to create bitmap indices for the given model:
(1) [2 marks] What are the advantages of building a bitmap index? Which type of
column is suitable for bitmap index?
(2) [2 marks] Suppose the “Publisher” column only contains four distinct values and
“Language” only contains two, which are all shown in the above example. Please
create bitmap indices for both “Publisher” and “Language”.
(3) [2 marks] Explain how to use the bitmap indices to find the total sales of “Spanish”
books published by “AAAI Press”.
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: firstname.lastname@example.org 微信:itcsdx