CMPT 459 Assignment 2
Due: 11:59 pm, June 19, 2020
100 points in total
This assignment covers the materials in data warehousing and OLAP.
Question 1 (30 points)
This question is about the concept of data warehousing and OLAP.
You learn how to define a data warehouse on relational data. Let us try to extend the idea of
multi-dimensional analysis to some other kinds of data.
1. Consider a data set containing tweets, and assume the text is properly pre-processed. If
we use each keyword (token) as a dimension, use each tweet as a record, and use COUNT()
as the aggregate function, can you give 3 aggregate queries Q1, Q2, and Q3, such that Q1
is a roll-up of Q2, and Q3 is a drill-down of Q1? If there are 1000 different keywords in the
data set, in total how many cuboids are there in the data cube? (15 points)
2. Consider a set of photos enhanced by some attributes, such as location and time. If you
have an AI tool that can identify people in photos, can you suggest an interesting OLAP
query using people as a dimension? (5 points)
3. Now, consider a data set of photos extracted from newspapers and magazines. Suppose
for each photo, the caption, which is a brief description of the photo, is also extracted,
processed and stored. Can you propose 3 interesting OLAP queries that roll up and drill
down using both image and text information? What are the dimensions here? (10 points)
Question 2 (20 points)
This question is about bitmap index and bit-sliced index.
Canada has 13 provinces and territories. To support OLAP queries selecting all records in one or
a subset of provinces/territories, a straightforward way to build a bitmap index uses 13 bits per
record, one bit per province/territory. Can you design a way to use less bits? Describe your design
using two examples using a table T(Province_Territory, Sales). In the first example, calculate the
total sales in BC. In the second example, calculate the total sales in BC, ON, and NT altogether.
How many bits per record does your index need?
Question 3 (20 points)
This question is about implementing data cubes in big data architecture.
Please learn MapReduce techniques by yourself. For example, the following Map and Reduce
procedures transform a set of documents into an inverted index of keywords. The example is
explained in Part 1 of Indexing and Ranking, CMPT 456 <https://youtu.be/8asVyc56ks4>.
Consider a table T (D1, D2, D3, D4, M), where D1, D2, D3, and D4 are dimensions, and M is the
1. Can you describe a way to compute the data cube using MapReduce? Please give the
Map procedure and the Reduce procedure. (15 points)
2. What is the communication cost of your method? That is, what is the total number of
key-value pairs the mappers emit? (5 points)
Question 4 (30 points)
This question is about multidimensional analysis in practice.
Download the University Advancement, Donations, and Giving data set from <
emo.xls>. The data set is a table of donations made to universities in the United States. The
donation amounts and locations in this data set are not real as they are intended for training
purposes only. Ignore the attributes Gift Data and Prospect ID, use Gift Amount as the measure
attribute, and the other attributes as dimensions in this question. Use SUM() as the aggregate
1. Which state has the largest total gift amount? Drill down to find out whether the large
total amount is due to one college that has an exceptionally large total amount or a city
that has an exceptional large total amount, or both. Report your findings and specify how
you measure whether a college or a city has an exceptional large (total) amount. What
statistic do you use? (10 points)
2. A tuple t1 is said a child of another tuple t if t generalizes a dimension of t1. For example,
t1 = (Alumni Association, New York, *, *, NY) is a child of t = (Alumni Association, *, *, *,
NY). Write a program to compute all pairs (t, t’) such that t’ is a child of t, and t.SUM() >=
3 x t’.SUM(). Describe your algorithm, provide your source code with proper comments,
and list your results. There are many implementations of data cube computation available
on the web, such as the Python program available at <
BB7071AB058296>. Feel free to build your solution on top of those programs, as long as
you make proper references in your submission. (20 points)
本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！
E-mail: [email protected] 微信:itcsdx