本次Python代写是在关系数据上定义数据仓库并对其他一些数据进行多维分析

CMPT 459 Assignment 2

Due: 11:59 pm, June 19, 2020

100 points in total

This assignment covers the materials in data warehousing and OLAP.

Question 1 (30 points)

This question is about the concept of data warehousing and OLAP.

You learn how to define a data warehouse on relational data. Let us try to extend the idea of

multi-dimensional analysis to some other kinds of data.

1. Consider a data set containing tweets, and assume the text is properly pre-processed. If

we use each keyword (token) as a dimension, use each tweet as a record, and use COUNT()

as the aggregate function, can you give 3 aggregate queries Q1, Q2, and Q3, such that Q1

is a roll-up of Q2, and Q3 is a drill-down of Q1? If there are 1000 different keywords in the

data set, in total how many cuboids are there in the data cube? (15 points)

2. Consider a set of photos enhanced by some attributes, such as location and time. If you

have an AI tool that can identify people in photos, can you suggest an interesting OLAP

query using people as a dimension? (5 points)

3. Now, consider a data set of photos extracted from newspapers and magazines. Suppose

for each photo, the caption, which is a brief description of the photo, is also extracted,

processed and stored. Can you propose 3 interesting OLAP queries that roll up and drill

down using both image and text information? What are the dimensions here? (10 points)

Question 2 (20 points)

This question is about bitmap index and bit-sliced index.

Canada has 13 provinces and territories. To support OLAP queries selecting all records in one or

a subset of provinces/territories, a straightforward way to build a bitmap index uses 13 bits per

record, one bit per province/territory. Can you design a way to use less bits? Describe your design

using two examples using a table T(Province_Territory, Sales). In the first example, calculate the

total sales in BC. In the second example, calculate the total sales in BC, ON, and NT altogether.

How many bits per record does your index need?

Question 3 (20 points)

This question is about implementing data cubes in big data architecture.

Please learn MapReduce techniques by yourself. For example, the following Map and Reduce

procedures transform a set of documents into an inverted index of keywords. The example is

explained in Part 1 of Indexing and Ranking, CMPT 456 <https://youtu.be/8asVyc56ks4>.

Consider a table T (D1, D2, D3, D4, M), where D1, D2, D3, and D4 are dimensions, and M is the

measure attribute.

1. Can you describe a way to compute the data cube using MapReduce? Please give the

Map procedure and the Reduce procedure. (15 points)

2. What is the communication cost of your method? That is, what is the total number of

key-value pairs the mappers emit? (5 points)

Question 4 (30 points)

This question is about multidimensional analysis in practice.

Download the University Advancement, Donations, and Giving data set from <

https://public.tableau.com/s/sites/default/files/media/advancement_donations_and_giving_d

emo.xls>. The data set is a table of donations made to universities in the United States. The

donation amounts and locations in this data set are not real as they are intended for training

purposes only. Ignore the attributes Gift Data and Prospect ID, use Gift Amount as the measure

attribute, and the other attributes as dimensions in this question. Use SUM() as the aggregate

function.

1. Which state has the largest total gift amount? Drill down to find out whether the large

total amount is due to one college that has an exceptionally large total amount or a city

that has an exceptional large total amount, or both. Report your findings and specify how

you measure whether a college or a city has an exceptional large (total) amount. What

statistic do you use? (10 points)

2. A tuple t1 is said a child of another tuple t if t generalizes a dimension of t1. For example,

t1 = (Alumni Association, New York, *, *, NY) is a child of t = (Alumni Association, *, *, *,

NY). Write a program to compute all pairs (t, t’) such that t’ is a child of t, and t.SUM() >=

3 x t’.SUM(). Describe your algorithm, provide your source code with proper comments,

and list your results. There are many implementations of data cube computation available

on the web, such as the Python program available at <

http://programmersought.com/article/23211379442/;jsessionid=1F49F81E3D95A8A13C

BB7071AB058296>. Feel free to build your solution on top of those programs, as long as

you make proper references in your submission. (20 points)

**程序代写代做C/C++/JAVA/安卓/PYTHON/留学生/PHP/APP开发/MATLAB**

本网站支持淘宝 支付宝 微信支付 paypal等等交易。如果不放心可以用淘宝交易！

**E-mail:** [email protected] **微信:**itcsdx

如果您使用手机请先保存二维码，微信识别。如果用电脑，直接掏出手机果断扫描。