Database Coursework Help | Assignment 2 – Hadoop and HBase – Distributed Systems

This Australian coursework task is mainly an assignment on Hadoop and HBase.

Assignment Overview
In this assignment, you are asked to write a report to describe how the following tasks are done.
(1) Fragment the tweet objects in 4tweets.json and distribute the fragments in the specified Hadoop nodes.
(2) Write Map/Reduce algorithms to compute query results for the query you select.
(3) Design HBase tables to store the tweet objects in HBase.
(4) Write an HBase algorithm to compute the query result in HBase.

This type of report supplies critical information that helps an organization understand and decide whether a new system like
Hadoop or HBase is suitable for its business. This assignment asks you to conduct an analysis for this purpose.
The materials that you use include the Week 8 lecture slides, the Week 9 tutorial, the recommended videos, and other internet
resources and/or books that you can find. Note that you must NOT copy these materials; otherwise, you commit plagiarism
and the university's formal plagiarism procedure will apply.
You complete the assignment together with two other students whom you choose yourself (in groups of three). If you do the
assignment on your own or in a group of two, you must still complete all tasks correctly to get full marks.
You must complete the tasks described in the next section in order. Otherwise, your discussion and references will be
ambiguous.
Make sure that the file people.txt is edited to have correct information.
One submission from a group is sufficient for all group members to receive marks. Group members receive the same marks
unless the group has an alternative agreement.

Application and requirements
twitterdata.zip contains a large number of Twitter tweets. This file is called the twitterdata dataset. Four tweets from this
dataset were sampled, simplified, and put in the file 4tweets.json. The simplified tweets are called tweet objects. Each
tweet object is either an original tweet, or a retweet. If it is a retweet, the source tweet (either original or the preceding
retweet) is in the retweeted_status attribute.
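The retweet structure described above can be illustrated with a minimal Python sketch. Only the retweeted_status attribute comes from the assignment text; the id and text fields, the sample objects, and the helper name original_tweet are hypothetical, and the sketch assumes that a source tweet which is itself a retweet carries its own retweeted_status.

```python
def original_tweet(tweet):
    """Follow retweeted_status links until a tweet with no source is reached."""
    current = tweet
    # Assumption: a retweet nests its source tweet as a dict under retweeted_status.
    while isinstance(current.get("retweeted_status"), dict):
        current = current["retweeted_status"]
    return current


# Hypothetical tweet objects; the real fields in 4tweets.json may differ.
original = {"id": 1, "text": "hello"}
retweet = {"id": 2, "text": "RT hello", "retweeted_status": original}
retweet_of_retweet = {"id": 3, "text": "RT RT hello", "retweeted_status": retweet}

print(original_tweet(retweet_of_retweet)["id"])
```

A tweet without retweeted_status is returned unchanged, so the same helper works for original tweets and for retweets.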
1. Assume that a Hadoop system has three slave nodes identified by n1, n2, and n3, and that all the nodes belong to one rack.
You partition the tweet objects in 4tweets.json into 3 fragments (without breaking any tweet object) and decide which
nodes store the fragments. Each fragment should be stored as 2 copies.
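One possible way to reason about this step is sketched below in Python. It is a design illustration only, not an HDFS API call: in a real cluster the NameNode chooses replica locations itself. The round-robin placement rule, the fragment names f1 to f3, the helper names, and the placeholder tweet objects are all assumptions made for the example.

```python
NODES = ["n1", "n2", "n3"]   # slave nodes named in the task
REPLICAS = 2                 # each fragment is stored as 2 copies


def fragment(tweets, n_fragments):
    """Split whole tweet objects into n_fragments contiguous groups."""
    base, extra = divmod(len(tweets), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        # The first `extra` fragments take one additional tweet each.
        size = base + (1 if i < extra else 0)
        fragments.append(tweets[start:start + size])
        start += size
    return fragments


def place(n_fragments, nodes=NODES, replicas=REPLICAS):
    """Assumed round-robin rule: fragment i goes to node i and the next node(s)."""
    return {
        f"f{i + 1}": [nodes[(i + r) % len(nodes)] for r in range(replicas)]
        for i in range(n_fragments)
    }


# Placeholder objects standing in for the four tweets in 4tweets.json.
tweets = [{"id": k} for k in range(1, 5)]
frags = fragment(tweets, 3)
placement = place(len(frags))

for name, holders in placement.items():
    print(name, holders)
```

Under this rule every fragment sits on two distinct nodes and every node holds two fragment copies, so losing any single node still leaves one copy of each fragment available.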