November 25, 2023

Python Big Data Hadoop Assignment Writing | CS 4417 Assignment 2

This big data assignment mainly involves processing data on Hadoop using Python.

Part 0: VM Setup

The Department of Computer Science has a cluster where Cloudera has been installed. Cloudera is a company that provides Apache Hadoop. Each student will be provided with a virtual machine (VM) that is hosted on the cluster.

To use the VM you need to retrieve the ssh key that is in your OWL dropbox folder. An example key looks like this:

cs4417-lab-xxxxxx.pem

where xxxxxx represents your email identifier.

You need to restrict the key's permissions so that only your own user account can read and write it; ssh refuses keys that other users can access. The command for doing so on a Mac or Linux machine is the following:

chmod 600 cs4417-lab-xxxxxx.pem

You can now use the key to ssh into the VM:

ssh -i cs4417-lab-xxxxxx.pem xxxxxx@cs4417-lab-xxxxxx.pem

You should get the following prompt:

[cs4417]>

At that prompt, enter the following command:

ssh cloudera@xxxxxx

The software you need to complete the assignment is on the VM. You do not need to install anything.

Part 1: Calculate the frequency of a term in each document (20 points)

Given a set of documents, calculate the frequency of a term in each document. The output should be the term, document and number of occurrences of the term in the document. This is different from the example presented in the lectures in that the example focused on one document. In this example, the final output should consist of pairs in the following form:

((term, document identifier), count)
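The following is a minimal sketch of how the two stages of Part 1 might look, not a reference solution. It makes several assumptions: the document identifier is the input file name, which Hadoop Streaming exposes to the mapper through the `mapreduce_map_input_file` environment variable; terms are lowercased runs of letters, digits and apostrophes; and the intermediate key is written as `term doc_id` in a single tab-delimited field so that the shuffle keeps each (term, document) pair together. The function names are hypothetical, and both stages are shown in one block for illustration even though the submission asks for separate mapper.py and reducer.py files.

```python
import os
import re
from itertools import groupby


def current_doc_id(default="unknown"):
    """Return the name of the file the mapper is reading (assumed doc id)."""
    path = os.environ.get("mapreduce_map_input_file", default)
    return os.path.basename(path)


def map_terms(lines, doc_id):
    """Emit 'term doc_id<TAB>1' for every term found in the input lines."""
    for line in lines:
        for term in re.findall(r"[a-z0-9']+", line.lower()):
            yield f"{term} {doc_id}\t1"


def reduce_counts(sorted_records):
    """Sum the counts of each (term, doc_id) key; input must be sorted by key."""
    parsed = (record.rstrip("\n").rsplit("\t", 1) for record in sorted_records)
    for key, group in groupby(parsed, key=lambda fields: fields[0]):
        term, doc_id = key.split(" ", 1)
        yield (term, doc_id), sum(int(fields[1]) for fields in group)
```

In a real mapper.py the loop would read `sys.stdin` and print each record; the reducer would read the sorted records from `sys.stdin` in the same way.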

Submission: You should submit a zip file with the name Part1.zip. When unzipped there should be two files: mapper.py and reducer.py.

Part 2: Count Bigrams (15 points)

Take the word count example and extend it to count bigrams, that is, sequences of two consecutive words.
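As an illustration, a bigram mapper could pair each word with its successor. This sketch assumes that words are lowercased runs of letters, digits and apostrophes, and that bigrams do not span line boundaries; neither choice is specified in the assignment, and the function names are hypothetical. The word count reducer can then be reused unchanged, because each bigram is emitted as a single key.

```python
import re


def bigrams(line):
    """Return the list of consecutive word pairs in one line of text."""
    words = re.findall(r"[a-z0-9']+", line.lower())
    # zip stops at the shorter sequence, so a one-word line yields no pairs
    return list(zip(words, words[1:]))


def map_bigrams(lines):
    """Emit 'first second<TAB>1' for every bigram in the input lines."""
    for line in lines:
        for first, second in bigrams(line):
            yield f"{first} {second}\t1"
```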

You should make use of Hadoop for this part.

Submission: You should submit a zip file with the name Part2.zip. When unzipped there should be two files: mapper.py and reducer.py.

Part 3: Count Unique Bigrams (15 points)

This is an extension of part 2 where you count the number of unique bigrams. One approach is to use two MapReduce passes. The first is what you did for Part 2 and the second is something you need to develop.
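One way to picture the second pass, assuming "unique" means distinct bigrams and that the first pass writes one `bigram<TAB>count` line per distinct bigram: the second mapper emits a constant key for every input line, and the second reducer adds those up. The key name `unique_bigrams` and the function names are placeholders chosen for this sketch.

```python
def map_unique(pass1_lines):
    """Emit one record under a constant key for each distinct bigram line."""
    for line in pass1_lines:
        if line.strip():  # skip blank lines
            yield "unique_bigrams\t1"


def reduce_unique(mapped_records):
    """Add up the records emitted by map_unique."""
    total = 0
    for record in mapped_records:
        _key, value = record.rstrip("\n").split("\t")
        total += int(value)
    return total
```

Because every record shares one key, a single reducer receives all of them. That is acceptable here since the first pass has already collapsed repeated bigrams.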

Submission: You should submit a zip file with the name Part3.zip. When unzipped there should be two files for each MapReduce pass, i: mapperi.py and reduceri.py. For example, if i is 1 then you should have mapper1.py and reducer1.py and if i is 2 then you should have mapper2.py and reducer2.py.

Part 4: Term-Frequency-Inverse Document Frequency in MapReduce (40 points)

The tf-idf metric is used to determine the importance of a word within a document. You are to write a program that uses the MapReduce paradigm. However, you do not have to use Hadoop for doing so. It is sufficient to use pipes to test the program.
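Testing with pipes usually means a shell pipeline of the form `cat input.txt | python mapper.py | sort | python reducer.py`, where `input.txt` is a hypothetical input file and `sort` stands in for Hadoop's shuffle. The same idea can be sketched in Python for quick checks; the helper `run_job` and the word count functions below are illustrative names only.

```python
from itertools import groupby


def word_mapper(lines):
    """Emit 'word<TAB>1' for each whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def sum_reducer(sorted_records):
    """Sum the values of consecutive records that share a key."""
    parsed = (record.split("\t") for record in sorted_records)
    for key, group in groupby(parsed, key=lambda fields: fields[0]):
        yield key, sum(int(fields[1]) for fields in group)


def run_job(map_fn, reduce_fn, lines):
    """Run one MapReduce pass locally: map, sort (the shuffle), then reduce."""
    intermediate = sorted(map_fn(lines))
    return list(reduce_fn(intermediate))
```

Chaining several `run_job` calls, each consuming the previous result, mimics a multi-job pipeline without a cluster.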

The formula of tf-idf for document d and term t is the following:

$$\text{tf-idf}_{t,d} = \frac{\mathrm{tf}_{t,d}}{N} \cdot \log_{10}\left(\frac{D}{\mathrm{df}_t}\right)$$

where $\mathrm{tf}_{t,d}$ is the number of occurrences of the term t in document d, N is the total number of words in document d, D is the total number of documents and $\mathrm{df}_t$ is the number of documents that the term t occurs in. There are variations of the formula, but you should use the one above because the test cases assume it.
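To make the formula concrete, here is a small sketch that evaluates it for a single term and document once the four quantities are known. The function and parameter names are chosen for illustration.

```python
import math


def tf_idf(tf, n_words, n_docs, df):
    """Compute (tf / N) * log10(D / df) for one term in one document."""
    return (tf / n_words) * math.log10(n_docs / df)
```

For example, a term that occurs 3 times in a 100-word document and appears in 1 of 10 documents scores (3 / 100) * log10(10) = 0.03, while a term that appears in every document scores 0 because log10(1) = 0.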

This requires multiple MapReduce jobs (more than 2). The first MapReduce (MR) job should calculate the term count for each term and document ($\mathrm{tf}_{t,d}$).

The second MR job should calculate $\mathrm{df}_t$ for each term.
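A possible shape for the second job, assuming the first job writes one line per (term, document) pair in the form `term doc_id<TAB>count`: since each pair appears exactly once, counting lines per term gives the document frequency. The record layout and function names are assumptions of this sketch, not part of the assignment.

```python
from itertools import groupby


def map_df(job1_lines):
    """Emit 'term<TAB>1' for each (term, document) line from the first job."""
    for line in job1_lines:
        key, _count = line.rstrip("\n").rsplit("\t", 1)
        term = key.split(" ", 1)[0]
        yield f"{term}\t1"


def reduce_df(sorted_records):
    """Count the documents containing each term; input must be sorted by term."""
    parsed = (record.rstrip("\n").split("\t") for record in sorted_records)
    for term, group in groupby(parsed, key=lambda fields: fields[0]):
        yield term, sum(int(fields[1]) for fields in group)
```

Note that this output drops the per-document counts, so a later job would still need the first job's output to combine both values.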

You should figure out the rest of the MR jobs needed.

For this assignment the number of documents is needed. You should have a file called inputParameters. This file should have one number which represents the number of documents. This will make it easier for the TAs to test.
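Reading the document count could look like the sketch below. It assumes the file sits in the working directory of the script that needs it and contains the number as its first whitespace-separated token; the function name is hypothetical.

```python
def read_document_count(path="inputParameters"):
    """Return the number of documents stored in the parameter file."""
    with open(path) as handle:
        # Take the first token so a trailing newline or spaces do not matter
        return int(handle.read().split()[0])
```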

Submission: You should submit a zip file with the name Part4.zip. When unzipped there should be two files for each MapReduce pass, i: mapperi.py and reduceri.py. For example, if i is 1 then you should have mapper1.py and reducer1.py and if i is 2 then you should have mapper2.py and reducer2.py.

Part 5: Writeup (10 points)

Please complete the following:

For each part describe the input and output for each MR job.

Please answer the following questions:
What would you like to have done given more time?
How difficult was it to implement? How difficult would it be to implement another task, given this experience? What would be straightforward? What would take more time?

IMPORTANT: Keep a copy of the assignment outside of the VM.
