
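This Python big data assignment is mainly about Hadoop: it requires using Python on a Cloudera cluster to develop text processing and to become familiar with the MapReduce programming model.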

Assignment 2 CS 4417

Due: March 8
Goal: The goal of this assignment is to gain familiarity and practical experience with index development for text processing and with the MapReduce programming model.
Programming Language: You may use either Java or Python. Python is strongly recommended.

Part 0: VM Setup

The Department of Computer Science has a cluster where Cloudera has been installed. Cloudera is a company that provides a distribution of Apache Hadoop. Each student will be provided with a virtual machine (VM) that is hosted on the cluster.

Each student needs to get an SSH key from the TA, Brett Douglas Davis. For example, a key may look like this:

cs4417-lab-1.pem

You should save the key and change its permissions so that it is readable only by you (ssh refuses to use a private key file that other users can access). The command for doing so is the following:

chmod 400 cs4417-lab-1.pem

You can now use the key to ssh into the VM:

ssh -i cs4417-lab-1.pem [email protected]

Part 1: Inverted Index (25 points)

An inverted index is a mapping from words to their locations in a set of documents. Most modern search engines use some form of inverted index to process user-submitted queries.

In class we discussed variations of inverted indices. Your goal is to build an inverted index that supports the queries described in Part 2. This means that you will need a positional inverted index that maps a word to locations in a set of documents.

You may use either Python or Java. We strongly recommend the use of Python. Your inverted index should have these characteristics:

All words in the index should be lower case. This means that you need to convert all input words to lower case.

No punctuation, numbers, or symbols should be represented in the index.
These stop words should not be included in the index: and, but, is, the, to. You may use any method you want to support this.

Your program should take a set of files as input. You may assume that all of these files are in a single directory. Although stop words are not to be included in the index, they should not affect the positions of the other words. For example, assume that your file consists of the words The Game of Thrones. The word Game is at position 2 and the word Thrones is at position 4.
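The characteristics above can be sketched in Python as follows. This is only a minimal illustration, not a reference solution: the names STOP_WORDS, tokenize, and build_index are hypothetical, and the sketch assumes that tokens are separated by whitespace, that only the ASCII letters a-z are kept, and that a token with no letters left (pure punctuation or digits) does not occupy a position. The assignment does not specify these details, so a real solution may need different choices.

```python
import re
from collections import defaultdict
from pathlib import Path

# Stop words listed in the assignment text.
STOP_WORDS = {"and", "but", "is", "the", "to"}


def tokenize(text):
    """Yield (position, word) pairs; positions are 1-based and count stop words."""
    position = 0
    for raw in text.split():
        # Lower-case the token and keep letters only, which drops
        # punctuation, digits, and symbols.
        word = re.sub(r"[^a-z]", "", raw.lower())
        if not word:
            continue  # assumption: a token with no letters takes no position
        position += 1
        if word not in STOP_WORDS:
            yield position, word


def build_index(directory):
    """Return word -> {file name -> [positions]} for every file in `directory`."""
    index = defaultdict(lambda: defaultdict(list))
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            for position, word in tokenize(text):
                index[word][path.name].append(position)
    return index
```

With the example from the text, tokenize("The Game of Thrones") skips The but still counts it, so game is reported at position 2 and thrones at position 4.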

Part 2: Querying Inverted Index (25 points)

Write a query program that queries your inverted index. You should be able to support the following:

· Boolean search queries, which return the documents that satisfy the specified condition. You should be able to support AND and OR, as well as combinations of the two.

· For any word provided by a user, return the files that contain the word and, for each occurrence of the word in a file, its position within that file.
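Both kinds of query can be sketched over an index shaped as word -> {file name -> [positions]}. The sketch below is an assumption-laden illustration: the function names and SAMPLE_INDEX are hypothetical, the operators are assumed to be written in upper case with single spaces around them, AND is assumed to bind tighter than OR, and parentheses are not supported. The assignment does not fix a query syntax.

```python
# Hypothetical sample index: word -> {file name -> [positions]}.
SAMPLE_INDEX = {
    "game": {"a.txt": [2], "b.txt": [1, 7]},
    "thrones": {"a.txt": [4]},
    "winter": {"c.txt": [3]},
}


def docs_for(index, word):
    """Return the set of files that contain `word` (case-insensitive)."""
    return set(index.get(word.strip().lower(), {}))


def boolean_query(index, query):
    """Evaluate a query such as 'game AND thrones OR winter'.

    Assumption: AND binds tighter than OR, so the query is a union (OR)
    of intersections (AND).
    """
    result = set()
    for clause in query.split(" OR "):
        terms = [term for term in clause.split(" AND ") if term.strip()]
        if not terms:
            continue
        matches = docs_for(index, terms[0])
        for term in terms[1:]:
            matches &= docs_for(index, term)
        result |= matches
    return result


def positions(index, word):
    """Return {file name: [positions]} for every file containing `word`."""
    postings = index.get(word.strip().lower(), {})
    return {doc: list(found) for doc, found in postings.items()}
```

For example, boolean_query(SAMPLE_INDEX, "game AND thrones OR winter") intersects the files for game and thrones and then adds the files for winter, while positions(SAMPLE_INDEX, "Game") lists every position of game per file.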

Part 3: Inverted Index – MapReduce (30 points)

You are now to create an inverted index using MapReduce. You should be able to support Boolean search queries.
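One possible key/value design can be tried locally before running anything on the cluster. The sketch below simulates a single MapReduce pass in plain Python and does not use Hadoop itself: the mapper emits the word as the key and a (document, position) pair as the value, and the reducer collects the postings for each word. All names are hypothetical, and the tokenization rules (whitespace-separated tokens, ASCII letters only) are assumptions rather than requirements of the assignment.

```python
from collections import defaultdict
from itertools import groupby

STOP_WORDS = {"and", "but", "is", "the", "to"}


def map_document(doc_id, text):
    """Map phase: emit (word, (doc_id, position)) for each indexed word."""
    position = 0
    for raw in text.split():
        word = "".join(ch for ch in raw.lower() if "a" <= ch <= "z")
        if not word:
            continue  # assumption: tokens without letters take no position
        position += 1
        if word not in STOP_WORDS:
            yield word, (doc_id, position)


def reduce_word(word, values):
    """Reduce phase: merge all (doc_id, position) values for one word."""
    postings = defaultdict(list)
    for doc_id, position in values:
        postings[doc_id].append(position)
    return word, {doc: sorted(found) for doc, found in postings.items()}


def run_job(documents):
    """Simulate one pass locally: map, sort by key (the shuffle), reduce."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_document(doc_id, text)
    ]
    pairs.sort(key=lambda pair: pair[0])
    return dict(
        reduce_word(word, (value for _, value in group))
        for word, group in groupby(pairs, key=lambda pair: pair[0])
    )
```

Calling run_job({"a.txt": "The Game of Thrones"}) yields an index in the same word -> {file -> [positions]} shape, so the Boolean queries from Part 2 could be answered from its output.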

Part 4: Writeup (20 points)

Please answer the following questions:

Describe the design you used for Parts 1, 2, and 3. For Part 3, this should include the keys and values you used.

For Parts 1 and 2, what else would you have done in the inverted index implementation, given more time, energy, resources, etc.?

For Parts 1 and 2, how difficult was it to implement the inverted index? How difficult would it be to implement another task, given this experience? What would be straightforward? What would take more time?

For Part 3, how many passes of MapReduce did you use?

Part 5: Evaluation

Part of our evaluation of Parts 1, 2, and 3 is through testing. Our test cases will NOT be made available to you before submission. It is your responsibility to test extensively.

Part 6: Deliverables

Submissions will only be accepted through OWL. Your submission must include:

  1. All code
  2. A PDF of Part 4 (the writeup).

IMPORTANT: Keep a copy of your assignment outside of the VM.

Hints:

  1. The Python collections module implements high-performance container datatypes and contains many useful data structures that you can use to store information in memory. You might find defaultdict to be useful. Java also provides dictionary support through the java.util.Map interface and implementations such as java.util.HashMap (the older java.util.Dictionary class is obsolete).
  2. In Part 1 you will create an inverted index. You should be able to write this to disk. The contents of this file will need to be read into memory for query analysis (Part 2). You might find Python’s pickle and json modules useful.
  3. The Hadoop commands are a bit tedious. Useful mechanisms include Makefiles or shell scripts that take parameters.
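Hints 1 and 2 can be combined as in the minimal sketch below, which writes an index to disk with json and reads it back for querying. The function names are hypothetical. The sketch assumes the index is a nested defaultdict of the form word -> {file name -> [positions]}; json is used here because pickle cannot serialize a defaultdict whose default factory is a lambda.

```python
import json
from collections import defaultdict


def save_index(index, path):
    """Write the index to `path` as JSON (defaultdicts serialize as objects)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)


def load_index(path):
    """Read the index back as plain nested dicts and lists."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def example_index():
    """Build a tiny hypothetical index to demonstrate the round trip."""
    index = defaultdict(lambda: defaultdict(list))
    index["game"]["a.txt"].append(2)
    index["thrones"]["a.txt"].append(4)
    return index
```

Note that the loaded index consists of ordinary dicts, so a lookup of a missing word should use index.get(word, {}) rather than relying on defaultdict behaviour.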