Project 1 Inverted Index
ASSOCIATED ASSIGNMENTS
Project 1 Functionality
Project 1 Code Review
For this project, you will write a Java program that processes all text files in a directory and its
subdirectories, cleans and parses the text into word stems, and builds an in-memory inverted
index to store the mapping from word stems to the documents and positions within those
documents where those word stems were found.
For example, suppose we have the following mapping stored in our inverted index:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
This indicates that after processing the word stems from files, the word capybara is found in
the file input/mammals.txt in position 11. The word platypus is found in two files,
input/mammals.txt and input/dangerous/venomous.txt. In the file
input/mammals.txt, the word platypus appears twice, in positions 3 and 8. In the file
input/dangerous/venomous.txt, the word platypus is in position 2.
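A mapping like the one in this example could be held in nested sorted collections. The following is only a minimal sketch under that assumption; the class name InvertedIndexSketch and its methods are hypothetical and not part of the assignment.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: word stem -> (file path -> sorted positions). */
class InvertedIndexSketch {
    // TreeMap and TreeSet keep stems, paths, and positions in sorted order.
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    /** Records that a stem occurs at a 1-based position in a file. */
    boolean add(String stem, String path, int position) {
        return index
                .computeIfAbsent(stem, key -> new TreeMap<>())
                .computeIfAbsent(path, key -> new TreeSet<>())
                .add(position);
    }

    /** Returns an unmodifiable view of the positions, or an empty set if none. */
    Set<Integer> positions(String stem, String path) {
        TreeMap<String, TreeSet<Integer>> paths = index.get(stem);
        if (paths == null || !paths.containsKey(path)) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(paths.get(path));
    }

    @Override
    public String toString() {
        return index.toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("platypus", "input/mammals.txt", 8);
        index.add("platypus", "input/mammals.txt", 3);
        index.add("platypus", "input/dangerous/venomous.txt", 2);
        index.add("capybara", "input/mammals.txt", 11);
        System.out.println(index);
    }
}
```

Because the collections are sorted, the insertion order of the four add calls does not affect the order in which stems, paths, and positions are later read back.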
The process of stemming reduces a word to a base form (or “stem”), so that words like
interesting, interested, and interests all map to the stem interest. Stemming
is a common preprocessing step in many web search engines.
Functionality
The core functionality of your project must satisfy the following requirements:
* Process command-line arguments to determine the input to process and output to
  produce. See the Input and Output sections below for specifics.
* Create a custom inverted index data structure that stores a mapping from a word stem to
  the file path(s) where the word was found, and the position(s) in that file where the word is
  located. The positions should start at 1. This will require nesting multiple built-in data
  structures.
* If provided a directory as input, find all files within that directory and all subdirectories and
  parse each text file found. Any files that end in the .text or .txt extension (case
  insensitive) should be considered a text file. If provided a single file as input, only parse that
  individual file.
  * Use the UTF-8 character encoding for all file processing, including reading and
    writing.
* Process text files into word stems by removing any non-letter symbols (including digits,
  punctuation, accents, special characters), converting the remaining alphabetic characters to
  lowercase, splitting the text into words by whitespace, and then stemming each word using the
  Apache OpenNLP toolkit.
  * Use the regular expression (?U)[^\\p{Alpha}\\p{Space}]+ to remove special
    characters from text.
  * Use the regular expression (?U)\\p{Space}+ to split text into words by whitespace.
  * Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
* If the appropriate command-line arguments are provided, output the inverted index in pretty
  JSON format. See the Output section below for specifics.
* Output user-friendly error messages in the case of exceptions or invalid input. Under no
  circumstance should your main() method output a stack trace to the user!
The functionality of your project will be evaluated with the Project1Test.java group of
JUnit tests.
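The cleaning and splitting steps in the requirements above can be sketched with the two regular expressions given there. This is only an illustration under stated assumptions: the class TextParserSketch is hypothetical, the Unicode normalization step is one assumed way to make accents removable, and the stemmer is passed in as a plain function so the example runs without OpenNLP. In a real project the OpenNLP SnowballStemmer would take the place of the identity function.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

/** Hypothetical sketch of cleaning, splitting, and stemming text. */
class TextParserSketch {
    // Matches runs of characters that are neither letters nor whitespace.
    static final Pattern CLEAN_REGEX = Pattern.compile("(?U)[^\\p{Alpha}\\p{Space}]+");

    // Matches runs of whitespace, used to split cleaned text into words.
    static final Pattern SPLIT_REGEX = Pattern.compile("(?U)\\p{Space}+");

    /** Removes non-letter symbols and converts the rest to lowercase. */
    static String clean(String text) {
        // Assumption: decomposing first separates accent marks from their letters.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return CLEAN_REGEX.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
    }

    /** Cleans and splits the text, then applies the given stemmer to each word. */
    static List<String> parse(String text, UnaryOperator<String> stemmer) {
        List<String> stems = new ArrayList<>();
        for (String word : SPLIT_REGEX.split(clean(text))) {
            // Leading whitespace or blank input produces an empty token; skip it.
            if (!word.isEmpty()) {
                stems.add(stemmer.apply(word));
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        // The identity function is a stand-in for a real stemmer.
        List<String> stems = parse("Platypuses aren't capybaras!", UnaryOperator.identity());
        for (int i = 0; i < stems.size(); i++) {
            // Positions start at 1, as the requirements specify.
            System.out.println((i + 1) + ": " + stems.get(i));
        }
    }
}
```

Keeping the stemmer as a parameter also makes the cleaning and splitting logic easy to test on its own before the stemming library is wired in.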
Input
Your main method must be placed in a class named Driver. The Driver class should
accept the following command-line arguments:
* -path path where the flag -path indicates the next argument is a path to either a
  single text file or a directory of text files that must be processed and added to the inverted
  index
* -index path where the flag -index is an optional flag that indicates the next
  argument is the path to use for the inverted index output file. If the path argument is not
  provided, use index.json as the default output path. If the -index flag is not
  provided, do not produce an output file.
The command-line flag/value pairs may be provided in any order. Do not convert paths to
absolute form when processing command-line input!
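One way to handle these flag/value pairs is to collect them into a map, storing null for a flag that has no value so a default can be substituted later. The sketch below assumes that any argument starting with a dash is a flag; the class ArgumentMapSketch and its methods are hypothetical names, not something the assignment prescribes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of parsing command-line arguments into flag/value pairs. */
class ArgumentMapSketch {
    private final Map<String, String> map = new HashMap<>();

    ArgumentMapSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (isFlag(args[i])) {
                // A value is present only if the next argument exists and is not a flag.
                boolean hasValue = i + 1 < args.length && !isFlag(args[i + 1]);
                map.put(args[i], hasValue ? args[++i] : null);
            }
        }
    }

    /** Assumption: a flag is a dash followed by at least one more character. */
    static boolean isFlag(String arg) {
        return arg != null && arg.length() > 1 && arg.startsWith("-");
    }

    boolean hasFlag(String flag) {
        return map.containsKey(flag);
    }

    /** Returns the value for the flag, or the default if the flag or value is missing. */
    String getString(String flag, String defaultValue) {
        String value = map.get(flag);
        return value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        ArgumentMapSketch parsed = new ArgumentMapSketch(args);
        String input = parsed.getString("-path", null);
        if (input == null) {
            // A friendly message instead of a stack trace.
            System.out.println("Please provide an input path with the -path flag.");
            return;
        }
        System.out.println("Input path: " + input);
        if (parsed.hasFlag("-index")) {
            System.out.println("Index output: " + parsed.getString("-index", "index.json"));
        }
    }
}
```

Distinguishing "flag absent" (hasFlag is false) from "flag present without a value" (hasFlag is true, default used) is what lets -index alone fall back to index.json while a missing -index produces no output file.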
Output
All output will be produced in “pretty” JSON format using 2 space characters for indentation.
According to the JSON standard, numbers like integers should never be quoted. Any string or
object key, however, should always be surrounded by " quotes. Objects (similar to maps)
should use curly braces { and } and arrays should use square brackets [ and ]. Make
sure there are no trailing commas after the last element.
The paths should be output in the form they were originally provided. The tests use normalized
relative paths, so the output should also be normalized relative paths. As long as command-line
parameters are not converted to absolute form, this should be the default output provided by the
path object.
The contents of your inverted index should be output in alphabetically sorted order as a nested
JSON object using a “pretty” format. For example:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
This output should look similar to that of one of your homework assignments. You might be
able to use it directly, depending on how you set up your project code!
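A writer for this nested format could look roughly like the sketch below. It assumes the index is stored as nested sorted maps and that keys contain no characters needing JSON escaping (quotes or backslashes); the class PrettyJsonSketch and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Collection;
import java.util.Map;

/** Hypothetical sketch of writing a nested index as pretty JSON. */
class PrettyJsonSketch {
    /** Writes two spaces per indentation level. */
    static void indent(Writer writer, int level) throws IOException {
        for (int i = 0; i < level; i++) {
            writer.write("  ");
        }
    }

    /** Writes positions as a JSON array, one unquoted number per line. */
    static void writeArray(Collection<Integer> elements, Writer writer, int level) throws IOException {
        writer.write("[");
        boolean first = true;
        for (Integer element : elements) {
            // The comma goes before every element except the first, so none trails.
            writer.write(first ? "\n" : ",\n");
            first = false;
            indent(writer, level + 1);
            writer.write(element.toString());
        }
        writer.write("\n");
        indent(writer, level);
        writer.write("]");
    }

    /** Writes a map from file path to positions as a JSON object. */
    static void writePaths(Map<String, ? extends Collection<Integer>> paths, Writer writer, int level)
            throws IOException {
        writer.write("{");
        boolean first = true;
        for (Map.Entry<String, ? extends Collection<Integer>> entry : paths.entrySet()) {
            writer.write(first ? "\n" : ",\n");
            first = false;
            indent(writer, level + 1);
            // Assumption: keys need no escaping.
            writer.write("\"" + entry.getKey() + "\": ");
            writeArray(entry.getValue(), writer, level + 1);
        }
        writer.write("\n");
        indent(writer, level);
        writer.write("}");
    }

    /** Writes the whole index: stem -> (path -> positions). */
    static void writeIndex(Map<String, ? extends Map<String, ? extends Collection<Integer>>> index,
            Writer writer) throws IOException {
        writer.write("{");
        boolean first = true;
        for (Map.Entry<String, ? extends Map<String, ? extends Collection<Integer>>> entry : index.entrySet()) {
            writer.write(first ? "\n" : ",\n");
            first = false;
            indent(writer, 1);
            writer.write("\"" + entry.getKey() + "\": ");
            writePaths(entry.getValue(), writer, 1);
        }
        writer.write("\n}");
    }

    /** Convenience method that returns the JSON as a string. */
    static String toJson(Map<String, ? extends Map<String, ? extends Collection<Integer>>> index) {
        StringWriter writer = new StringWriter();
        try {
            writeIndex(index, writer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return writer.toString();
    }
}
```

To write to a file instead of a string, the same writeIndex method could be given a writer opened with Files.newBufferedWriter(path, StandardCharsets.UTF_8).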
Examples
The following are a few examples (non-comprehensive) to illustrate the usage of the command-line
arguments. Consider the following example:
java Driver -path "../../project-tests/Project Tests/input/text/simple/hello.txt" -index index-text-simple-hello.json
Note: The project tests account for different path separators (forward slash / for
Linux/Mac systems, and backward slash \ for Windows systems). Your code does
not have to convert between the two!
The above arguments indicate that Driver should build an inverted index from the single
hello.txt file in the input/text/simple subdirectory of the Project Tests directory, and
output the inverted index as JSON to the index-text-simple-hello.json file in the current
working directory.
java Driver -path "../../project-tests/Project Tests/input/text/simple/" -index
The above arguments indicate that Driver should build an inverted index from all of the text
files found in the input/text/simple subdirectory of the Project Tests directory, and output the
inverted index as JSON to the default path index.json in the current working directory.
java Driver -path "../../project-tests/Project Tests/input/text/simple/"
The above arguments indicate that Driver should build an inverted index from all of the
text files found in the input/text/simple subdirectory of the Project Tests directory, but
it should NOT produce an output file. It must still build the inverted index however! (This will be
useful in the future when we add the ability to search.)
Hints
It is important to develop the project iteratively. In fact, you may already have certain
components complete thanks to the homework assignments. One possible breakdown of tasks
is:
* Create code that handles parsing command-line arguments into flag/value pairs, and
  supports default values if a flag is missing a value.
* Create code that is able to traverse a directory and return a list of all the text files found
  within that directory. The File I/O (Featuring NIO.2) tutorials might be useful for this.
* Create code that is able to parse text into words, including converting that text to
  lowercase, replacing special characters and digits, splitting that text into words by
  whitespace, and finally stemming the words.
* Create code that handles storing a word, file path, and location into an inverted index data
  structure.
* Create code that is capable of writing a nested data structure (matching your inverted index
  data structure) to a file in JSON format.
Test each part of your code. Keep in mind the tests provided only test the final output of your
Java program. You are responsible for testing the individual components of your code.
Once you have the above components working, then you can work on integrating your code
together. This part must still be done iteratively! Do not try to combine all of the code at once.
First, work on being able to traverse a directory based on command-line parameters. Then, clean
the text files found and output the cleaned words to the console. Finally, add those words to the
index and write that index as JSON.
Note: These hints may or may not be useful depending on your approach. Do not be
overly concerned if you do not find these hints helpful for your approach for this
project.
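As an illustration of the directory traversal hint, the sketch below walks a directory with the NIO.2 Files.walk method and keeps regular files whose names end in .txt or .text, ignoring case. The class TextFileFinderSketch and its method names are hypothetical, and sorting the result is an extra choice made here only to keep the output order predictable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of finding text files in a directory tree. */
class TextFileFinderSketch {
    /** Returns true if the file name ends in .txt or .text, ignoring case. */
    static boolean hasTextExtension(Path path) {
        Path name = path.getFileName();
        if (name == null) {
            return false;
        }
        String lower = name.toString().toLowerCase(Locale.ROOT);
        return lower.endsWith(".txt") || lower.endsWith(".text");
    }

    /** Lists text files under a directory, or returns the input itself if it is a single file. */
    static List<Path> listTextFiles(Path input) throws IOException {
        if (!Files.isDirectory(input)) {
            // A single file is parsed regardless of its extension.
            return Collections.singletonList(input);
        }
        // The stream holds open directory handles, so close it when done.
        try (Stream<Path> paths = Files.walk(input)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(TextFileFinderSketch::hasTextExtension)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        Path input = Paths.get(args.length > 0 ? args[0] : ".");
        try {
            for (Path path : listTextFiles(input)) {
                System.out.println(path);
            }
        } catch (IOException e) {
            // A friendly message instead of a stack trace.
            System.out.println("Unable to list text files under: " + input);
        }
    }
}
```

The paths returned keep the form of the input path (relative input gives relative results), which matches the requirement not to convert paths to absolute form.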

