Project 1 Inverted Index
For this project, you will write a Java program that processes all text files in a directory and its
subdirectories, cleans and parses the text into word stems, and builds an in-memory inverted
index to store the mapping from word stems to the documents and the positions within those
documents where those word stems were found.
For example, suppose we have the following mapping stored in our inverted index:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
This indicates that after processing the word stems from files, the word capybara is found in
the file input/mammals.txt in position 11. The word platypus is found in two files,
input/mammals.txt and input/dangerous/venomous.txt. In the file
input/mammals.txt, the word platypus appears twice, in positions 3 and 8. In the file
input/dangerous/venomous.txt, the word platypus is in position 2.
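A mapping like the one described above can be held in nested sorted collections. The following is only a minimal sketch, not the required design: the class name IndexSketch and its methods add and positions are hypothetical, and the choice of TreeMap and TreeSet is an assumption made so that stems, paths, and positions stay in sorted order.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a nested inverted index (all names are illustrative). */
class IndexSketch {
    // stem -> (file path -> sorted positions); sorted maps keep iteration order stable
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    /** Records that a stem was found at a 1-based position in a file. */
    public void add(String stem, String path, int position) {
        index.computeIfAbsent(stem, k -> new TreeMap<>())
             .computeIfAbsent(path, k -> new TreeSet<>())
             .add(position);
    }

    /** Returns an unmodifiable view of the positions, or an empty set if absent. */
    public Set<Integer> positions(String stem, String path) {
        TreeMap<String, TreeSet<Integer>> files = index.get(stem);
        if (files == null || !files.containsKey(path)) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(files.get(path));
    }

    @Override
    public String toString() {
        return index.toString();
    }

    public static void main(String[] args) {
        IndexSketch sketch = new IndexSketch();
        sketch.add("capybara", "input/mammals.txt", 11);
        sketch.add("platypus", "input/mammals.txt", 3);
        sketch.add("platypus", "input/mammals.txt", 8);
        sketch.add("platypus", "input/dangerous/venomous.txt", 2);
        System.out.println(sketch);
    }
}
```

Returning an unmodifiable view from positions keeps callers from changing the index behind its back, which is one possible way to keep the structure encapsulated.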
The process of stemming reduces a word to a base form (or “stem”), so that words like
interesting , interested , and interests all map to the stem interest . Stemming
is a common preprocessing step in many web search engines.
The core functionality of your project must satisfy the following requirements:
Process command-line arguments to determine the input to process and output to
produce. See the Input and Output sections below for specifics.
Create a custom inverted index data structure that stores a mapping from a word stem to
the file path(s) where the word was found, and the position(s) in each file where the word is
located. The positions should start at 1. This will require nesting multiple built-in data structures.
If provided a directory as input, find all files within that directory and all subdirectories and
parse each text file found. Any files that end in the .text or .txt extension (case
insensitive) should be considered a text file. If provided a single file as input, only parse that
individual file.
Use the UTF-8 character encoding for all file processing, including reading and writing.
Process text files into word stems by removing any non-letter symbols (including digits,
punctuation, accents, special characters), converting the remaining alphabetic characters to
lowercase, splitting the text into words by whitespace, and then stemming each word using the
Apache OpenNLP toolkit.
Use the regular expression (?U)[^\\p{Alpha}\\p{Space}]+ to remove special
characters from text.
Use the regular expression (?U)\\p{Space}+ to split text into words by whitespace.
Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
If the appropriate command-line arguments are provided, output the inverted index in pretty
JSON format. See the Output section below for specifics.
Output user-friendly error messages in the case of exceptions or invalid input. Under no
circumstance should your main() method output a stack trace to the user!
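The cleaning and splitting steps in the requirements above could be sketched roughly as follows, using the two regular expressions given there. This is an illustration under assumptions: the class name CleanSketch and its methods are hypothetical, the use of java.text.Normalizer to separate accents from letters before cleaning is an assumed approach, and stemming is left out so the sketch runs with the standard library alone.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of the clean-and-split steps (stemming is left out). */
class CleanSketch {
    // The two regular expressions listed in the requirements
    private static final Pattern CLEAN = Pattern.compile("(?U)[^\\p{Alpha}\\p{Space}]+");
    private static final Pattern SPLIT = Pattern.compile("(?U)\\p{Space}+");

    /** Removes non-letter symbols and lowercases the remaining text. */
    public static String clean(String text) {
        // Assumption: decompose accented letters first so accent marks can be stripped
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return CLEAN.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
    }

    /** Cleans a line and splits it into words, skipping empty tokens. */
    public static List<String> parse(String line) {
        List<String> words = new ArrayList<>();
        for (String word : SPLIT.split(clean(line).trim())) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(parse("  Hello, World! It's 2 capybaras.  "));
    }
}
```

Each word returned by parse would then be passed to the OpenNLP stemmer, and its 1-based position in the file recorded in the index.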
The functionality of your project will be evaluated with the group of
JUnit tests.
Your main method must be placed in a class named Driver . The Driver class should
accept the following command-line arguments:
-path path where the flag -path indicates the next argument is a path to either a
single text file or a directory of text files that must be processed and added to the inverted
index.
-index path where the flag -index is optional and indicates that the next
argument is the path to use for the inverted index output file. If the path argument is not
provided, use index.json as the default output path. If the -index flag is not
provided, do not produce an output file.
The command-line flag/value pairs may be provided in any order. Do not convert paths to
absolute form when processing command-line input!
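One way to handle flag/value pairs in any order is to collect them into a map first and look them up afterward. The sketch below is a hedged illustration only: the class ArgsSketch and its methods are hypothetical names, and treating any argument that starts with a dash and has at least one more character as a flag is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of flag/value parsing (all names are illustrative). */
class ArgsSketch {
    private final Map<String, String> flags = new HashMap<>();

    /** Stores each flag with the value that follows it, or null if none follows. */
    public ArgsSketch(String[] args) {
        String current = null;
        for (String arg : args) {
            if (isFlag(arg)) {
                current = arg;
                flags.put(current, null);
            } else if (current != null) {
                flags.put(current, arg);
                current = null; // a flag takes at most one value
            }
            // values with no preceding flag are ignored
        }
    }

    /** Assumption: a flag is a dash followed by at least one more character. */
    public static boolean isFlag(String arg) {
        return arg != null && arg.length() > 1 && arg.startsWith("-");
    }

    /** Returns true if the flag appeared, with or without a value. */
    public boolean hasFlag(String flag) {
        return flags.containsKey(flag);
    }

    /** Returns the flag's value, or the fallback if the flag or its value is missing. */
    public String getString(String flag, String fallback) {
        String value = flags.get(flag);
        return value == null ? fallback : value;
    }

    public static void main(String[] args) {
        ArgsSketch parsed = new ArgsSketch(new String[] {"-index", "-path", "hello.txt"});
        if (parsed.hasFlag("-index")) {
            System.out.println("Output: " + parsed.getString("-index", "index.json"));
        }
        System.out.println("Input: " + parsed.getString("-path", null));
    }
}
```

Checking hasFlag before writing output separates the two cases the guide distinguishes: a -index flag with no value (use the default path) and no -index flag at all (produce no file). Values are kept as plain strings here, so nothing is converted to absolute form.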
All output will be produced in “pretty” JSON format using 2 space characters for indentation.
According to the JSON standard, numbers like integers should never be quoted. Any string or
object key, however, should always be surrounded by " quotes. Objects (similar to maps)
should use curly braces { and } and arrays should use square brackets [ and ] . Make
sure there are no trailing commas after the last element.
The paths should be output in the form they were originally provided. The tests use normalized
relative paths, so the output should also be normalized relative paths. As long as command-line
parameters are not converted to absolute form, this should be the default output provided by the
path object.
The contents of your inverted index should be output in alphabetically sorted order as a nested
JSON object using a “pretty” format. For example:
{
  "capybara": {
    "input/mammals.txt": [
      11
    ]
  },
  "platypus": {
    "input/dangerous/venomous.txt": [
      2
    ],
    "input/mammals.txt": [
      3,
      8
    ]
  }
}
This output should look similar to that of one of your homework assignments. You might be
able to use it directly, depending on how you set up your project code!
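The nested output format described above can be produced with a few small recursive-style helpers. What follows is a sketch under assumptions rather than the expected solution: the class JsonSketch and its methods are hypothetical, the index is assumed to be stored in nested TreeMap and TreeSet collections, and keys are assumed to contain no characters that need JSON escaping.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of a pretty JSON writer for a nested index. */
class JsonSketch {
    /** Appends two spaces per indentation level. */
    private static void indent(StringBuilder out, int level) {
        for (int i = 0; i < level; i++) {
            out.append("  ");
        }
    }

    /** Writes positions as an array of unquoted numbers, one per line. */
    private static void writeArray(TreeSet<Integer> positions, StringBuilder out, int level) {
        out.append("[");
        Iterator<Integer> iterator = positions.iterator();
        while (iterator.hasNext()) {
            out.append("\n");
            indent(out, level + 1);
            out.append(iterator.next());
            if (iterator.hasNext()) {
                out.append(","); // commas go between elements, never after the last
            }
        }
        out.append("\n");
        indent(out, level);
        out.append("]");
    }

    /** Writes one stem's files as an object of quoted paths mapped to arrays. */
    private static void writeFiles(TreeMap<String, TreeSet<Integer>> files, StringBuilder out, int level) {
        out.append("{");
        String separator = "";
        for (Map.Entry<String, TreeSet<Integer>> entry : files.entrySet()) {
            out.append(separator).append("\n");
            indent(out, level + 1);
            // Assumption: keys contain no characters that need JSON escaping
            out.append('"').append(entry.getKey()).append("\": ");
            writeArray(entry.getValue(), out, level + 1);
            separator = ",";
        }
        out.append("\n");
        indent(out, level);
        out.append("}");
    }

    /** Returns the whole index as a pretty JSON string. */
    public static String toJson(TreeMap<String, TreeMap<String, TreeSet<Integer>>> index) {
        StringBuilder out = new StringBuilder("{");
        String separator = "";
        for (Map.Entry<String, TreeMap<String, TreeSet<Integer>>> entry : index.entrySet()) {
            out.append(separator).append("\n");
            indent(out, 1);
            out.append('"').append(entry.getKey()).append("\": ");
            writeFiles(entry.getValue(), out, 1);
            separator = ",";
        }
        out.append("\n}");
        return out.toString();
    }
}
```

Writing the separator before each element after the first is one simple way to avoid a trailing comma. To save the result, the string could be written with Files.newBufferedWriter(path, StandardCharsets.UTF_8); for a large index, writing directly to a Writer instead of building one string would use less memory.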
The following are a few examples (non-comprehensive) to illustrate the usage of the command-line
arguments. Consider the following example:
java Driver -path "../../project-tests/Project Tests/input/text/simple/hello.txt"
-index index-text-simple-hello.json
Note: The project tests account for different path separators (forward slash / for
Linux/Mac systems, and backward slash \ for Windows systems). Your code does
not have to convert between the two!
The above arguments indicate that Driver should build an inverted index from the single
hello.txt file in the input/text/simple subdirectory of the Project Tests directory, and
output the inverted index as JSON to the index-text-simple-hello.json file in the current
working directory.
The above arguments indicate that Driver should build an inverted index from all of the text
files found in the text/simple subdirectory of the Project Tests directory, and output the
inverted index as JSON to the default path index.json in the current working directory.
The above arguments indicate that Driver should build an inverted index from all of the
text files found in the input/text/simple subdirectory of the Project Tests directory, but
it should NOT produce an output file. It must still build the inverted index, however! (This will be
useful in the future when we add the ability to search.)
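Locating the text files under a directory, as the directory examples above require, might be sketched as follows. The class FinderSketch and its methods are hypothetical names, using Files.walk is an assumed approach, and returning a single input file regardless of its extension is one reading of the requirement to parse only that individual file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of finding text files under a path. */
class FinderSketch {
    /** Returns true for regular files ending in .txt or .text, ignoring case. */
    public static boolean isTextFile(Path path) {
        if (!Files.isRegularFile(path)) {
            return false;
        }
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".txt") || name.endsWith(".text");
    }

    /** Lists text files under a directory recursively, or the path itself if it is not a directory. */
    public static List<Path> listTextFiles(Path start) throws IOException {
        if (!Files.isDirectory(start)) {
            // Assumption: a single file is parsed regardless of its extension
            return Collections.singletonList(start);
        }
        // try-with-resources closes the directory stream when done
        try (Stream<Path> walk = Files.walk(start)) {
            return walk.filter(FinderSketch::isTextFile).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        try {
            for (Path path : listTextFiles(Path.of("."))) {
                System.out.println(path);
            }
        } catch (IOException e) {
            // a user-friendly message instead of a stack trace
            System.out.println("Unable to list files in the current directory.");
        }
    }
}
```

The paths returned here are built from the starting path as given, so a relative input yields relative results and nothing is converted to absolute form.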