CS335 Information Retrieval Project
Due April 19 (any time).
Main#1) Obtain your corpus of documents for the semester. To do so, come up with 10 neutral (i.e., non-controversial) queries (for example: Who was the 16th President?) to submit to your search engine. For each of the 10 queries, download the first 20 (non-controversial) webpages that the search engine returns, for a total of 200 HTML files. (We will discuss shortly how to process these using the Java Regex package. You may NOT use 3rd party code. You MUST write your own. You do not necessarily need regex, but it does make for much more concise code.)
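As a rough illustration of processing a downloaded page with the Java Regex package, the sketch below strips markup and splits the remaining text into lowercase words. The class name HtmlTokenizer, the method tokenize, and the three patterns are assumptions made for this example only; they are not part of any provided code, and real pages will need more care.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and patterns are illustrative assumptions.
final class HtmlTokenizer {
    // Matches a whole <script> or <style> element, including its contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Matches any remaining tag, such as <p> or </body>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // A "word" here is a run of ASCII letters and digits (a simplifying assumption).
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns the words of the page in document order, lowercased.
    static List<String> tokenize(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }
}
```

This sketch does not decode HTML entities (for example, &amp; would yield the word "amp") and does not handle comments or malformed markup, so treat it as a starting point only.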
Main#2) Identify a Stoplist (either download one or compute it yourself in separate code) and store it in a hash structure. (As mentioned later, you will need code to output your hash structure to an output text file.)
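One possible way to hold the Stoplist in a hash structure and write it back out is sketched below. It assumes the stoplist is a text file with one word per line; that file layout, the class name Stoplist, and the method names load and save are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; assumes one stopword per line in the input file.
final class Stoplist {
    // Reads the file into a HashSet so membership tests are constant time on average.
    static Set<String> load(Path file) throws IOException {
        Set<String> stop = new HashSet<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stop.add(word);
            }
        }
        return stop;
    }

    // Writes the words one per line, sorted so the output file is easy to inspect.
    static void save(Set<String> stop, Path file) throws IOException {
        Files.write(file, new TreeSet<>(stop), StandardCharsets.UTF_8);
    }
}
```

A HashSet is used here because only membership matters for a stoplist; a HashMap or Hashtable would work equally well if you want to attach data to each word.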
Main#3) Compute One Inverted Index collectively storing info for All of the above files. See the following links for an explanation of what an Inverted Index is (and is not, i.e., a forward index):
You are to use either hashmaps or hashtables (a separate email will provide tutorial links) to store the inverted index of your corpus. What information should you store in the inverted index? a) the word; b) the document it was found in; c) a vector specifying, for each occurrence of the word in that document, how many words from the beginning of the document it was found (for this count, include even the stopwords). You need to do this for every word in every document that is not a stopword.
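The structure described in a) through c) could be laid out as nested hashmaps, as in the sketch below. The class name InvertedIndex, its methods, and the choice to count positions from 0 are assumptions for illustration; the input is assumed to be the full word list of a document, stopwords included.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical layout: word -> (document name -> positions of the word in that document).
final class InvertedIndex {
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // tokens must be ALL words of the document in order, stopwords included,
    // so that positions are counted from the beginning of the document.
    void addDocument(String docName, List<String> tokens, Set<String> stoplist) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            String word = tokens.get(pos);
            if (stoplist.contains(word)) {
                continue; // stopwords are not indexed, but they still advance pos
            }
            postings.computeIfAbsent(word, w -> new HashMap<>())
                    .computeIfAbsent(docName, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Returns document -> positions for the word, or an empty map if the word is unknown.
    Map<String, List<Integer>> lookup(String word) {
        return postings.getOrDefault(word, new HashMap<>());
    }
}
```

For the word list "the president of the united states president" with stopwords "the" and "of", the entry for "president" records positions 1 and 6, because the skipped stopwords still count toward the offset.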
Ancillary#1) Command Line Parsing.
As mentioned briefly in class, the project must run from the command line using the official Oracle/Sun JDK, which is bundled with the javac (compiler) and java (execution) commands. The JDK is available from Oracle/Sun at https://www.oracle.com/technetwork/java/javase/downloads/index.html For testing purposes (for yourself) as well as evaluation purposes (for your grade), assume that all initial data is read from, and all subsequent output is written to, text files that are specified on the command line when your program is first executed. This involves incorporating “flags”. The following link is a straight-to-the-point tutorial:
http://journals.ecs.soton.ac.uk/java/tutorial/java/cmdLineArgs/parsing.html and the attached file has a number of examples of this concept. You are expected to adapt these concepts to your project.
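A minimal sketch of flag handling follows. The flag names used in the usage note (-corpus, -stem, -out) are invented for this example and are not required names; the class CommandLineArgs and its parse method are likewise hypothetical. A flag followed by a value is stored with that value, and a flag with no value is treated as an on/off switch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser: "-name value" pairs and bare "-name" switches.
final class CommandLineArgs {
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-") || arg.length() < 2) {
                throw new IllegalArgumentException("Unexpected argument: " + arg);
            }
            String name = arg.substring(1);
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            if (hasValue) {
                options.put(name, args[++i]); // consume the value as well
            } else {
                options.put(name, "true");    // bare switch
            }
        }
        return options;
    }
}
```

With this sketch, the arguments -corpus pages -stem -out index.txt would map corpus to "pages", stem to "true", and out to "index.txt". Your main method would pass its String[] args to parse and then open the named files.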
Ancillary#2) The ability to “query” your inverted index for such information as: a) does a specific word appear in any document? b) how many documents (and which ones) does a given word appear in? c) how many times (frequency) does a word appear in a given document? d) a printout of the inverted index entries pertaining to a given document.
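Queries a) through d) might look like the sketch below, assuming the inverted index is a map from word to (document name to list of positions). That layout, the class name IndexQueries, and the method names are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical query helpers over an index shaped as word -> (document -> positions).
final class IndexQueries {
    // a) Does the word appear in any document?
    static boolean appears(Map<String, Map<String, List<Integer>>> index, String word) {
        return index.containsKey(word);
    }

    // b) Which documents contain the word? size() of the result is the document count.
    static Set<String> documentsContaining(Map<String, Map<String, List<Integer>>> index, String word) {
        Map<String, List<Integer>> byDoc = index.get(word);
        return byDoc == null ? new TreeSet<>() : new TreeSet<>(byDoc.keySet());
    }

    // c) How many times does the word occur in the given document?
    static int frequency(Map<String, Map<String, List<Integer>>> index, String word, String doc) {
        Map<String, List<Integer>> byDoc = index.get(word);
        if (byDoc == null) {
            return 0;
        }
        List<Integer> positions = byDoc.get(doc);
        return positions == null ? 0 : positions.size();
    }

    // d) All index entries for one document: word -> positions, sorted by word for printing.
    static Map<String, List<Integer>> entriesFor(Map<String, Map<String, List<Integer>>> index, String doc) {
        Map<String, List<Integer>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> entry : index.entrySet()) {
            List<Integer> positions = entry.getValue().get(doc);
            if (positions != null) {
                result.put(entry.getKey(), positions);
            }
        }
        return result;
    }
}
```

Note that query d) has to scan every word, because an inverted index is keyed by word rather than by document.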
Phase 2 involves your corpus and asks you to:
REQ1) implement Porter’s Algorithm (stemming) on the words in your inverted index and store the results in some structure.
REQ2) For both the inverted index and Porter’s algorithm, you should ensure Persistence. Persistence means that you do not have to compute your inverted index, and now the results of Porter’s algorithm, more than once. The problem is that any time you exit your program you lose the data. Persistence lets the data survive after the program exits, so that it can be read back in when the program starts again. There are three common ways to persist: 1) write out to a text file, which you then read in at the beginning of each program run; 2) communicate with an actual database, since the database can persist the data for you, IF you set things up correctly;
3) (see attached) use Java Serialization. This allows a Java Object to “wrap” around all your data and be output (and then input) as one.
For this semester, you can only use #1 or #3.
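For option #3, a round trip through Java Serialization might look like the sketch below. It assumes the index is built from HashMap and ArrayList (both Serializable); the class name IndexStore, the method names, and the nested-map layout are assumptions for illustration and may differ from the attached example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical persistence helper for an index shaped as word -> (document -> positions).
final class IndexStore {
    // Writes the whole index to the file as a single serialized object.
    static void save(HashMap<String, HashMap<String, ArrayList<Integer>>> index, Path file)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(index);
        }
    }

    // Reads the index back; the cast is unchecked because generic types are erased at run time.
    @SuppressWarnings("unchecked")
    static HashMap<String, HashMap<String, ArrayList<Integer>>> load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (HashMap<String, HashMap<String, ArrayList<Integer>>>) in.readObject();
        }
    }
}
```

At startup the program could check whether the file exists and call load instead of rebuilding the index; the stemmed results could be stored the same way in a second file.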
REQ3) The program should allow the user to decide whether s/he wants stemming as part of the search process by setting a switch/flag on the command line.
REQ4) For each word in the inverted index, and for each document where that word is found (i.e., a nested double for-loop), calculate a “snippet” showing where the word is found. This may require you to revise your inverted index code to store more information for each word, namely the locations of the word in each document. These snippets will be displayed in Phase 3 in the results of your search engine.
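One simple way to form a snippet is to take a fixed window of words around a stored position, as sketched below. The window approach, the class name Snippets, and the radius parameter are assumptions for illustration; the assignment does not prescribe how a snippet is built. The sketch assumes positions are 0-based offsets into the document's full word list.

```java
import java.util.List;

// Hypothetical helper: a snippet is the words within `radius` of the hit, joined by spaces.
final class Snippets {
    static String snippet(List<String> tokens, int position, int radius) {
        int from = Math.max(0, position - radius);                  // clamp at document start
        int to = Math.min(tokens.size(), position + radius + 1);    // clamp at document end (exclusive)
        return String.join(" ", tokens.subList(from, to));
    }
}
```

The nested loop from REQ4 would call snippet once per (word, document) pair, for example with the first stored position of the word in that document.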
You may use any found code for Porter’s Algorithm and you may use the attached source code if useful. In both cases, if you do, you need to COMMENT in code where exactly the code is from (even the URL). ALL OTHER CODE MUST BE WRITTEN BY YOU ALONE.