  
     CMPT 456 Course Project 1  
    
●  The codebase is a fork of the Lucene/Solr open source code.

●  The purpose of having Lucene/Solr running inside a Docker container is to help you work on
   this assignment on almost any OS you prefer: Linux, Mac, or Windows. If you are curious
   about how the Docker container is built, look at the Dockerfile in the source code.
    
     Project Data  
    
We have included the data for you within the codebase, at location lucene/demo/data. In the
subsequent sections, you will use it to demonstrate the indexing and querying process.
    
     Compiling
 
    
●  Build the Docker image from the source code (make sure there is a "." (i.e., the current
   location) at the end of the command):

   docker build -t cmpt456-lucene-solr:6.6.7 .

   NOTE: Since Docker is not available for free for Windows OS, we recommend you use VirtualBox
   with Ubuntu or the Windows Subsystem for Linux (WSL).

●  Run the Docker image we just built in order to activate the Docker container:

   docker run -it cmpt456-lucene-solr:6.6.7
    
     Demo  
    
In this section, we help you get familiar with Lucene's basic components by running two simple
programs:
    
●  Index Files: this program uses standard analyzers to create tokens from input text files,
   converts them to lowercase, then filters out a predefined list of stop-words. The source
   code is stored in this file within the codebase.

●  Search Files: this program uses a query parser to parse the input query text, then passes
   it to the index searcher to look for matching results. The source code is stored in this
   file within the codebase.
    
You are expected to run these examples and understand the Lucene components used in the indexing
and querying process, in order to make further extensions in the programming tasks below.
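For reference, the two demo programs are typically invoked along these lines from inside the
container (the classpath setup and the index path/query values are assumptions; adjust them to
how the image builds and lays out the Lucene jars):

   java org.apache.lucene.demo.IndexFiles -index index -docs lucene/demo/data
   java org.apache.lucene.demo.SearchFiles -index index -query "sample query"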
    
     Text Parsing (30 pts)  
    
In the first part of the assignment, you will learn how to use Lucene to build search
capabilities for documents in various formats, such as HTML, XML, PDF, and Word. In fact, Lucene
does not care about the parsing of these and other document formats; it is the responsibility of
the application using Lucene to use an appropriate parser to convert the original format into
plain text before passing that plain text to Lucene.
    
In the class IndexFiles.java within the Demo section, you can see that it indexes the content of
HTML files, including all HTML tags (e.g., <body>, <head>, <table>). In this section, we want you
to create a new class called HtmlIndexFiles.java to:
 
    
●  Use an HTML parser to parse input files and extract only the title and text content of the
   HTML files. The text content should not contain any HTML tags.

●  Use standard analyzers to create tokens from the result of the parser, convert them to
   lowercase, then filter them out based on a predefined list of stop-words (similar to the way
   IndexFiles.java works).
    
Hint: there is an already implemented HTML parser in the class
org.apache.lucene.benchmark.byTask.feeds.DemoHTMLParser.
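A minimal sketch of the parsing step is given below. It assumes the nested DemoHTMLParser.Parser
class (from the lucene/benchmark module) exposes the extracted title and body text as fields, and
the field names "title" and "contents" are placeholders; verify both against the actual class in
this codebase. The directory traversal and IndexWriter setup can mirror IndexFiles.java.

   import java.io.BufferedReader;
   import java.io.InputStream;
   import java.io.InputStreamReader;
   import java.nio.charset.StandardCharsets;
   import java.nio.file.Files;
   import java.nio.file.Path;

   import org.apache.lucene.benchmark.byTask.feeds.DemoHTMLParser;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.document.TextField;
   import org.apache.lucene.index.IndexWriter;

   public class HtmlIndexFiles {
     /** Parses one HTML file and indexes only its title and tag-free text content. */
     static void indexHtmlDoc(IndexWriter writer, Path file) throws Exception {
       try (InputStream in = Files.newInputStream(file);
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
         // Assumption: DemoHTMLParser.Parser strips the markup and exposes the
         // extracted title and body text as public fields.
         DemoHTMLParser.Parser parsed = new DemoHTMLParser.Parser(reader);

         Document doc = new Document();
         doc.add(new TextField("title", parsed.title, Field.Store.YES));
         doc.add(new TextField("contents", parsed.body, Field.Store.NO));
         writer.addDocument(doc);
       }
     }
   }

Note that using DemoHTMLParser requires the lucene-benchmark module (and its HTML parsing
dependency) on the classpath.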
    
     Tokenization (30 pts)  
    
In the second part of the assignment, you will experience how plain text passed to Lucene for
indexing goes through a process generally called tokenization. Tokenization is the process of
breaking input text into small indexing elements – tokens. The way input text is broken into
tokens heavily influences how people will then be able to search for that text.
    
As you have seen in IndexFiles.java, we have used the class StandardAnalyzer in order to control
the tokenization process. Looking at its source code, you can see that this class overrides the
createComponents method to build a standard tokenization process that converts tokens to
lowercase, then filters them out based on a predefined list of stop-words.
    
In this section, we want you to create a class called CMPT456Analyzer.java to control the
tokenization process as follows:
    
     ●
    
     Hint: Porter stemmer is already implemented in Lucene. Make use of it.  
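As a starting point, a custom analyzer in this version of Lucene typically extends Analyzer and
overrides createComponents, chaining token filters after a StandardTokenizer. The sketch below
mirrors StandardAnalyzer's chain and adds a Porter stemming stage at the end; the constructor
parameter for the stop-word set and the exact filter package names are assumptions, so check them
against the requirements above and against StandardAnalyzer's imports in this codebase.

   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.Tokenizer;
   import org.apache.lucene.analysis.core.LowerCaseFilter;
   import org.apache.lucene.analysis.core.StopFilter;
   import org.apache.lucene.analysis.en.PorterStemFilter;
   import org.apache.lucene.analysis.standard.StandardFilter;
   import org.apache.lucene.analysis.standard.StandardTokenizer;
   import org.apache.lucene.analysis.util.CharArraySet;

   public class CMPT456Analyzer extends Analyzer {
     private final CharArraySet stopWords;

     public CMPT456Analyzer(CharArraySet stopWords) {
       this.stopWords = stopWords;
     }

     @Override
     protected TokenStreamComponents createComponents(String fieldName) {
       Tokenizer source = new StandardTokenizer();
       TokenStream stream = new StandardFilter(source);  // same normalization StandardAnalyzer applies
       stream = new LowerCaseFilter(stream);             // lowercase every token
       stream = new StopFilter(stream, stopWords);       // drop the predefined stop-words
       stream = new PorterStemFilter(stream);            // reduce each token to its Porter stem
       return new TokenStreamComponents(source, stream);
     }
   }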
    
     Similarity Metrics (40 pts)  
    
In the last part of the assignment, you will have the chance to work with one of the core modules
of the querying process: the ranking module. When a user issues a query, Lucene will use the index
created during the indexing process to look for matching documents. More importantly, these
matching documents will be sorted by a customizable ranking function before the final results are
returned to the user.
    
Before asking you to implement a ranking function, we want you to make use of Lucene to compute
some basic metrics:

(TermFreq(org.apache.lucene.index.Term))
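The truncated link above appears to point at IndexReader's per-term statistics (e.g.
totalTermFreq(org.apache.lucene.index.Term)). As a hedged sketch (the index path, field name, and
term value are placeholders), such statistics can be read directly from an open reader:

   import java.nio.file.Paths;

   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.store.FSDirectory;

   public class TermStats {
     public static void main(String[] args) throws Exception {
       // Open the index produced by the demo IndexFiles program ("index" is a placeholder path).
       try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
         Term term = new Term("contents", "lucene");
         System.out.println("docFreq       = " + reader.docFreq(term));       // documents containing the term
         System.out.println("totalTermFreq = " + reader.totalTermFreq(term)); // occurrences across all documents
       }
     }
   }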
    
Next, we want you to implement a custom ranking/similarity function based on TFIDFSimilarity
provided by Lucene. In particular, you need to create a class called CMPT456Similarity.java to
support custom tf() and idf() as follows:
    
   tf(t ∈ d) = (1 + frequency)^(1/3)

   idf(t) = 1 + log_5((docCount + 2) / (docFreq + 2))
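A minimal sketch of these overrides is given below, under the assumption that CMPT456Similarity
extends ClassicSimilarity, Lucene's concrete TFIDFSimilarity implementation, so that only tf()
and idf() change while the rest of the scoring formula is inherited:

   import org.apache.lucene.search.similarities.ClassicSimilarity;

   public class CMPT456Similarity extends ClassicSimilarity {

     @Override
     public float tf(float freq) {
       // tf(t in d) = (1 + frequency)^(1/3)
       return (float) Math.cbrt(1.0 + freq);
     }

     @Override
     public float idf(long docFreq, long docCount) {
       // idf(t) = 1 + log_5((docCount + 2) / (docFreq + 2)), with log_5(x) = ln(x) / ln(5)
       return (float) (1.0 + Math.log((docCount + 2.0) / (docFreq + 2.0)) / Math.log(5.0));
     }
   }

For the metric to apply consistently, the same similarity would be set at index time via
IndexWriterConfig.setSimilarity(...) and at query time via IndexSearcher.setSimilarity(...).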
    