November 25, 2023

Java Assignment Help | CMPT 456 Course Project 1

This Java assignment is to complete a query parser that parses the input query text.

CMPT 456 Course Project 1

●

Codebase: a fork of the Lucene/Solr open source code

●

The purpose of running Lucene/Solr inside a Docker container is to let you work on this assignment using almost any OS you prefer: Linux, Mac or Windows. If you are curious about how the Docker container is built, look at the Dockerfile in the source code.

Project Data

We have included the data for you within the codebase at lucene/demo/data. In the subsequent sections, you will use it to demonstrate the indexing and querying process.

Compiling

●

Build the Docker image from the source code (make sure there is a . (i.e. the current location) at the end of the command):

docker build -t cmpt456-lucene-solr:6.6.7 .

NOTE: Since Docker is not freely available for Windows OS, we recommend using VirtualBox with Ubuntu OS or Windows Subsystem for Linux (WSL).

●

Run the Docker image we just built in order to activate the Docker container:

docker run -it cmpt456-lucene-solr:6.6.7

Demo

In this section, we help you get familiar with Lucene's basic components by running 2 simple programs:

●

Index Files: this program uses standard analyzers to create tokens from input text files, convert them to lowercase, then filter out a predefined list of stop-words. The source code is stored in this file within the

●

Search Files: this program uses a query parser to parse the input query text, then passes it to the index searcher to look for matching results. The source code is stored in this file within the

You are expected to run these examples and understand the Lucene components used in the indexing and querying process, so that you can extend them in the programming tasks below.
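To make the indexing and querying flow concrete, here is a minimal sketch of the two steps in plain Java. It does not use Lucene: the class TinyIndexSketch and its methods are hypothetical, the tokenizer is a simple split on non-word characters, and a query matches only documents that contain every query term, which is a simplification rather than the behaviour of Lucene's query parser.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy inverted index: term -> sorted set of document ids. */
final class TinyIndexSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexing step: tokenize the text and record docId under each token. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Querying step: parse the query into terms and intersect their postings. */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String token : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(token, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term seeds the result
            } else {
                result.retainAll(docs);         // later terms narrow it down
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndexSketch index = new TinyIndexSketch();
        index.add(0, "Lucene builds an index");
        index.add(1, "Solr uses the Lucene index");
        index.add(2, "Docker runs containers");
        System.out.println(index.search("lucene index"));
        System.out.println(index.search("docker"));
    }
}
```

The real demo programs add analysis, scoring and on-disk storage on top of this idea; the sketch only shows why the same tokenization must be applied at index time and at query time.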

Text Parsing (30 pts)

In the first part of the assignment, you will learn how to use Lucene to build search capabilities

for documents in various formats, such as HTML, XML, PDF, Word. In fact, Lucene does not care

about the parsing of these and other document formats, and it is the responsibility of the

application using Lucene to use an appropriate parser to convert the original format into plain

text before passing that plain text to Lucene.

In the class IndexFiles.java within the Demo section, you can see that it indexes the content of

html files, including all html tags (e.g., <body>, <head>, <table>). In this section, we want you to

create a new class called HtmlIndexFiles.java to:

●

Use an HTML parser to parse the input files and extract only the title and text content of the HTML files. The text content should not contain any HTML tags.

●

Use standard analyzers to create tokens from the parser output, convert them to lowercase, then filter them against a predefined list of stop-words (similar to the way IndexFiles.java works).

Hint: there is an already implemented HTML parser in this class: org.apache.lucene.benchmark.byTask.feeds.DemoHTMLParser
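As a rough illustration of what the parsing step should produce, the sketch below extracts a title and tag-free text using only the Java standard library. It is not DemoHTMLParser and is not a substitute for it: the class HtmlTextSketch is hypothetical, regular expressions are a crude way to handle HTML, and entities such as &amp; are left undecoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a title and plain text out of an HTML string. */
final class HtmlTextSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
        "<(script|style)[^>]*>.*?</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the content of the first title element, or "" if there is none. */
    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Returns the text content with the title, scripts, styles and all tags removed. */
    static String text(String html) {
        String withoutTitle = TITLE.matcher(html).replaceAll(" ");
        String withoutScripts = SCRIPT_OR_STYLE.matcher(withoutTitle).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutScripts).replaceAll(" ");
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                    + "<body><h1>Lucene</h1><p>indexes <b>plain</b> text</p></body></html>";
        System.out.println(title(html)); // the title field
        System.out.println(text(html));  // the body text field
    }
}
```

In HtmlIndexFiles.java the two strings would be stored as separate fields of the indexed document, so that only the title and the visible text are searchable.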

Tokenization (30 pts)

In the second part of the assignment, you will experience how plain text passed to Lucene for

indexing goes through a process generally called tokenization. Tokenization is the process of

breaking input text into small indexing elements – tokens. The way input text is broken into

tokens heavily influences how people will then be able to search for that text.

As you have seen in IndexFiles.java, we have used the class StandardAnalyzer to control the tokenization process. If you look at its source code, you can see that this class overrides the createComponents method to build a standard tokenization process that converts tokens to lowercase, then filters them against a predefined list of stop-words.
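The three stages described above (tokenize, lowercase, remove stop-words) can be sketched in plain Java as follows. This is only an illustration of the pipeline order, not Lucene code: the class AnalyzerPipelineSketch is hypothetical, the tokenizer is a simple split on non-alphanumeric characters, and the stop-word list is a made-up sample rather than Lucene's default set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for an analyzer: tokenizer followed by two filters. */
final class AnalyzerPipelineSketch {
    // Illustrative stop-word list (an assumption, not Lucene's actual default).
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: tokenize on runs of characters that are not letters or digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            // Stage 2: lowercase filter.
            String lower = raw.toLowerCase(Locale.ROOT);
            // Stage 3: stop-word filter, applied after lowercasing.
            if (!STOP_WORDS.contains(lower)) {
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Fox is IN the Index"));
    }
}
```

The order matters: because stop-words are compared after lowercasing, "The" and "the" are both removed. A custom analyzer keeps the same shape and adds or swaps stages in the chain.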

In this section, we want you to create a class called CMPT456Analyzer.java to control the

tokenization process as follows:

●

Hint: Porter stemmer is already implemented in Lucene. Make use of it.

Similarity Metrics (40 pts)

In the last part of the assignment, you will have a chance to work with one of the core modules of the querying process: the ranking module. When a user issues a query, Lucene uses the index created during the indexing process to look for matching documents. More importantly, these matching documents are sorted by a customizable ranking function before the final results are returned to the user.

Before asking you to implement a ranking function, we want you to make use of Lucene to

compute some basic metrics:


Next, we want you to implement a custom ranking/similarity function based on TFIDFSimilarity provided by Lucene. In particular, you need to create a class called CMPT456Similarity.java to support custom tf() and idf() as follows:

$$\mathrm{tf}(t \in d) = (1 + \mathit{frequency})$$

$$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathit{docCount} + 2}{\mathit{docFreq} + 2}\right)$$
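The arithmetic of the two functions can be checked with a small standalone sketch. It makes several assumptions: the class Cmpt456SimilaritySketch is hypothetical and does not extend any Lucene class, tf is taken at face value as 1 + frequency, idf is taken as 1 + log((docCount + 2) / (docFreq + 2)), and the logarithm is assumed to be the natural log. The parameter lists mirror the tf(float) and idf(long, long) methods of TFIDFSimilarity in Lucene 6.x, so the bodies could be moved into overrides in the real class.

```java
/** Hypothetical standalone version of the custom tf() and idf() arithmetic. */
final class Cmpt456SimilaritySketch {

    /** tf(t in d) = 1 + frequency, where freq is the term's count in the document. */
    static float tf(float freq) {
        return 1.0f + freq;
    }

    /**
     * idf(t) = 1 + log((docCount + 2) / (docFreq + 2)).
     * Natural log is an assumption; the division is done in double
     * to avoid integer truncation.
     */
    static float idf(long docFreq, long docCount) {
        return (float) (1.0 + Math.log((docCount + 2.0) / (docFreq + 2.0)));
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document.
        System.out.println(tf(3.0f));
        // A term found in 8 of 98 documents: 1 + ln(100 / 10).
        System.out.println(idf(8, 98));
    }
}
```

The +2 in both numerator and denominator keeps the ratio finite when docFreq is 0, and a term that appears in every document gets an idf of exactly 1.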
