COM3110 Text Processing (2020/21)
Assignment: Document Retrieval
Various choices are made when preprocessing documents before indexing (e.g. whether a stoplist
is used, whether terms are stemmed, etc.), with various consequences (e.g. for the effectiveness
of retrieval, the size of the index, etc.). To simplify the task, the files provided include several
precomputed index files for the document collection, which were generated according to different
preprocessing choices, i.e. whether a stoplist was used or not, and whether stemming
was applied or not (giving files such as index_nostoplist_nostemming.txt, and so on).
Correspondingly preprocessed versions of the queries are also provided (e.g. the file
queries_nostoplist_nostemming.txt, and so on). As such, the original ‘non-preprocessed’
files documents.txt and queries.txt are provided only for information/inspection. They
are not required for the work you must do, and should not be accessed by your code.
Code files: The materials provided include the code file ir_engine.py, which is the ‘outer
shell’ of a retrieval engine: it loads an index and a preprocessed query set, and then ‘batch
processes’ the queries, i.e. uses the index to compute the 10 best-ranking documents for each
query, which it prints to a results file. Run this program with its help option (-h) for
information on its command line options. These include flags for whether stoplisting and/or
stemming are applied during preprocessing, which are used to determine which of the index
and query files to load. Another option allows the user to set the name of the file to which
results are written. A final option allows the user to select the term weighting scheme used
during retrieval, with a choice of binary, tf (term frequency) and tfidf modes.
The Python script eval_ir.py calculates system performance scores by comparing the
collection gold standard (cacm_gold_std.txt) to a system results file (which lists the ids of the
documents the system returns for each query). Execute the script with its help option (-h)
for instructions on use.
The program ir_engine.py can be executed to generate a results file, but you will find that it
scores zero for retrieval performance. The program does implement various aspects of the required
functionality, i.e. it processes the command line, loads the selected index file into a suitable
data structure (a two-level dictionary), loads the preprocessed queries, runs a batch process
over the queries, and prints the results to file. However, it does not include a sensible
implementation of the functionality for computing which documents are most relevant to
a given query, based on the index. This functionality is to be provided by the class Retrieve,
which ir_engine.py imports from the file my_retriever.py, but the definition currently
provided in that file is just a ‘stub’ which returns the same result for every query (namely
a list of the numbers 1 to 10, as if these were the ids of the documents selected as relevant).
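To make the ‘two-level dictionary’ concrete, the following is a minimal sketch using a made-up toy index. The terms, document ids and counts are invented for illustration, and the helper names document_frequency and collection_ids are hypothetical; they are not part of the provided code.

```python
# Toy index in the shape described above: term -> {doc_id -> count}.
# All values here are invented purely for illustration.
toy_index = {
    "retrieval": {1: 2, 3: 1},
    "index": {1: 1, 2: 4},
    "query": {3: 2},
}


def document_frequency(index, term):
    """Return the number of documents containing `term` (0 if unseen)."""
    return len(index.get(term, {}))


def collection_ids(index):
    """Return the set of all document ids that appear in the index."""
    ids = set()
    for postings in index.values():
        ids.update(postings)  # updating with a dict adds its keys
    return ids
```

Note that collection_ids only sees documents with at least one indexed term, so using its size as the collection size is itself an assumption.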
Your task is to complete the definition of the Retrieve class, so that the overall IR system
performs retrieval based on the vector space model. Ideally, your implementation should allow
retrieval under alternative term weighting schemes, as selected using the “-w” command line
flag, i.e. under binary, term frequency and TFIDF schemes. You should then evaluate the
performance of the system over the CACM test collection under alternative configurations,
arising from alternative preprocessing choices and the available term weighting schemes.
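As a rough illustration of how the three weighting schemes differ, the sketch below computes the weight of a single term from its raw count. The function name term_weight and its parameters are hypothetical, and the IDF formula used (natural log of collection size over document frequency) is only one common variant, assumed here for illustration rather than prescribed by the assignment.

```python
import math


def term_weight(count, doc_freq, num_docs, scheme):
    """Weight of one term under a given scheme (illustrative sketch).

    count    -- occurrences of the term in the document or query
    doc_freq -- number of documents containing the term (assumed > 0)
    num_docs -- total number of documents in the collection
    scheme   -- one of "binary", "tf", "tfidf"
    """
    if count == 0:
        return 0.0
    if scheme == "binary":
        return 1.0  # presence only
    if scheme == "tf":
        return float(count)  # raw term frequency
    if scheme == "tfidf":
        # Assumed IDF variant: log(N / df), natural logarithm.
        return count * math.log(num_docs / doc_freq)
    raise ValueError(f"unknown weighting scheme: {scheme}")
```

A term that occurs in every document gets a tfidf weight of zero under this variant, which is the intended effect of IDF: such a term does nothing to distinguish documents.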
What to Submit
Your assignment work is to be submitted electronically using Blackboard, and should include:
1. Your Python code, as a modified version of the file my_retriever.py. Do NOT submit any
other code files. Your implementation should not depend on any changes made to the file
ir_engine.py. (Any such dependency will cause your code to fail at testing time, as this
will involve placing your code alongside a ‘fresh’ copy of the other files that are needed.)
Your code file should not open any other files at all. Rather, it should take its inputs, and
return its results, solely through its interactions with the code in ir_engine.py.
2. A short report (as a pdf file), which should NOT EXCEED 2 PAGES IN LENGTH (excluding a title page, should you wish to have one). The report may include a brief description of the extent of the implementation achieved (this is only really important if you have
not completed a sufficient implementation for performance testing), and should present the
performance results you have collected under different configurations, and any conclusions
you draw from your analysis of these results. Graphs/tables may be used in presenting
your results, to aid exposition.
A total of 30 marks are available for the assignment and will be assigned based on the following criteria:
Implementation and Code Style (20 marks)
How many of the alternative weighting schemes have been correctly implemented? How efficient is the implementation (i.e. how quickly are results returned)? Have appropriate Python
constructs been used? Is the code comprehensible and clearly commented?
Report (10 marks)
Is the report a clear and accurate description of the implementation? How complete and
accurate is the discussion of the performance of the system under a range of configurations?
What inferences can be drawn about the performance of the IR system from these results?
Guidance on Use of Python Libraries
The use of certain low-level libraries is fine (e.g. math to compute sqrt). The use of
intermediate-level libraries like numpy and pandas is discouraged. Our experience is that students
using these libraries do not use them effectively and end up producing code that is less clear and
less efficient than that of students who simply implement from the ground up and thus retain clear
control and understanding over what they are doing.
The use of high level libraries that implement some of the core functionality you are asked to
produce (e.g. scikit-learn or whoosh or other implementations of the vector space model or
aspects of it, computing IDF weights, etc) will be seriously penalised – this is the stuff you
are meant to do yourself!
If in doubt about whether to use any 3rd party code, ask.
Notes and Comments
1. Study the code in the file ir_engine.py, particularly with a view to understanding: (i) how
the retrieval index is stored within the program (as a two-level dictionary structure, mapping
terms to doc-ids to counts), (ii) the representation of individual queries (as a dictionary
mapping query-terms to counts), and (iii) how the code of the Retrieve class, that you
are to complete, is called from the main program.
2. When retrieving relevant documents for an individual query, the set of candidate documents
to be considered for retrieval consists of those containing at least one of the terms from the
query, i.e. the candidate set is the union of the document sets for the individual query terms.
Having computed this set, a similarity score can be computed for each candidate and used to
rank the candidates, so that the top 10 can be returned.
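The candidate-set and ranking steps in note 2 can be sketched as follows, assuming the index and query shapes described in note 1 and plain term-frequency weights. The function names candidate_docs and rank_by_tf are hypothetical, and this is only an outline under those assumptions, not the expected solution: it recomputes document vector lengths on every call, which a real implementation would want to avoid.

```python
import math


def candidate_docs(index, query):
    """Union of the document sets for the individual query terms."""
    candidates = set()
    for term in query:
        candidates.update(index.get(term, {}))  # adds the doc ids
    return candidates


def rank_by_tf(index, query, top_n=10):
    """Rank candidates by cosine similarity using raw tf weights."""
    # Squared vector length of every document, over all of its terms.
    doc_len_sq = {}
    for postings in index.values():
        for doc_id, count in postings.items():
            doc_len_sq[doc_id] = doc_len_sq.get(doc_id, 0) + count * count

    # Dot product of query and document, accumulated term by term.
    # Only documents sharing a term with the query ever get a score.
    scores = {}
    for term, q_count in query.items():
        for doc_id, d_count in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0) + q_count * d_count

    # The query length is the same for every document, so dividing by
    # it would not change the ranking and is omitted here.
    def similarity(doc_id):
        return scores[doc_id] / math.sqrt(doc_len_sq[doc_id])

    # Highest similarity first; ties broken by smaller doc id.
    ranked = sorted(scores, key=lambda d: (-similarity(d), d))
    return ranked[:top_n]
```

Accumulating scores by walking the postings of each query term means documents outside the candidate set are never touched, which is what makes the union formulation efficient.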