
Information Retrieval Evaluation CE205 Assignment 2
1. Specification
The aim of the project is to index a set of documents in Whoosh, devise a set of test queries and evaluate the system on those queries.
2. Indexing the Documents
Firstly, index the complete Wikipedia by following the instructions in Section 10 below.
3. Topic and Questions
3.1 Think of an application domain (i.e. a subject) which is of interest to you
For example, you could choose health, politics, sport, geography, music (any kind) etc. The topic needs to be covered in the collection; because the collection is quite large, you will find you have a wide choice of topics. Note, however, that this version of Wikipedia is a few years old, so the latest athletes, singers etc. may not be mentioned.
3.2 Now, devise twenty questions in your chosen domain
For example: What are the species of big cats? Where are white tigers found? What is a Tigon? How many lions are in an average pride?
Note: These are just examples chosen by a student who was interested in big cats.
You can choose any topic at all provided that:
• It is covered by the documents you are using;
• You can think of some quite difficult queries on your chosen topic.

3.3 Convert questions to queries in the Whoosh query language
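As a rough illustration, the sketch below assembles query strings from operators in Whoosh's default query language (double-quoted phrases, AND, OR and parentheses). The helper build_query and the keyword choices are hypothetical and not part of the lab code; how terms are actually matched also depends on the analyser used when the index was built.

```python
def build_query(phrases=(), any_of=()):
    """Assemble a query string for Whoosh's default query language.

    phrases: terms or phrases that must all occur
    any_of:  alternatives, at least one of which must occur
    (Hypothetical helper for illustration; not part of the lab code.)
    """
    def quote(text):
        # Multi-word phrases need double quotes; single terms do not.
        return '"%s"' % text if " " in text else text

    parts = [quote(p) for p in phrases]
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    return " AND ".join(parts)


# Question: "Where are white tigers found?"
print(build_query(phrases=["white tiger"], any_of=["habitat", "wild", "found"]))
# Question: "What is a Tigon?"
print(build_query(phrases=["tigon"], any_of=["hybrid", "cross"]))
```

Typing the resulting strings directly into the Web interface is equivalent; the helper only makes the structure of each query explicit so that it can be recorded in the appendices.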
4. Retrieval Experiments
4.1 Test the performance of your system using BM25
First, run the Whoosh queries, as prepared above, through the system and analyse the files returned for each to determine their relevance. The Web interface makes this easy to do.
Second, compute precision and recall at the following levels of n (where n is the number of documents considered): n=5, n=10.
To do this, for each query you need to look at the first ten results (i.e. files) returned and see for each file whether it is Relevant or Not Relevant. A file is relevant if it contains the answer to your question. It does not matter where in the file the answer occurs as long as it is present somewhere.
This is not as easy as it sounds since there will be occasions when you will not be sure. You need to make a note of the rationale for making your final decision in cases of doubt.
Computing recall poses a problem in that we need to know for each query all the correct answers in the collection. Strictly, we cannot know that without inspecting every document in the collection. At TREC they use a pooling method as discussed in lectures. To get around the problem here, simply check the first 20 documents returned for each query. Count the number of correct responses there and assume that these are all the correct responses in the collection. Then use this information to compute recall at n=5 and n=10 as above.
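The calculation described above can be sketched as follows. This is a minimal illustration, not part of the lab code: the function name precision_recall_at and the list of judgements are made up, and the relevant documents found in the first 20 results are assumed to be all the relevant documents in the collection, as the text suggests.

```python
def precision_recall_at(judgements, n, pool_depth=20):
    """Return (precision, recall) at cut-off n for one query.

    judgements: list of booleans in rank order (rank 1 first),
                True where the returned file was judged Relevant.
    pool_depth: how many top results are inspected to estimate the
                total number of relevant documents (20 in the text).
    """
    pool = judgements[:pool_depth]
    total_relevant = sum(pool)      # assumed to be all relevant docs
    hits = sum(pool[:n])            # relevant docs in the top n
    precision = hits / n
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Invented judgements for one query: relevant at ranks 1, 3, 6 and 12.
example = [True, False, True, False, False,
           True, False, False, False, False,
           False, True] + [False] * 8

for n in (5, 10):
    p, r = precision_recall_at(example, n)
    print("n=%d  precision=%.2f  recall=%.2f" % (n, p, r))
```

In this invented case four relevant files are found in the top 20, so at n=5 precision is 2/5 and recall is 2/4, and at n=10 precision is 3/10 and recall is 3/4. A query with no relevant files in the top 20 is given recall 0 here; that is a design choice which should be stated in the report.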

4.2 Test the performance of your system using TF*IDF
Repeat the above steps using TF*IDF. You do not need to re-index the collection, just use a different query function as shown in the Python program. You can select between BM25 and TF*IDF for each query via the Web interface.
In the report, you should compare the results of BM25 and TF*IDF (see report template file).
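One possible way to organise the comparison is sketched below. It takes one score per query for each model (for example precision at n=10) and reports how often each model was better, together with the mean for each. The function name compare_runs and the numbers are invented placeholders, not real results.

```python
from statistics import mean


def compare_runs(bm25, tfidf):
    """Compare two runs given dicts mapping query id -> score.

    Returns (outcome counts, mean BM25 score, mean TF*IDF score).
    Both dicts are assumed to contain the same query ids.
    """
    outcome = {"tfidf_better": 0, "tie": 0, "bm25_better": 0}
    for qid in bm25:
        if tfidf[qid] > bm25[qid]:
            outcome["tfidf_better"] += 1
        elif tfidf[qid] < bm25[qid]:
            outcome["bm25_better"] += 1
        else:
            outcome["tie"] += 1
    return outcome, mean(bm25.values()), mean(tfidf.values())


# Placeholder precision-at-10 values for three queries (not real data).
bm25_p10 = {"q1": 0.4, "q2": 0.2, "q3": 0.5}
tfidf_p10 = {"q1": 0.3, "q2": 0.2, "q3": 0.6}

counts, bm25_mean, tfidf_mean = compare_runs(bm25_p10, tfidf_p10)
print(counts)
print("mean BM25 = %.3f, mean TF*IDF = %.3f" % (bm25_mean, tfidf_mean))
```

Counting wins, ties and losses per query alongside the means shows whether one model is consistently better or only better on some queries.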
4.3 Carry out an additional experiment
For extra marks, you can carry out one of two additional experiments. First, you can investigate the effect of the parameters B and K1. You will see from the code that they are currently set to 0.75 and 1.2 respectively. Have a look in the literature and see if you can find some hints about good values for these, depending on the length, content and other characteristics of your documents. Then change these settings in the code, re-run your queries and re-compute your results. In your report, reference the article(s) you looked at and include your rationale for the new settings as well as describing how the results differ.
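To get a feel for what B and K1 control before changing them, the sketch below evaluates the term-frequency part of the standard Okapi BM25 formula. It is an illustration under assumptions: the function name and the document lengths are invented, the IDF factor is left out, and the exact computation inside Whoosh may differ in detail.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency part of the standard Okapi BM25 weight.

    tf:          occurrences of the query term in the document
    doc_len:     length of the document (in terms)
    avg_doc_len: average document length in the collection
    k1:          controls how quickly repeated occurrences saturate
    b:           controls length normalisation (0 = none, 1 = full)
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Invented lengths: one average-length and one long document.
for b in (0.0, 0.75, 1.0):
    short = bm25_tf_component(tf=3, doc_len=100, avg_doc_len=100, b=b)
    long_ = bm25_tf_component(tf=3, doc_len=400, avg_doc_len=100, b=b)
    print("b=%.2f  average-length doc: %.3f  long doc: %.3f" % (b, short, long_))
```

With B set to 0 the document length has no effect; as B approaches 1, long documents are penalised more heavily. Larger values of K1 let repeated occurrences of a term keep adding to the score for longer before it levels off.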
Second, you could look at the Named Entity classification as indicated by the class tag in the documents. The class can be PERSON, LOCATION or ORGANIZATION. For example, search for ‘10152’ and it will come up with a document on Erasmus. At the top you will see PERSON. This means the program which created this Wikipedia dump decided that Erasmus was a person, which is in fact true. Your task here is to establish how accurate these tags are. You need to make a list of at least twenty documents which have a non-null class of PERSON, twenty of LOCATION and twenty of ORGANIZATION. Now analyse these documents and decide for each whether the classification is correct or not. Put the results in a table and give a short text analysing the results. For example, in some cases it may not be easy to decide whether a classification is right or not.
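The table for this experiment could be summarised with a few lines of Python, as in the sketch below. The function name accuracy_by_class is hypothetical. Apart from document 10152 (Erasmus, tagged PERSON), which is mentioned above, the entries are invented placeholders rather than real documents or judgements.

```python
from collections import defaultdict


def accuracy_by_class(judgements):
    """Return {class: fraction judged correct}.

    judgements: iterable of (docno, tagged_class, is_correct) tuples,
                where is_correct is your own manual decision.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for _docno, tagged_class, is_correct in judgements:
        totals[tagged_class] += 1
        if is_correct:
            correct[tagged_class] += 1
    return {cls: correct[cls] / totals[cls] for cls in totals}


sample = [
    ("10152", "PERSON", True),         # Erasmus, from the example above
    ("docA", "PERSON", False),         # placeholder entry
    ("docB", "LOCATION", True),        # placeholder entry
    ("docC", "ORGANIZATION", True),    # placeholder entry
    ("docD", "ORGANIZATION", False),   # placeholder entry
]

for cls, acc in sorted(accuracy_by_class(sample).items()):
    print("%-12s %.2f" % (cls, acc))
```

Doubtful cases can be handled by recording them separately and discussing them in the text, rather than forcing them into correct or incorrect.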
5. What to Hand in
Write up your results in a short report USING THE TEMPLATE SUPPLIED with the following headings exactly as shown in the template:
1. Topic and Queries – What topic you chose and why; how the queries were devised.

2. Indexing the Documents
– How was this done? – What problems were encountered and how were they solved?
3. BM25 Performance
3.1 Method – short text outlining what you did.
3.2. Results – a table summarising the numerical results.
3.3. Discussion – a short description of what the results show (did the system perform well or not), any interesting problem cases, any technical problems encountered and so on.
4. TF*IDF Performance
4.1 Method – short text outlining what you did.
4.2. Results – a table summarising the numerical results as above.
4.3. Discussion – a short description of what the results show (was TF*IDF always better, always worse or sometimes better/worse?), any interesting problem cases, any technical problems encountered and so on.
5. Additional Experiment (Optional)
If you like, carry out one of the additional experiments as described above.
Appendix 1
– include the queries you used for your BM25 evaluation and the IDs of the right answers found for each (if any).
Appendix 2
– include the queries you used for your TF*IDF evaluation (identical to Appendix 1) and the IDs of the right answers found (not identical to Appendix 1).
Note that the ID is the field called docno in the document collection, like this: 1000003

Here the docno is 1000003. It is easy to find because the file is called 1000003.xml, so you will see it in the search results.
The report should be about six pages long.
6. How to Submit
Use the Faser Electronic Submission Page. Submit one pdf file to Faser called:
This will be your report, created according to the report template.
Note that the filename is all lower case, it does not contain any spaces and it is a .pdf file.
givenname is your given name with no spaces and all lower case e.g. alba. surname is your surname with no spaces and all lower case e.g. garcia.
Please note that files which are in .doc, .docx, .rtf etc are not acceptable. Only .pdf is allowed.
7. Submission Deadline
11h59 Monday, 15th January 2017.
8. Marking Scheme
Assignment 2 counts for 10% of the final mark for the CE205 Module.
Characteristics of an excellent project (80%):
• Very carefully crafted choice of queries which are asking something of substance concerning a range of aspects of your chosen subject;
• Thorough description of the indexing process and explanation of any problems encountered and their solutions;
• Both BM25 and TF*IDF experiments carried out (Sections 4.1 and 4.2 above);

• Very thorough analysis of the results for both BM25 and TF*IDF;
• One additional experiment carried out (Section 4.3 above);
• Generally excellent report.
Characteristics of a good project (60%):
• Good and carefully thought-out queries;
• Indexing carried out successfully;
• Experiment carried out well with BM25 (Section 4.1 above);
• Experiment carried out with TF*IDF in addition to BM25 (Section 4.2 above);
• Good discussion of results;
• Generally good report.
Characteristics of a fair project (40% or less):
• Queries of some kind were composed;
• Indexing was carried out successfully;
• Experiment carried out adequately with BM25 (Section 4.1 above);
• Minimal report.
9. Plagiarism
You should work individually on this project. Anything you submit is assumed to be entirely your own work. The usual Essex policy on plagiarism applies.

10. Instructions for Indexing
We have completed several labs on Whoosh so you should be able to get it working without difficulty.
Now, check through ce205_17_lab06_whoosh_ir_2.pdf as it contains updated instructions which explain how to use the web interface. Make sure you have everything running correctly with the small set of files (i.e. just part-0) before proceeding.
To index the complete Wikipedia, you just need to change one line in awf:
files = glob.glob( 'c:\\wikipedia\\resources\\wikipedia\\processed\\part-0\\*.xml' )
files = glob.glob( 'c:\\wikipedia\\resources\\wikipedia\\processed\\*\\*.xml' )
Both lines are already there; you just need to comment out the first and uncomment the second.
On any lab machine there is a directory on the C drive, based on your username, which you can use. This contains plenty of disk space. No one else can access this directory. However, it may be wiped out during the night, so it is only temporary. Suppose your username is abcd, the path of your directory is c:\Users\abcd .
The recommended procedure for indexing is as follows:
• Go to a Lab;
• Copy your ir directory from m: to c:\Users\abcd (assuming your username is abcd);
• Check that you have in c:\Users\abcd\ir. Check that ir contains directories indexdir2 and results. Check that results contains header.xml and wiki.css (see ce205_17_lab06_whoosh_ir_2.pdf for detailed instructions);
• Check that the Wikipedia files are actually on your C drive at the above path;
• Create the directory indexdir2;

• Make sure the line files = glob.glob… in the program points to the whole Wikipedia (see above);
• Confirm that you are running a CMD window and your current directory there is c:\Users\abcd\ir;
• In the CMD window, start Python and at its prompt do:
from awf_analyse_wiki_files_34 import *
• Indexing should commence. You will see the filenames on the screen so you can check progress. You can expect the initial indexing to take about 60 minutes on a Lab machine. Once all the files have been indexed, Whoosh then creates the final inverted index. This can take a further 35 minutes during which nothing seems to happen in the CMD window. However, if you check the free space in another CMD window you will see that it is still being used up. Eventually, the Python prompt should appear, meaning you have finished.
• Once created, the index will be in indexdir2 and is about 1.4GB in size. It will be deleted automatically during the night, so you must use it immediately. You can save it to a USB drive or memory stick for later use. In that case, you will need to restore indexdir2 with exactly the same contents (possibly on another Lab machine). The absolute paths of your indexdir2 and your Python program must be exactly the same on the new machine.
• Note that the file part-60000\1166530.xml is a rogue file which crashes the indexer. Thus you will see it is excluded in the Python program.
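The file selection described in these steps can be pictured with the sketch below, which lists every part-*\*.xml file under the processed directory and skips the rogue file. It is only an illustration: the function names are hypothetical, and the supplied Python program may exclude the file in a different way.

```python
import glob
import os

# Directory name and filename of the rogue file named in the text.
ROGUE = ("part-60000", "1166530.xml")


def is_rogue(path):
    """True if path is the file known to crash the indexer."""
    parent = os.path.basename(os.path.dirname(path))
    return (parent, os.path.basename(path)) == ROGUE


def list_wiki_files(processed_dir):
    """Return all XML files one level below processed_dir, minus the rogue file."""
    pattern = os.path.join(processed_dir, "*", "*.xml")
    return [f for f in sorted(glob.glob(pattern)) if not is_rogue(f)]


files = list_wiki_files(r"c:\wikipedia\resources\wikipedia\processed")
print(len(files), "files selected for indexing")
```

Printing the number of files before indexing starts is a quick check that the pattern points at the whole collection rather than just part-0.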
The procedure for carrying out your retrieval experiments is as follows:
• In a CMD window, start the web server (see Whoosh lab 2 for detailed instructions);
• Submit each of your twenty queries to the system via the Web interface and inspect the results by clicking on them;
• Remember that files saved on the C drive are not permanent and may be deleted during the night. Save to USB drive so you can resume later.

