COMP20007 Assignment 1: Multi-word Queries
In this assignment you will implement a small part of a text search engine: using an existinginvertedfile indexfor a multi-word query to find the top-matching documents in a large document collection.
The assignment consists of 2 coding tasks and 1 written task. Each task is described in detail in thesections below.
Before you attempt this assignment you will need to download the following code and data files fromthe LMS:
main.c Entry point to program.Do not change. list.h, list.c Singly-linked list module.Do not change. index.h, index.c Inverted file index module.Do not change. query.h Prototypes for functions you must implement.Do not change. query.c Define your functions here, according to specification below. heap.h, heap.c Binary min-heap module. Empty: see week 4 lab task. Makefile To assist with compilation. Edit if you add any extra files. data/ Directory containing many term-based inverted file index files. documents.txt List of all documents indexed.
At this point, you should be able to compile the supplied code (above) by runningmake. Once yourassignment is completed, you will be able to execute it by running a command of the form
./a1 -t task -r numresults -d numdocuments list of query terms
- taskis either 1 or 2 representing which coding task to execute,
- numresultsis the number of search results to find and display,
- numdocumentsis the maximum number of documents to consider (optional; the default is alldocuments, or 131,563), and
- list of query termsis a space-separated list of words making up a query for your programto search for.
For example, the command./a1 -t 1 -r 12 algorithms are funwill run your solution to task1 on the entire document collection, printing the top 12 results matching the combination of wordsalgorithms, are, and fun.
Your solution will need to work with the data structures defined inindex.handlist.h. Heres abrief overview (consult the header files for the finer details and a full list of available functions):
- AnIndexis used to represent a multi-term inverted file index. AnIndexhas an array of stringterms. For each term, it also stores aListofDocuments.
- EachListin theIndexis a linked list ofNodes, with eachNodecontaining a pointer to a singleDocument. These lists can easily be traversed using an appropriatewhileorforloop.
- ADocumentcontains an integeridand a floating-pointscore. Theidis an ordinal integeridentifying a particular document in the overall collection (seedocuments.txt, which lists alldocuments by theirid, starting with document 0). Thescoreis a non-negative real numbercalculated based on the prevalence of the corresponding term within this document.
- Documents occur in each of the indexs lists in order of increasingid. Only documents withscoregreater than zero for a term are present in the list for that term.
Additionally, you will require an array-based binary min-heap module to correctly implement each ofthe coding tasks. It is up to you exactly how to design and implement this module. Your solution tothe week 4 lab tasks will be helpful here. Note that sample lab solutions will be released in the weeksafter each lab and that these may be used in your assignment (with proper attribution).
Furthermore, you may create any additional modules necessary to support your solution. If you addadditional modules, it is your responsibility to correctly extend your makefileyou must ensure thatyour solution can be compiled after submission simply by runningmake.
The first two tasks of the assignment require you to implement functions insidequery.c.
Task 1: Array-based accumulation approach (3 marks)
Implement the functionprintarrayresults()defined inquery.c. This function has three inputparameters: index(anIndexpointer) containing data for the query to perform; nresults (aninteger) describing the number of results to output; andndocuments(another integer) containing thetotal number of documents in the collection to be queried.
This function should find the topnresultsmatching document ids and their associated total scores.The total score for a document is the sum of its score from each term in the query. For this function,use the following method to find these top-scoring documents:
- Initialise an array ofndocumentsfloating-point numbers, all set to0.0.
- Iterate through each document list in the index. For each document in the list, add the documentscore to the corresponding position in the array (using the document id as an array index).
- Use the priority queue-based top-k selection algorithm to find the maximumnresultstotalscores in the resulting array, and their associated document ids.
The function should print these results as per the instructions in the Output format section below.
Task 2: Priority queue-based multi-way merge approach (3 marks)
Implement the functionprintmergeresults()defined inquery.c. This function has two inputparameters:index(anIndexpointer) containing data for the query to perform; andnresults(aninteger) describing the number of results to output.
This function should also find the topnresultsmatching document ids and their associated totalscores. For this function, use the following method to find these top-scoring documents:
- Use a priority queue-based multi-way merge algorithm to iterate through thentermsdocumentlists concurrently:- Initialise a priority queue to order the document lists based on the id of their first documents.- Repeatedly retrieve a document from the document list at the front of the priority queue,and rearrange the priority queue so that this list is positioned according to the id of itsnext document (stepping through the list). Stop after processing all documents in all lists.
- While iterating through the lists in this way, accumulate total scores for each document id. Usethe priority queue-based top-k selection algorithm to find the maximumnresultstotal scoresand associated document ids.
The function should print the results as per the instructions in the Output format section below.
The output of both functions should follow the same format: the top-scorin gn results results with non-zero scores, printed one result per line to stdout(e.g. via the printf() function).
The results should be printed in descending order of total score (that is, with the highest total scoringdocument on the first line, and then the second, and so on). In the case of multiple documents withthe same score, such documents may be printed in any order relative to each other. Each documentshould be on its own line. The line should be formatted with the document id printed first as a 6-digitdecimal integer (padded on the left with spaces if necessary), followed by a single space, followed bythe floating-point total score of the document to a precision of 6 digits after the decimal point. Thefollowing format string will be useful:"%6d %.6f\n"
There should never be more thannresultslines of output. If many documents have the same score asthe document with thenresultsth-highest score, any subset of these documents may be printed suchthat there are exactlynresultslines of output. Moreover, there should be no additional informationprinted tostdout(if you must print additional information, you can usestderr). Finally, onlydocuments with non-zero total scores should ever be shown. Therefore, for some queries, there mayactually be fewer thannresultslines of output.
To help you validate the output format of your functions, we provide some samples of correct outputfor three basic queries. Download the sample files from the LMS. The files contain one example ofcorrect output from a selection of commands, described in the following table.
These examples are intended to help you confirm that your output follows the formatting instructions.Note that for some of these inputs there may bemore than one correct outputdue to differentpossible orderings of documents with the same score. These samples represent only one ordering. Notealso that these samples represent only a small subset of the inputs your solution will be tested againstafter submission: matching output for these inputs doesnt guarantee a correct solution. You areexpected to test your functions comprehensively to ensure that they behave correctly for all inputs.
Filename Command helloworld-12.txt ./a1 -t 1 -r 12 hello world algorithms-99.txt ./a1 -t 1 -r 99 algorithms are fun catoutofbag-8.txt ./a1 -t 1 -r 8 the cat is out of the bag
Since the expected output format for task 2 is identical to that of task 1, you can test task 2 byreplacing-t 1with-t 2in each command.
Warning:These files contain Unix-style newlines.They will not display properly in some texteditors, including the default Windows text editor Notepad.
The final task of the assignment requires you to write a report addressing the topics described below.
Task 3: Analysis of algorithms (4 marks)
First, analyse the different approaches from task 1 and task 2 in terms of their asymptotic timecomplexity. Then, explore the effect that the various input parameters (such as the number and lengthof document lists in the query and the number of results requested) have on the time taken by yourprogram to produce its results. Support your investigation with experimental evidence gathered bytiming your program with various input configurations. Include a plot to visualise your experimentalresults in your report. To conclude, give a precise characterisation of the circumstances where weshould prefer one approach over the other.
Your report must be no more than two pages in length. You may use any document editor to createyour report, butyou must export the document as a PDF for submission. You should namethe filereport.pdf.
Via the LMS, submit a single archive file (e.g. .zipor.tar.gz) containingall files required tocompile your solution (includingMakefile) plus your report (as a PDF).When extractedfrom this archive on the School of Engineering student machines (a.k.a. dimefox), your submissionshould compile without any errors simply by running themakecommand.
Please note that when compiling your program we will use the original versions of the files markedDo not change in the Provided files list. Any changes you have made to these files will be lost.This may lead to compile errors. Do not modify these files.
Submissions will close automatically at the deadline. As per the Subject Guide, the late penalty is20% of the available marks for this assignment for each day (or part thereof) overdue. Note thatnetwork and machine problems right before the deadline are not sufficient excuses for a late or missingsubmission. Please see the Subject Guide for more information on late submissions and applying foran extension.
The two coding tasks will be marked as follows:
- 2 of the available marks will be for the correctness of your programs output on a suite oftest inputs. You will lose partial marks for minor discrepancies in output formatting, or minormistakes in your results. You may score no marks if your program crashes on certain inputs,produces completely incorrect answers, or causes compile errors. We will compile and test yourprogram on the School of Engineering student machines (a.k.a.dimefox).
- 1 of the available marks will be for the quality and readability of your code. You will lose part orall of this mark if your program is poorly designed (e.g. with lots of repetition or poor functionaldecomposition), if your program is difficult to follow (e.g. due to missing or unhelpful commentsor unclear variable names), or if your program has memory leaks.
The written report will be marked as follows:
- 1 of the available marks will be for the clarity and accuracy of your analysis of the asymptotictime complexity of the two approaches. You will lose partial marks for minor inaccuracies, ormisuse of terminology or notation.
- 2 of the available marks will be for the quality of your investigation into the observed performanceof both approaches. You will lose marks if your investigation is not supported by experimentalevidence, if your report does not include a visualisation of your results, or if your investigationdoes not fully address the topic.
- 1 of the available marks will be for the clarity and accuracy of your conclusion.
- Additional marks will be deducted if your report is too long (past the two-page limit) or is nota PDF document.