Implement a simple IR tool (Python assignment)

Assignment Objectives

Implement a simple IR tool that does the following:

· Pre-processes text
- Tokenisation
- Stopword removal using this list of words (right click and save as). You MUST use this list.
- Porter stemming. You can use packages for this part, such as Snowball or NLTK.
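As an illustration, the preprocessing steps above might be sketched as follows. The stop list and the `toy_stem` stand-in here are placeholders only: the real system must use the provided stop-word list and a real Porter stemmer from a package such as NLTK (`nltk.stem.PorterStemmer`). Lowercasing is one simple tokenisation choice, not a requirement.

```python
import re

# Hypothetical tiny stop list for illustration; the assignment's provided
# list MUST be used in the real system.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def tokenise(text):
    """Split on runs of non-letter characters and lowercase (one simple choice)."""
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

def preprocess(text, stem):
    """Tokenise, remove stop words, then stem.
    `stem` is any callable, e.g. nltk.stem.PorterStemmer().stem in the real system."""
    return [stem(t) for t in tokenise(text) if t not in STOP_WORDS]

# Stand-in "stemmer" so this sketch runs without NLTK installed;
# replace with a real Porter stemmer.
toy_stem = lambda t: t[:-1] if t.endswith("s") else t

print(preprocess("The cats sat of the mats", toy_stem))
```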

· Creates a positional inverted index
Your index can have whatever structure you like, and can be stored in any format you like, but you will need to output it to a text file using the format specified below.
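Since any structure is allowed, here is one possible in-memory layout as a sketch: a mapping from each term to a dictionary of document IDs and position lists. Document frequency then falls out as the number of keys in a term's postings.

```python
from collections import defaultdict

# One possible layout: term -> {doc_id -> [positions]}
# (positions counted after stop-word removal, starting from the headline).
index = defaultdict(dict)

def add_document(index, doc_id, terms):
    """Record the 1-based position of every term occurrence in doc_id."""
    for pos, term in enumerate(terms, start=1):
        index[term].setdefault(doc_id, []).append(pos)

add_document(index, 23, ["newspaper", "report", "newspaper"])
add_document(index, 93, ["report"])

print(dict(index["newspaper"]))   # positions of "newspaper" in each doc
print(len(index["report"]))       # document frequency of "report"
```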

· Uses your positional inverted index to perform:
- Boolean search
- Phrase search
- Proximity search
- Ranked IR based on TFIDF
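For instance, a two-term phrase search can be answered directly from a positional index of the assumed shape term → {docID: [positions]} by checking for adjacent positions (longer phrases chain the same check):

```python
def phrase_search(index, term1, term2):
    """Documents where term2 occurs at the position directly after term1."""
    p1, p2 = index.get(term1, {}), index.get(term2, {})
    hits = []
    for doc_id in p1.keys() & p2.keys():       # docs containing both terms
        if any(pos + 1 in set(p2[doc_id]) for pos in p1[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Toy index: "middle east" is a phrase only in doc 1.
index = {"middle": {1: [4], 2: [9]}, "east": {1: [5], 2: [2]}}
print(phrase_search(index, "middle", "east"))  # [1]
```

Boolean AND/OR/NOT then reduce to set intersection, union, and difference over the docID sets of each term or phrase.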

Additional details

· Use the collections from Lab 2 for testing your system. You can download from here. Focus on the trec collection which contains 1000 sample news articles.
Note 1: use the file in XML format. You need to make your code able to parse it. It is worth noting that the format is standard TREC, which might not be parsed directly by XML parsers. You might need to add a header and footer to the file to make it parsable by existing tools. You are allowed to do so if needed (or feel free to code your own parser).
Note 2: For the trec collection, please include the headline of the article in the index. That is, the document text should consist of the headline and text fields. For term positions, start counting from the headline and continue through the text.
Note 3: Term position is counted AFTER stop words removal.
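A minimal parsing sketch following Note 1's header/footer trick, assuming the usual TREC tag names (`DOC`, `DOCNO`, `HEADLINE`, `TEXT`) — check these against the actual file:

```python
import xml.etree.ElementTree as ET

def parse_trec(raw_text):
    """Wrap raw TREC text in a single root element so ElementTree accepts it,
    then extract docno, headline, and body text per document."""
    root = ET.fromstring("<ROOT>" + raw_text + "</ROOT>")
    docs = {}
    for doc in root.iter("DOC"):
        docno = doc.findtext("DOCNO").strip()
        headline = (doc.findtext("HEADLINE") or "").strip()
        text = (doc.findtext("TEXT") or "").strip()
        # Headline comes first, then the body, as Note 2 requires.
        docs[docno] = headline + " " + text
    return docs

sample = ("<DOC><DOCNO>1</DOCNO><HEADLINE>Rain</HEADLINE>"
          "<TEXT>Heavy rain fell.</TEXT></DOC>")
print(parse_trec(sample))
```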

· The test collection will be released 4 days before the deadline. It will have the exact same format as trec.sample.xml. The size of this collection will be around 5000 documents (5 times the current version). If your system runs fine on the current collection, it should be straightforward to run it smoothly on the new collection.

· For tokenisation, you can simply split on every non-letter character, or you can have special treatment for some cases (such as - or '). Please explain in your report your selections and why you did so.

· For stopping, please use the stop words list mentioned above.

· Again, for stemming, you do NOT need to write your own stemmer. Use any available package for the Porter stemmer. Write down in your report which one you used. You need to use the Porter stemmer, not anything else.

· For the TFIDF search function, please use the formula from lecture 7, slide 15 with title "TFIDF term weighting". Note: this is different from other implementations of TFIDF in some toolkits, so please implement it as shown in lecture.
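The exact slide is not reproduced here, but a common lecture formulation of this weighting is w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t)), with a document's score for a query being the sum of weights over the query terms it contains. The sketch below assumes that form and the term → {docID: [positions]} index layout — verify both against lecture 7, slide 15 before relying on it:

```python
import math

def tfidf_weight(tf, df, N):
    """Assumed lecture form: w(t,d) = (1 + log10 tf) * log10(N / df)."""
    return (1 + math.log10(tf)) * math.log10(N / df)

def score(query_terms, doc_id, index, N):
    """Sum the weights of the query terms occurring in the document."""
    s = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        if doc_id in postings:
            tf = len(postings[doc_id])   # occurrences of term in this doc
            df = len(postings)           # number of docs containing term
            s += tfidf_weight(tf, df, N)
    return round(s, 4)                   # spec asks for four decimal places

index = {"rain": {1: [2, 7], 2: [5]}, "storm": {1: [9]}}
print(score(["rain", "storm"], 1, index, N=10))  # 1.9094
```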

· Please use the queries in Lab 2 and Lab 3 for testing your code. A new list of queries will be released with the collection 4 days before the deadline. These 4 days should be more than enough to run the new queries and get the results.

· Notes about the expected queries:
- Queries are expected to be very similar to those in the labs.
- Two query files will be provided, one for Boolean search and the other for ranked retrieval.
- Boolean queries will not contain more than one "AND" or "OR" operator at a time, but a mix of a phrase query and one logical "AND" or "OR" operator may appear (like query q9 in Lab 2). "AND" or "OR" can also be mixed with NOT, e.g. Word1 AND NOT Word2.
- 10 queries will be provided in a file named queries.boolean.txt in the following format:
1 term11 AND term12
2 "term21 term22"
- For proximity search queries, it will have the format "#15(term1,term2)", which means find documents that have both term1 and term2, and distance between term1 and term2 is less than or equal to 15 (after stop words removal).
- 10 free text queries for ranked retrieval will be provided in a file named queries.ranked.txt in the following format:
1 this is a sample query
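The proximity check described above can be sketched against a positional index of the assumed shape term → {docID: [positions]}, testing whether any pair of positions is within the given distance:

```python
def proximity_search(index, term1, term2, max_dist):
    """#max_dist(term1,term2): docs containing both terms where some pair of
    occurrences satisfies |pos1 - pos2| <= max_dist (positions counted
    after stop-word removal)."""
    p1, p2 = index.get(term1, {}), index.get(term2, {})
    hits = []
    for doc_id in p1.keys() & p2.keys():
        if any(abs(a - b) <= max_dist
               for a in p1[doc_id] for b in p2[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Toy index: only doc 7 has the terms within 15 positions (|110 - 120| = 10).
index = {"income": {7: [3, 110], 9: [1]}, "taxes": {7: [120], 8: [4]}}
print(proximity_search(index, "income", "taxes", 15))  # [7]
```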

Submissions and Formats

You need to submit the following 5 files:

1. index.txt: a formatted version of your positional inverted index. Each line of this file describes a token, a document it appears in, and its position within that document:

term:df

docID: pos1, pos2, ....

docID: pos1, pos2, ....

where df is the document frequency of the term. Example:

newspaper:2

23: 2,15

93: 234

.......

Here, the token "newspaper" appeared in 2 documents, where it appeared in document 23 twice at positions 2 and 15, and document 93 once in position 234.

Each document ID (docID) should be on a separate line that starts with a tab ("\t" in Python). Positions of terms should be separated by commas. Lines should end with a line break ("\n" in Python).
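A sketch of writing this format, assuming the index is held in memory as term → {docID: [positions]} (sorting terms and docIDs is a choice for readability, not a stated requirement):

```python
def write_index(index, path="index.txt"):
    """Emit term:df, then one tab-indented line per docID with its
    comma-separated positions, matching the required format."""
    with open(path, "w") as f:
        for term in sorted(index):
            postings = index[term]
            f.write(f"{term}:{len(postings)}\n")      # df = number of docs
            for doc_id in sorted(postings):
                positions = ",".join(map(str, postings[doc_id]))
                f.write(f"\t{doc_id}: {positions}\n")

index = {"newspaper": {23: [2, 15], 93: [234]}}
write_index(index, "index.txt")
print(open("index.txt").read())
```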

2. results.boolean.txt: contains results of the queries.boolean.txt in the following query_number,document_id format:

1,710

1,213

2,103

This means that for query "1" you retrieved two documents - numbers "710" and "213".

For query "2" you retrieved one document - "103".

The two values on each line should be separated by a comma. Lines should end with a line break ("\n" in Python).

Your boolean results file should list every matching document, per query.

3. results.ranked.txt: contains results of the queries.ranked.txt in the following query_number,document_id,score format:

1,710,0.6234

1,213,0.3678

2,103,0.9761

This means that for query "1" you retrieved two documents - document number "710" (with score 0.6234) and document number "213" (with score 0.3678).

For query "2" you retrieved one document, number "103", with score 0.9761.

Scores should be rounded to four decimal places.

The three values on each line should be separated by a comma. Lines should end with a line break ("\n" in Python).

Print results for queries in order of their score - that is, all results for query "1" sorted by score in descending order, then results for queries 2, 3, ... 10.

Your ranked results file should list only up to the top 150 results per query.
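A sketch of writing this file, assuming per-query results are held as (docID, score) pairs; it sorts each query's results by descending score, truncates to 150, and rounds to four decimal places:

```python
def write_ranked(results, path="results.ranked.txt"):
    """results: {query_number: [(doc_id, score), ...]}.
    Writes query_number,doc_id,score lines, top 150 per query."""
    with open(path, "w") as f:
        for qnum in sorted(results):
            ranked = sorted(results[qnum], key=lambda pair: -pair[1])[:150]
            for doc_id, score in ranked:
                f.write(f"{qnum},{doc_id},{score:.4f}\n")

write_ranked({1: [(213, 0.3678), (710, 0.6234)],
              2: [(103, 0.9761)]}, "results.ranked.txt")
print(open("results.ranked.txt").read())
```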

4. code.py: a single file containing the code used to generate index.txt, results.boolean.txt and results.ranked.txt.
If you will use something other than Python, let us know before submission!
Please try to make your code as readable as possible: commented code is highly recommended.
Please DO NOT submit the collection file with the code!

5. report.pdf: Your report on the work.
This is 2 to 4 pages and should include:
- details on the methods you used for tokenisation and stemming
- details on how you implemented the inverted index
- details on how you implemented the four search functions
- a brief commentary on the system as a whole and what you learned from implementing it
- what challenges you faced when implementing it
- any ideas on how to improve and scale your implementation
Your report SHOULD NOT contain any code or screenshots of code.
It should be a high-level description of how your code works and the decisions you made while implementing your information retrieval system.

Submit ONLY these five files! Do not submit any of the assignment files.

Submission should be done over Learn

Challenge (for extra marks)

If you are aiming for extra marks in this CW (and potentially a full mark), you can attempt this challenge.
Rerun the same queries for Boolean and Ranked IR, but without applying stopping this time in indexing or queries. Comment in your report on the changes you noticed in the retrieved results. There is no need to submit the results files; we just need your comments on the observed results and how they changed.
Notes:

· Only attempt this challenge after you have implemented the whole system. The main marks are for the results above, not the challenge.

· You don't need to wait for test data to run this experiment, since we don't need the results files here. We only need your comment about your observed change in results in the report, which you can test on the lab data.

· Please comment in the report on the changes you noticed in: 1) Retrieved results, 2) Processing time for indexing and queries, 3) Size of the index. Feel free to add any other comments on this run.

· We expect only a small number of students to attempt this challenge for extra marks, so don't feel bad if you don't find the time to do it.

Marking

The assignment is worth 10% of your total course mark and will be scored out of 100 points as follows:

· 50 points for the outputs of your system, including index.txt (10 points), results.boolean.txt (20 points), and results.ranked.txt (20 points)
- Note: Your output does not have to exactly match the reference outputs, since different configurations will lead to different outputs. However, it still has to align with them, so it must not list totally non-relevant documents!

· 10 points for having your code aligned with what is mentioned in the report.

· 40 points for a clear and detailed report (including 20 points for those who did the challenge).

· -20 points as a penalty if the format of the files is not as described above.

· -20 points as a penalty if your code is not submitted as a single code file.

Files

cw1collection.zip


