
DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

INFORMATION RETRIEVAL M

COMPSCI 5011

Friday 29 April 2022

1.

(a)

The following documents have been processed by an IR system where stemming is not applied:

DocID   Text
Doc1    france is world champion 1998 france won
Doc2    croatia and france played each other in the semifinal
Doc3    croatia was in the semifinal 1998
Doc4    croatia won the other semifinal in russia 2018

(i)        Assume that the following terms are stopwords: and, in, is, the, was.

Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N/df_i).      [5]
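For illustration (not part of the question itself), a minimal Python sketch of how such an inverted file could be built from the four documents above; the dictionary keeps each term's document frequency df, and each posting stores a (docID, tf) pair:

```python
from collections import defaultdict
from math import log2

STOPWORDS = {"and", "in", "is", "the", "was"}
DOCS = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}

# postings: term -> list of (docID, tf); df is simply len(postings[term])
postings = defaultdict(list)
for doc_id, text in DOCS.items():
    counts = defaultdict(int)
    for token in text.split():
        if token not in STOPWORDS:
            counts[token] += 1
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))

N = len(DOCS)
for term in sorted(postings):
    df = len(postings[term])
    print(f"{term}: df={df}, idf=log2({N}/{df})={log2(N / df):.4f}, postings={postings[term]}")
```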

(ii)       Compute the term weights of the two terms “champion” and “1998” in Doc1. Show your working.         [2]

(iii)      Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query:

1998 croatia

Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.           [3]
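Continuing the sketch above (reusing its postings and N), a hedged illustration of a simple best-match scorer that sums w_ij = tf_ij * log2(N/df_i) over the query terms and ranks documents by the total:

```python
from collections import defaultdict
from math import log2

def best_match(query, postings, N):
    """Rank documents by the sum of tf*idf weights over the query terms."""
    scores = defaultdict(float)
    for term in query.split():
        plist = postings.get(term, [])
        if not plist:
            continue  # a term absent from the collection contributes nothing
        idf = log2(N / len(plist))
        for doc_id, tf in plist:
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(best_match("1998 croatia", postings, N))
```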

(b)

(i)        In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine.         [2]

(ii)       Suggest a solution to alleviate the problem, and show through examples how it might work. Explain, with examples, how modern term weighting models in IR control raw term frequency counts.           [3]
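As one common remedy (offered as an illustration, not as the expected answer), term frequency can be damped or saturated so that additional occurrences of a term yield diminishing returns. The sketch below contrasts logarithmic damping with a BM25-style saturation component; k1 is a free parameter:

```python
from math import log2

def log_tf(tf):
    """Logarithmic damping: 1 + log2(tf) for tf > 0, else 0."""
    return 1 + log2(tf) if tf > 0 else 0.0

def bm25_tf(tf, k1=1.2):
    """BM25-style saturation: bounded above by k1 + 1, so a term repeated
    1000 times scores only marginally above one repeated 100 times."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 10, 100, 1000):
    print(f"tf={tf:5d}  log_tf={log_tf(tf):7.3f}  bm25_tf={bm25_tf(tf):5.3f}")
```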

(c)    Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach.          [5]
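A hedged sketch of the final comparison step: run both rankers over the same test collection, compute a per-query effectiveness measure for each, and apply a paired significance test. The numbers below are illustrative only, and the use of scipy is an assumption about available tooling:

```python
from scipy import stats

# Per-query average precision for the same query set under each ranker
# (illustrative values, not real measurements).
old_ap = [0.31, 0.45, 0.22, 0.58, 0.40]
new_ap = [0.35, 0.47, 0.30, 0.55, 0.48]

t_stat, p_value = stats.ttest_rel(new_ap, old_ap)  # paired t-test
print("mean AP old:", sum(old_ap) / len(old_ap))
print("mean AP new:", sum(new_ap) / len(new_ap))
print("p-value:", p_value)  # small p => improvement unlikely to be chance
```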

2.

(a)     Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf’s law r * p(w_r|C) = 0.1, where r = 1, 2, …, n is the rank of a word by decreasing order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C.

Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers.           [4]
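As a worked instance of the stated relation (plain arithmetic, shown for clarity):

```latex
r \cdot p(w_r \mid C) = 0.1
\;\Rightarrow\;
p(w_r \mid C) = \frac{0.1}{r},
\qquad
p(w_1 \mid C) = \frac{0.1}{1} = 0.1,
\qquad
p(w_2 \mid C) = \frac{0.1}{2} = 0.05
```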

(b)   Consider the query “michael jackson music” and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:

        indiana   jackson   life   michael   music   pop   really
D1      0         4         1      3         0       6     1
D2      4         0         3      4         1       0     2
D3      0         4         0      5         4       4     0

Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.

(i)   Show the original query vector, clearly stating the dimensions of the vector.         [2]

(ii)  Use Rocchio’s relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for “michael jackson music”. Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.          [4]
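For illustration, a minimal sketch of Rocchio's update q' = α·q + β·centroid(relevant) − γ·centroid(non-relevant) over the TF vectors in the table above, with negative weights clipped to 0 as the question permits:

```python
TERMS = ["indiana", "jackson", "life", "michael", "music", "pop", "really"]
D1 = [0, 4, 1, 3, 0, 6, 1]
D2 = [4, 0, 3, 4, 1, 0, 2]
D3 = [0, 4, 0, 5, 4, 4, 0]
q  = [0, 1, 0, 1, 1, 0, 0]  # "michael jackson music" as a raw-TF vector

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Revised query: alpha*q + beta*mean(rel) - gamma*mean(nonrel), clipped at 0."""
    revised = []
    for i, qi in enumerate(q):
        pos = sum(d[i] for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d[i] for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        revised.append(max(0.0, alpha * qi + beta * pos - gamma * neg))
    return revised

# D3 judged relevant; D1 and D2 judged non-relevant.
print(dict(zip(TERMS, rocchio(q, [D3], [D1, D2]))))
```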

(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, ..., w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:

q = w1 w2

Word   p(w|C)   ct(w, doc1)   ct(w, doc2)
w1     0.8      2             7
w2     0.1      3             1
w3     0.025    2             1
w4     0.025    2             1
w5     0.025    1             0
w6     0.025    0             0
SUM    1.0      10            10

(i)        Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.         [3]

(ii)       Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2) (do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.           [4]
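For illustration, a sketch covering both sub-parts: the unsmoothed query likelihood p(q|doc) = Π_w p(w|doc) with p(w|doc) = ct(w, doc)/|doc|, and the Jelinek-Mercer mixture (1 − λ)·p(w|doc) + λ·p(w|C). The convention that λ weights the collection model is an assumption; check it against the course notes:

```python
P_C  = {"w1": 0.8, "w2": 0.1, "w3": 0.025, "w4": 0.025, "w5": 0.025, "w6": 0.025}
DOC1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
DOC2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}
QUERY = ["w1", "w2"]

def p_ml(w, doc):
    """Maximum-likelihood estimate ct(w, doc) / |doc|."""
    return doc[w] / sum(doc.values())

def likelihood(query, doc, lam=0.0):
    """p(q|doc) as a product over query terms; lam=0 is the unsmoothed model,
    lam>0 mixes in the collection model (Jelinek-Mercer)."""
    prob = 1.0
    for w in query:
        prob *= (1 - lam) * p_ml(w, doc) + lam * P_C[w]
    return prob

for lam in (0.0, 0.1):
    print(f"lambda={lam}: p(q|doc1)={likelihood(QUERY, DOC1, lam):.4f}, "
          f"p(q|doc2)={likelihood(QUERY, DOC2, lam):.4f}")
```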

(iii)      Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.         [3]

3.

(a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers.           [4]

(b) Explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer using an example from a suitable search scenario.         [4]

(c)  Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead. Explain the key limitation of this strategy. Describe in sufficient detail an approach that you might use to overcome this problem.    [5]

(d)  Consider a query q, which returns all webpages shown in the hyperlink structure below.

(i)        Write the adjacency matrix A for the above graph.       [1]

(ii)       Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph after a single complete iteration of the algorithm. Show your workings.   [3]
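The hyperlink graph itself is not reproduced here, so the sketch below runs one complete HITS iteration on a hypothetical 3-page adjacency matrix (an assumption, not the exam's graph), where A[i][j] = 1 iff page i links to page j. Authority scores are updated from in-links, hub scores from out-links, then both are normalised (the normalisation convention varies by presentation):

```python
import numpy as np

# Hypothetical graph, NOT the one in the question: A[i][j] = 1 iff i -> j.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

hubs = np.ones(A.shape[0])

# One complete iteration: a = A^T h, then h = A a, then normalise both.
auths = A.T @ hubs
hubs = A @ auths
auths /= auths.sum()  # L1 normalisation chosen here as an assumption
hubs /= hubs.sum()
print("authorities:", auths)
print("hubs:", hubs)
```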

(iii)      Describe in sufficient detail an alternative approach to compute the hub and authority scores for the above graph. You need to show all required steps to generate the scores, but you do not need to actually compute the final scores.      [3]
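One such alternative (named plainly: the closed-form eigenvector view of HITS) computes the authority vector as the principal eigenvector of A^T A and the hub vector as the principal eigenvector of A A^T. A sketch on the same hypothetical matrix, using numpy's eigendecomposition:

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

def principal_eigenvector(M):
    """Eigenvector for the largest eigenvalue, made non-negative and L1-normalised."""
    vals, vecs = np.linalg.eig(M)
    v = np.abs(vecs[:, np.argmax(vals.real)].real)
    return v / v.sum()

authorities = principal_eigenvector(A.T @ A)  # authority scores
hubs = principal_eigenvector(A @ A.T)         # hub scores
print("authorities:", authorities)
print("hubs:", hubs)
```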




