DEGREES OF MSc, MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

TEXT AS DATA M

COMPSCI 5096

Tuesday 3 May 2022

1. Question on Distributional Semantics and Word Embedding. (Total marks: 18)

Consider the problem of finding an outlier word in a list of otherwise similar words. For example, in the following set of words,

linux, windows, solaris, android, java

the word ‘java’ is an outlier (because the other words are names of operating systems).

Given a list of such words, your task is to automatically infer the outlier word. With respect to this task, answer the following questions.

(a) One approach to the word intrusion problem is to represent words as vectors and then use the relative distances or similarities between the vectors to find the outlier. Assume that you know (as the output of some process) the vector representation of each word w.

Task: Describe the pseudo-code of this algorithm. Clearly state your assumptions and introduce your notations in the algorithm. [5]
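As an illustration of the kind of procedure part (a) asks about, the following is a minimal Python sketch of one possible approach, not a model answer. It assumes that the outlier is the word with the lowest mean cosine similarity to the other words in the list, that the words are distinct, and that the list has at least two entries. The function names and the two-dimensional toy vectors are hypothetical and invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def find_outlier(words, vectors):
    """Return the word with the lowest mean similarity to all other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)
    return min(words, key=mean_similarity)

# Hypothetical toy vectors (assumption: not taken from any trained model).
toy = {
    "linux": [0.9, 0.1],
    "windows": [0.8, 0.2],
    "solaris": [0.85, 0.15],
    "android": [0.7, 0.3],
    "java": [0.1, 0.9],
}
print(find_outlier(list(toy), toy))
```

Other choices, such as distance to the centroid of the remaining words, would follow the same pattern.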

One solution to the word outlier detection problem that does not require learning any parameters (via gradient descent) is the distributional semantics vector approach, in which each word is represented by a bag-of-words vector of its contexts. Now, answer the following questions.

(b) The window size, k, used to define the contexts of each word is an important parameter of this approach. What happens if k is too large or too small? [2]

(c) Describe the pseudo-code of this approach that requires only a single pass through a collection (clearly describe the data structures for an efficient solution). [5]

(d) Discuss (with an example) why the vectors of function words (frequent words, such as ‘of’, ‘the’ etc.) obtained with this approach are not reliable. [2]

Now consider word2vec, which is a noise contrastive estimation based method that learns the vectors for each word. With respect to word2vec, answer the following questions.

(e) What is the role of negative samples in the objective function of word2vec? [2]

(f) Comment on word2vec’s output for a word with multiple meanings, such as jaguar, bank or python. What would you expect to find as the nearest neighbors of such polysemous words? What is the problem with using the vectors of such words for another task, such as text classification? [2]

2. Question on word frequencies and language model (Total marks: 15)

An alien probe crashes to Earth containing a short passage of alien text. The alien text uses a five-letter alphabet: [a, b, c, d, e] with no punctuation or spaces. Below is a short section of the text:

abcaedabccbaedabceda

(a) Using character n-grams, write out all of the trigrams that appear more than once with their frequency for the sample text above. [3]

Example Answer:

trigram	frequency
abc	3
eda	3
aed	2
dab	2
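The counts in the example answer can be reproduced with a sliding window over the sample text. The following is a small Python sketch; the function name repeated_ngrams is hypothetical and chosen only for this illustration.

```python
from collections import Counter

def repeated_ngrams(text, n=3):
    """Count character n-grams and keep those that occur more than once."""
    # A text of length L has L - n + 1 overlapping windows of width n.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram: count for gram, count in counts.items() if count > 1}

sample = "abcaedabccbaedabceda"
for gram, count in sorted(repeated_ngrams(sample).items()):
    print(gram, count)
```

The 20-character sample yields 18 trigram windows, of which only four distinct trigrams repeat.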

(b) Provide the theoretical maximum number of character n-grams for the alien probe full text for n = 1, 2, 3, 4 and 5. The full text found in the probe is 593 characters long. [3]

Example Answer:

n	max n-grams
1	5
2	25
3	125
4	590
5	589
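The values in the example answer are consistent with taking the smaller of two bounds: the number of possible strings of length n over the alphabet (5^n), and the number of windows of width n that fit in the text (593 - n + 1). The Python sketch below assumes that reading of "theoretical maximum"; the function name is hypothetical.

```python
def max_distinct_ngrams(alphabet_size, text_length, n):
    """Upper bound on distinct character n-grams in a text of given length."""
    possible = alphabet_size ** n            # all strings of length n
    windows = max(text_length - n + 1, 0)    # positions where an n-gram can start
    return min(possible, windows)

for n in range(1, 6):
    print(n, max_distinct_ngrams(5, 593, n))
```

For n = 4 and n = 5 the text length is the binding limit, since 5^4 = 625 and 5^5 = 3125 both exceed the number of available windows.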

(c) A linguist makes a breakthrough in understanding the tokens used in the alien text. She provides two possible ways to tokenize the sample text.

(i) In plain English, explain a single rule that could reproduce this first tokenization:

a bca eda bccba eda bceda

[1]

Example Answer: Start a new token whenever the previous character is ‘a’.
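The rule in the example answer can be checked mechanically. The sketch below is one straightforward way to apply it in Python; the function name split_after and its boundary parameter are hypothetical.

```python
def split_after(text, boundary="a"):
    """Split text so that a new token starts after every boundary character."""
    tokens, current = [], ""
    for ch in text:
        current += ch
        if ch == boundary:
            # The boundary character closes the current token.
            tokens.append(current)
            current = ""
    if current:
        # Keep any trailing characters that were not closed by a boundary.
        tokens.append(current)
    return tokens

print(" ".join(split_after("abcaedabccbaedabceda")))
```

Applied to the sample text, this reproduces the first tokenization shown above.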

(ii) In plain English, explain a single rule that could reproduce this second tokenization:

ab caedab ccbaedab ceda

[1]

(d) More alien probes crash land in different parts of the world. Scientists want to measure the similarity between the texts found in each probe. Here are two tokenized probe text fragments.

Probe Text A:

eda bceda eda bcda bce

Probe Text B:

eda bcba eda bceda eda bce

(i) Calculate the Sørensen–Dice Coefficient and Jaccard Similarity between the two probe texts. Show your work.

[4]
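For reference, both measures can be written in a few lines of Python. This sketch assumes the set-based definitions, in which each probe text is reduced to its set of distinct tokens; a multiset (bag) reading would count repeated tokens and can give different values. The function and variable names are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A intersect B| / |A union B| over distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)

def dice(tokens_a, tokens_b):
    """Sorensen-Dice coefficient: 2 |A intersect B| / (|A| + |B|) over distinct tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return 2 * len(a & b) / (len(a) + len(b))

probe_a = "eda bceda eda bcda bce".split()
probe_b = "eda bcba eda bceda eda bce".split()
print("Jaccard:", jaccard(probe_a, probe_b))
print("Dice:", dice(probe_a, probe_b))
```

Under the set-based reading, each text has four distinct tokens, three of which are shared, out of five distinct tokens overall.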

(ii) Calculate the similarity between the third probe text (Probe Text C below) and the two prior probe texts using the Sørensen–Dice Coefficient. Using these results, show that the Sørensen–Dice Coefficient is a semi-metric as it breaks the triangle inequality.

Probe Text C:

beda bceda bceca ebeda bceda

[3]

3. This question is about Natural Language Processing (Total marks: 18)

You just landed an awesome job at the Intellectual Property Office. As your first project, you have been tasked with automatically classifying submitted patent applications into one of the eight broad International Patent Classification sections, as shown here:

(a) You start by applying a typical pre-processing pipeline that consists of case normalisation and a stemmer. Within the context of the patent classification application, clearly justify these two pre-processing stages and provide an example that shows why each could lead to improved classification performance. [4]

(b) You recall from Text as Data that NLP features, such as parts of speech, are often helpful for classification tasks. Within the context of patent classification, provide and justify a specific example where considering a word along with its part-of-speech may help distinguish between two of the above sections. [3]

(c) Armed with the above intuition, you select an off-the-shelf part-of-speech tagger (based on a Hidden Markov Model) that reports 97% accuracy and apply it to some sample patents to ensure that it produces reasonable part-of-speech tags. To your dismay, you find that it frequently makes mistakes. On closer inspection, you observe that the errors are usually on specialised, domain-specific language in the patents. Explain why this problem arises and what you could do to fix it. [4]

(d) You want to identify whether two systems (called System A and System B) are better than a baseline method at the classification task. The following table shows intrinsic evaluation metrics obtained over the classification on the train and test sets:

	Train Set		Test Set
	Precision	Recall	Precision	Recall
Baseline	0.61	0.42	0.56	0.50
System A	0.62*	0.43*	0.58*	0.51
System B	0.67*	0.48*	0.51	0.42

* statistical significance w.r.t. baseline (t-test with p-value < 0.05)

Discuss the effectiveness (e.g., generalizability, overfit/underfit, performance on training/test sets etc.) of the models A and B in comparison to the ‘Baseline’ method. [3]

(e) Meanwhile, another team has been busy building a BERT-based text classifier, and they have found that it also works well on the task. You decide to join forces with them. Without using an ensemble approach, how might you go about including explicit parts of speech in their BERT-based model? How does the technique differ from the approach you took in your linear bag-of-features model? [4]
