
INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M)

COMPSCI5089

Wednesday 15th of December 2021

1. Computational linear algebra and optimisation

(a) Given a collection of N documents $\mathcal{D} = \{D_1, \ldots, D_N\}$, your task is to implement functionality that provides a list of suggested ‘more like this’ documents. In this problem context, answer the following questions.

(i)  Explain how you would represent each document $D \in \mathcal{D}$ as a (real-valued) vector $\mathbf{d}$. What is the dimension of each vector?        [2]

(ii)  What does the $L_0$ norm of a document vector indicate (in plain English), given your definition of the document vectors in the previous question?        [1]

(iii)  How would you define the $L_p$ distance between two document vectors $\mathbf{d}$ and $\mathbf{d}'$?        [2]

(iv)  What distance or similarity measure would you use for finding the set of ‘more like this’ documents for a current (given) document vector $\mathbf{d}$, and why?        [2]
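As a rough illustration of the quantities named in part (a), the sketch below uses a term-frequency (bag-of-words) representation, which is only one of several possible choices (TF-IDF or learned embeddings would also fit the question). The function names, the whitespace tokeniser, and the three toy documents are assumptions made for this example and do not come from the exam.

```python
import math
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all distinct lowercase tokens in the collection."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def to_vector(document, vocabulary):
    """Term-frequency vector with one component per vocabulary term."""
    counts = Counter(document.lower().split())
    return [float(counts[term]) for term in vocabulary]

def l0_norm(d):
    """Number of non-zero components (distinct vocabulary terms present)."""
    return sum(1 for value in d if value != 0)

def lp_distance(d1, d2, p):
    """Minkowski (L_p) distance between two vectors, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(d1, d2)) ** (1.0 / p)

def cosine_similarity(d1, d2):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def more_like_this(query_index, vectors, k=1):
    """Indices of the k documents most similar (by cosine) to the query."""
    scores = [(cosine_similarity(vectors[query_index], v), i)
              for i, v in enumerate(vectors) if i != query_index]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# Hypothetical toy collection, for illustration only.
docs = ["the cat sat on the mat", "the cat sat", "stock markets fell sharply"]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
suggestions = more_like_this(0, vectors)
```

With this representation the vector dimension equals the vocabulary size, and cosine similarity is insensitive to document length, which is one reason it is often used for ranking text.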

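(b) The probability density function of an n-dimensional Gaussian is given by

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{T}\,\Sigma^{-1}\,(\mathbf{x}-\boldsymbol{\mu})\right),$$

where $\boldsymbol{\mu} \in \mathbb{R}^n$ is the mean vector and $\Sigma \in \mathbb{R}^{n \times n}$ is a square and invertible matrix, called the covariance matrix. Consider the particular case of n = 2. Answer the following questions.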
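For the n = 2 case, the standard multivariate Gaussian density, exp(-½ (x − μ)ᵀ Σ⁻¹ (x − μ)) divided by 2π·sqrt(det Σ), can be evaluated directly with a closed-form 2×2 inverse. The following is a minimal sketch under the assumption that Σ is symmetric and positive definite; the function name and the tuple-based layout are illustrative choices, not part of the exam.

```python
import math

def gaussian_pdf_2d(x, mu, sigma):
    """Density of a 2-D Gaussian at point x.

    x, mu: pairs (x1, x2); sigma: ((a, b), (c, d)), assumed symmetric
    and positive definite.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mu[0], x[1] - mu[1])
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

# Example: standard Gaussian (zero mean, identity covariance) at its mean.
peak = gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

Points with equal values of the quadratic form lie on the same contour, so evaluating this function on a grid is one way to produce the contour plots.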

(i)  Plot the contours of the following Gaussians. For each contour plot, show the conditional distributions along the two axes.

[2]

(ii)  Which of the above 4 Gaussian distributions can be reduced to a one-dimensional Gaussian by applying PCA to the covariance matrix without too much loss of information? Note that you do not need to explicitly compute the eigenvalues; you should instead derive your answer from a visual interpretation of the contour plots. Clearly explain your answer.        [2]

(c) With respect to linear regression, answer the following questions.

(i)  Derive the expression for stochastic gradient descent for linear regression with the squared loss function. Clearly introduce your notation for the input/output instances and the parameter vector.        [4]

(ii)  Explain how linear regression can be extended to polynomial (higher-order) regression. What is the problem with using high-degree polynomials for regression? How can that problem be alleviated?        [3]

(iii)  A common practice in stochastic gradient descent is to use a variable learning rate α for the parameter updates

$$\theta_j^{(t+1)} = \theta_j^{(t)} - \alpha^{(t)}\,\frac{\partial L}{\partial \theta_j},$$

where $\theta_j^{(t)}$ denotes the $j$-th component of the parameter vector $\theta$ at iteration $t$, $\alpha^{(t)}$ denotes the value of the learning rate at iteration $t$, and $L$ is the loss. Which of the following alternatives for the learning rate update would you prefer ($\alpha$ is a constant), and why?

[2]
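To make the setting of part (c) concrete, here is a minimal sketch of stochastic gradient descent for one-dimensional linear regression with the loss ½(θᵀx − y)² per example. The decaying schedule α(t) = α₀ / (1 + decay·t), the hyperparameter values, and the noise-free toy data are assumptions chosen for illustration; the schedule is not claimed to be one of the alternatives offered in the question.

```python
import random

def sgd_linear_regression(xs, ys, alpha0=0.1, decay=0.001, epochs=500, seed=0):
    """Fit y ~ theta[0] + theta[1] * x by SGD on the squared loss.

    The learning rate at step t is alpha0 / (1 + decay * t), an
    illustrative (assumed) decaying schedule.
    """
    rng = random.Random(seed)
    theta = [0.0, 0.0]              # [intercept, slope]
    indices = list(range(len(xs)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(indices)        # visit the examples in random order
        for i in indices:
            features = (1.0, xs[i])
            residual = theta[0] * features[0] + theta[1] * features[1] - ys[i]
            alpha_t = alpha0 / (1.0 + decay * t)
            # Gradient of 0.5 * residual**2 w.r.t. theta[j] is residual * features[j].
            for j in range(2):
                theta[j] -= alpha_t * residual * features[j]
            t += 1
    return theta

# Noise-free toy data generated from y = 1 + 2x (hypothetical example).
xs = [i / 19 for i in range(20)]
ys = [1.0 + 2.0 * x for x in xs]
theta = sgd_linear_regression(xs, ys)
```

Each update uses a single example, so one pass over the data performs as many parameter updates as there are examples.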

2. Probabilities & Bayes rule

Consider a card game where you have 4 suits (hearts, diamonds, clubs and spades) and in each suit the cards 7, 8, 9, 10, Jack (J), Queen (Q), King (K) and Ace (A). In this question we will use the following commonly used terms:

•  the pack: the set of all cards that have not been drawn yet.

•  to draw: to pick a card at random from the pack of remaining cards, removing it from the pack.

•  the hand: the set of cards a player has drawn from the pack.

•  a payout: the number of points you get for a given hand.

•  to fold: to stop playing and put your cards back in the pack, forfeiting any payout for this game.
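The terms above can be modelled in a few lines. The following is a sketch only: the tuple representation of a card and the helper names `new_pack`, `draw` and `probability` are assumptions introduced for this example.

```python
import random
from fractions import Fraction

SUITS = ("hearts", "diamonds", "clubs", "spades")
RANKS = ("7", "8", "9", "10", "J", "Q", "K", "A")

def new_pack():
    """Full pack: one (rank, suit) card per combination, 32 cards in total."""
    return [(rank, suit) for suit in SUITS for rank in RANKS]

def draw(pack, rng):
    """Remove and return one card chosen uniformly at random from the pack."""
    return pack.pop(rng.randrange(len(pack)))

def probability(pack, event):
    """Exact probability that a single uniform draw from `pack` satisfies `event`."""
    favourable = sum(1 for card in pack if event(card))
    return Fraction(favourable, len(pack))

pack = new_pack()
rng = random.Random(0)                        # fixed seed for reproducibility
hand = [draw(pack, rng) for _ in range(3)]    # the hand holds the drawn cards

# Probability of one specific card (here the 9 of clubs) from a full pack.
p_specific = probability(new_pack(), lambda card: card == ("9", "clubs"))
```

Because `draw` removes the card from the pack, later probabilities have to be computed against the remaining cards rather than the full 32.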

(a) Assuming that you draw a single card at random from the pack, give the probabilities of the following events:

(i)  Drawing an Ace?

(ii)  Drawing a red card?

(iii)  Drawing a diamond?

(iv)  Drawing a royalty figure (Jack, Queen or King)?

(v)  Drawing the Ace of spades?                 [5]

(b) Now assume that you have already drawn the three cards 10, J, Q. When drawing two more cards from the pack, what is the probability of obtaining:

(i)  A pair of two cards with the same value (e.g., two Jacks).

(ii)  Two pairs (e.g., two Jacks and two Queens).

(iii)  Three of a kind (e.g., three Jacks).

(iv)  A sequence of 5 cards (e.g., 10, J, Q, K, A). Note that the cards can be of any suit, but there cannot be a break in the sequence.        [4]

(c) Now let us assume the following payout table for each hand of 5 cards:

As before, you have the cards 10, J, Q in hand.

(i)  If you draw two more cards randomly from the pack, what is the expected value of the payout for this hand?        [3]

(ii)  Assuming that you need to pay 5 every time you draw a card (hence you would need to pay 10 to draw two cards), should you fold your hand or draw cards?                       [2]

(iii)  Should you fold after drawing the first card (and having paid 5), if the card is: (i) the 7 of hearts, (ii) the 8 of spades or (iii) the Queen of diamonds?        [6]

3. Database systems

An online retail company is trying to assess the performance of its DB systems and has asked you to investigate some of the operations.  Consider a relation Seller (ID, Name, Country) – abbreviated as S – where the primary key (ID) is a 32-bit integer, the Name attribute is a 54-byte (fixed length) string, and Country is a 16-bit integer. Further consider a relation Product(ID, ProductID, ManufacturerID, Price) – abbreviated as P – with ID being a foreign key to Seller’s ID, ProductID and ManufacturerID being 64-bit integers, Price being a 32-bit float, and the first three attributes making up the relation’s (composite) primary key.

Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that P has r_P = 100,000 tuples. Last, assume that Product is stored in a sequential file sorted by its primary key and that Seller is stored in a heap file. Finally, note that the database system adopts fixed-length records, i.e., each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work.        [2]
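One common way to set up this arithmetic is sketched below. It assumes unspanned records (a record never crosses a block boundary), no per-record overhead, and that the 10-byte header is unavailable for data; these are assumptions of the sketch, and the helper names are hypothetical.

```python
BLOCK_SIZE = 512     # bytes per block, from the question
BLOCK_HEADER = 10    # bytes of header per block, from the question

def blocking_factor(record_size, block_size=BLOCK_SIZE, header=BLOCK_HEADER):
    """Whole records that fit in the usable part of one block (unspanned)."""
    return (block_size - header) // record_size

def blocks_needed(num_records, bfr):
    """Blocks required to hold num_records at bfr records per block (ceiling)."""
    return -(-num_records // bfr)

# Record sizes in bytes, from the attribute widths given in the question.
seller_record = 4 + 54 + 2        # ID (32-bit) + Name (54 bytes) + Country (16-bit) = 60
product_record = 4 + 8 + 8 + 4    # ID + ProductID + ManufacturerID + Price = 24

bfr_s = blocking_factor(seller_record)
bfr_p = blocking_factor(product_record)
blocks_s = blocks_needed(1_000, bfr_s)
blocks_p = blocks_needed(100_000, bfr_p)
```

Under different assumptions (for example spanned records or a per-record header) the resulting figures would change, so the assumptions should be stated alongside any answer.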

(b) Consider the following query:

    SELECT S.Name, P.ID, P.Price
    FROM Seller AS S, Product AS P
    WHERE S.ID = P.ID AND S.ID >= 6000 AND S.ID <= 6199;

Assume that the memory of the database system can accommodate nB = 22 blocks for processing, that all seller IDs in the query range exist in the database, and that all sellers have the same number of products on average.

Explain the query processing algorithm, taking care to include the file organisation in your reasoning. Then estimate the total expected cost (in number of block accesses) of the query above for each of the following two approaches, disregarding the cost of writing the result set:

(i)  First, assume that S is scanned at the outer loop and show your work:                      [9]

(ii)  Second, assume instead that P is scanned at the outer loop and show your work:     [9]




