COMPSCI 5089 Introduction to Data Science and Systems 2022

DEGREES of MSci, MEng, BEng, BSc, MA and MA (Social Sciences)

Introduction to Data Science and Systems

COMPSCI 5089

1.    (a) You are designing an application for clothing shops to predict clothes size based on customer height and weight.   Suppose we have a clothing dataset with height, weight and the corresponding T-shirt size of several customers.

customer ID    height (cm)    weight (kg)    size
U1             170            60             M
U2             172            60             M
U3             173            61             M
U4             173            64             L
U5             175            67             L
U6             175            66             L

Each customer can be represented as a vector by treating height and weight as two dimensions. Now there is a new client, Abel (U0), whose height is 173 cm and weight is 62 kg. You are asked to predict the T-shirt size for Abel.

(i)  Calculate the Euclidean distance (L2 Norm) between the new point and the existing points.         [3]

(ii)  Predict the size of Abel, based on the kNN algorithm, with k = 3 and the above calculated distances. Justify your prediction.        [2]
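The two steps above can be sketched with numpy as follows. This is a minimal illustration rather than a model answer: the variable names are made up for the example, the features are used unscaled (as the question implies), and ties in the vote are not handled.

```python
import numpy as np
from collections import Counter

# Training data copied from the table: (height in cm, weight in kg) and size.
points = np.array([[170, 60], [172, 60], [173, 61],
                   [173, 64], [175, 67], [175, 66]], dtype=float)
sizes = ["M", "M", "M", "L", "L", "L"]
abel = np.array([173, 62], dtype=float)  # new customer U0

# Euclidean (L2) distance from Abel to every existing customer.
distances = np.linalg.norm(points - abel, axis=1)

# Indices of the k = 3 closest customers, then a majority vote over their sizes.
k = 3
nearest = np.argsort(distances)[:k]
votes = Counter(sizes[i] for i in nearest)
prediction = votes.most_common(1)[0][0]

for uid, d in zip(range(1, 7), distances):
    print(f"U{uid}: {d:.3f}")
print("nearest:", [f"U{i + 1}" for i in nearest], "->", prediction)
```

Because height and weight are on similar scales here, no normalisation is applied; with features on very different scales the distances would be dominated by the larger one.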

(b) For all answers, include in your answer document both code and the output of that code.

(i)  Calculate the covariance matrix for the clothing dataset using numpy.          [1]

(ii)  Calculate the eigenvectors and eigenvalues of the covariance matrix using numpy.        [2]

(iii)  Dimensionality reduction. Project the clothing dataset onto the principal component with the largest eigenvalue of its covariance matrix.         [2]
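One possible way to approach the three parts of (b) is sketched below. It assumes the sample covariance (dividing by n - 1) is wanted and uses `np.linalg.eigh`, which is suited to symmetric matrices; the variable names are illustrative only.

```python
import numpy as np

# Height (cm) and weight (kg) of customers U1..U6, copied from the table.
X = np.array([[170, 60], [172, 60], [173, 61],
              [173, 64], [175, 67], [175, 66]], dtype=float)

# Sample covariance (divides by n - 1); rowvar=False treats columns as variables.
cov = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The first principal component is the eigenvector with the largest eigenvalue.
pc1 = eigenvectors[:, -1]

# Centre the data, then project each customer onto that direction.
projected = (X - X.mean(axis=0)) @ pc1

print(cov)
print(eigenvalues)
print(projected)
```

The sign of an eigenvector is arbitrary, so the projected values may come out negated on another machine; their variance equals the largest eigenvalue either way.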

(c) (i) Find the SVD of A = ; you should include full working in your solution.        [3]

(ii)  State the relationship between the determinant, matrix inversion and non-singularity.         [2]
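The link between these three notions can be checked numerically. The sketch below uses two hypothetical 2 × 2 matrices, chosen only for illustration and unrelated to the matrix A in part (i).

```python
import numpy as np

# Hypothetical matrices chosen only for illustration.
M = np.array([[2.0, 1.0], [1.0, 3.0]])   # det = 5, non-singular
S = np.array([[1.0, 2.0], [2.0, 4.0]])   # second row is 2 x first row, singular

det_M = np.linalg.det(M)
det_S = np.linalg.det(S)

# A non-zero determinant means the inverse exists and M @ M_inv is the identity.
M_inv = np.linalg.inv(M)
identity_check = np.allclose(M @ M_inv, np.eye(2))

# A singular matrix has determinant 0 and rank below its size.
rank_S = np.linalg.matrix_rank(S)

print(det_M, identity_check)
print(det_S, rank_S)
```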

2. Consider a tennis player, Ed Balls, who wants to prepare for a competition match against an opponent—let’s call him Frank Racket. In order to prepare for the match, Ed has acquired records of the 100 previous matches of his opponent and wants to study statistics of Frank’s play to choose where to focus his training.

(Here is a quick summary of the rules of tennis: https://protennistips.net/tennis-rules/)

Ed is interested in studying Frank’s serve as this can be an important strategic advantage.

•        For a serve to be valid, it must pass the net and bounce in the diagonally opposite service box.

•        If the first serve is a fault (e.g., it hits the net or bounces outside the service box), the player can attempt a second serve.

•        If the player makes a second fault, he loses the point.

Ed wants to study where Frank’s serves bounce in the service box to plan his positioning on the court.  We have N_F = 1,000 examples of first serves from Frank, and N_S = 1,000 examples of second serves. We want to estimate the distributions of the bounce location x for Frank’s first serves, p(x | first), and second serves, p(x | second).

For simplicity,

•        we denote the corner closer to the net and towards the centre of the court as position (0,0), and the corner towards the outside of the court and away from the net as (1,1).

•        we will ignore serves that hit the net.

This means that values outside [0, 1] × [0, 1] indicate that the serve is a fault.

(a) How would you use the empirical distribution to get an estimate of p(x | first)? Explain the steps, the parameters that need to be set and the associated trade-offs.        [4]
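One common reading of "empirical distribution" here is a normalised 2D histogram. The sketch below assumes that reading and uses synthetic bounce locations, since the real serve data is not part of this paper; the function name `empirical_density` is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 1,000 first-serve bounce locations (not real data).
bounces = rng.normal(loc=[0.7, 0.8], scale=0.15, size=(1000, 2))

def empirical_density(samples, bins):
    """2D histogram estimate of the density over the service box [0, 1] x [0, 1]."""
    density, x_edges, y_edges = np.histogram2d(
        samples[:, 0], samples[:, 1],
        bins=bins, range=[[0.0, 1.0], [0.0, 1.0]], density=True,
    )
    return density, x_edges, y_edges

coarse, _, _ = empirical_density(bounces, bins=5)
fine, _, _ = empirical_density(bounces, bins=50)

# Fraction of empty cells for each choice of bin count.
print((coarse == 0).mean(), (fine == 0).mean())
```

The number of bins is the parameter to set: few bins give a smooth but blurry estimate, many bins give fine detail but leave most cells empty with only 1,000 samples. In this sketch, serves landing outside the box are dropped before normalising, which is itself a modelling choice.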

(b) Ed now wants to model Frank’s serves using a normal distribution:

(i)  Explain the parameters, their effect on the distribution and the best way to estimate them in this scenario.     [4]

(ii)  What could be the problem with this choice of model? Give an example of a situation where it would be inappropriate (you can use a diagram to illustrate your example).       [2]
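For a two-dimensional normal model, the parameters are a mean vector and a 2 × 2 covariance matrix, and their maximum-likelihood estimates are the sample mean and the divide-by-N sample covariance. A minimal sketch on synthetic data (the values of `true_mean` and `true_cov` are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic bounce locations, used only to make the sketch runnable.
true_mean = np.array([0.6, 0.7])
true_cov = np.array([[0.02, 0.005], [0.005, 0.01]])
x = rng.multivariate_normal(true_mean, true_cov, size=1000)

# Maximum-likelihood estimates: sample mean and (biased, divide-by-N) covariance.
mu_hat = x.mean(axis=0)
centred = x - mu_hat
sigma_hat = centred.T @ centred / len(x)

print(mu_hat)
print(sigma_hat)
```

A single Gaussian of this kind is unimodal, so it cannot represent, for instance, a server who aims at two distinct corners of the box.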

(c) Ed has found that his normal model is not accurate enough for him. In order to get a more accurate modelling of the data, he decides to use a mixture of Gaussians to model his data.

Explain how the model would be parameterised, and how you would fit the model to the available data (provide the relevant equations).         [5]
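For reference, the standard formulation of a Gaussian mixture and its expectation-maximisation (EM) updates is summarised below. The number of components K and the symbols π_k, μ_k, Σ_k and γ_nk are the usual textbook notation, assumed here rather than taken from the course notes.

```latex
% Mixture density with K components
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% E-step: responsibility of component k for sample x_n
\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

% M-step: re-estimate the parameters from the responsibilities
N_k = \sum_{n=1}^{N} \gamma_{nk}, \qquad
\pi_k = \frac{N_k}{N}, \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n, \qquad
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \,
    (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}

% Quantity that EM increases at every iteration
\log p(\mathbf{X}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
```

The E-step and M-step are alternated until the log-likelihood stops improving; the result depends on the initialisation and on the choice of K.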

3. Pretend that you are the new head of a local radio station, IDSS Radio, and have been tasked with renewing the radio’s image and programme. The radio’s programming and popularity have varied over the years, and you want to use a data science approach to find the right type of programming for the local audience. To this end, you start by categorising the programming of the radio by type of content:

C = {music, news, business, fiction, comedy, advertisement}

You have historical records of the proportion of each content type in the radio programme for every month over the last ten years, as well as a rating r by a sample of the audience on a scale from 1 to 10, where 1 means “hate it” and 10 means “love it”.

Considering a programme p = [p_m, p_n, p_b, p_f, p_c, p_a] ∈ ℝ^6 that gives the number of hours for each content type, we are interested in studying the function r(p) that gives the listeners’ rating for this programme.

(a) As a first attempt, you decide to assume that the function r(p) is linear, and therefore to solve it using linear least squares, of the canonical form (from the lecture notes):

(i)  Explain what each variable in this equation means in this scenario, specifying their dimension, and what would be the result.      [4]

(ii)  Could you name a reason why this may not be a good model? How could you measure this using your data?   [3]
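A linear fit and a simple quality check can be sketched as follows. The sketch assumes the least-squares problem is written as min_w ||A w - y||², with one row of A per month and one column per content type; the names `A`, `y`, `w_hat` and the synthetic data are assumptions made for illustration, not the notation of the lecture notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n_months, n_types = 120, 6           # ten years of monthly records, six content types

# Synthetic programme matrix A (hours per content type) and ratings y; not real data.
A = rng.uniform(0.0, 6.0, size=(n_months, n_types))
true_w = np.array([0.8, 0.3, -0.2, 0.5, 0.6, -0.9])
y = A @ true_w + rng.normal(scale=0.1, size=n_months)

# Least-squares solution of min_w ||A w - y||^2.
w_hat, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

# Goodness of fit: residuals and the coefficient of determination R^2.
residuals = y - A @ w_hat
r_squared = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

print(w_hat)
print(r_squared)
```

On real ratings, a low R² or residuals that show a clear pattern against any p_z would suggest that the linear assumption does not hold.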

(b) We want to try to fit another model, this time assuming that listeners’ preferences peak for certain quantities of each program, and then decrease again if the quantity increases further. We could model this quantity preference as a bell-shaped function over the quantity p_z for each type of content z:

$$B_z(p_z) = \alpha_z \exp\left(-\beta \, \lVert p_z - \mu_z \rVert^2\right)$$

and the overall predicted preference for a program p as:

(i)  How many parameters do you need to estimate in this case? Explain the role of each parameter.   [3]

(ii)  What would be the most appropriate approach to fit this model to your data? (Note: all of the functions above are differentiable, but B_c is clearly not linear.)

Explain how you would parametrise this problem (you are not asked to solve it!)     [3]
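To make the parametrisation concrete, the sketch below writes the bell-shaped terms and a squared-error objective as Python functions. Two points are assumptions, not statements from the paper: the overall preference is taken to be the sum of the per-type terms, and β is treated as a single shared value because it carries no subscript in the formula above. All numeric values are hypothetical.

```python
import numpy as np

def bell(p, alpha, beta, mu):
    """B_z(p_z) = alpha_z * exp(-beta * (p_z - mu_z)^2), applied per content type."""
    return alpha * np.exp(-beta * (p - mu) ** 2)

def predicted_rating(p, alpha, beta, mu):
    # ASSUMPTION: the overall preference is the sum of the per-type terms.
    return bell(p, alpha, beta, mu).sum(axis=-1)

def loss(params, programmes, ratings):
    """Sum of squared errors; params packs alpha (6), mu (6) and a shared beta (1)."""
    alpha, mu, beta = params[:6], params[6:12], params[12]
    return np.sum((predicted_rating(programmes, alpha, beta, mu) - ratings) ** 2)

# Hypothetical parameter values, used only to check the functions.
alpha = np.array([3.0, 1.0, 0.5, 1.5, 2.0, 0.2])
mu = np.array([6.0, 2.0, 1.0, 2.0, 3.0, 1.0])
beta = 0.5
params = np.concatenate([alpha, mu, [beta]])

programmes = np.array([mu, mu + 1.0])          # two example programmes
ratings = predicted_rating(programmes, alpha, beta, mu)
print(ratings, loss(params, programmes, ratings))
```

Since `loss` is differentiable in `params` but not linear, it would typically be minimised with a gradient-based method started from several initial guesses.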

(c) Using this model r̂(p), how would you use optimisation to find the best program, knowing that you want to run the radio from 6am to midnight daily, and need at least 1 hour of advertisement per day to cover the radio running costs?  How would you solve this optimisation problem? [2]
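One way to write this down is as a constrained maximisation. The formulation below is a sketch: it assumes the whole 18-hour broadcast day (6am to midnight) must be filled and that no content type can receive negative hours.

```latex
\max_{\mathbf{p} \in \mathbb{R}^6} \; \hat{r}(\mathbf{p})
\quad \text{subject to} \quad
\sum_{z \in C} p_z = 18, \qquad
p_a \ge 1, \qquad
p_z \ge 0 \;\; \text{for all } z \in C
```

The equality constraint fixes the total airtime and the inequality on p_a encodes the advertising requirement, so a constrained optimiser is needed: for example, Lagrange multipliers for the equality, combined with a penalty or projection step for the inequalities.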

4.    (a) Consider a relation Weather(Id, Time, Longitude, Latitude, Temperature, Humidity), where the primary key (Id) is a 116-byte string hash code, Time is an 8-byte Datetime, and the other fields are stored as 32-bit floats. Assume that the relation has 30000 tuples, stored in a file on disk organised in 4096-byte blocks. Note that the database system adopts fixed-length records, i.e., each file record corresponds to one tuple of the relation and vice versa.

(i)  Compute the blocking factor and the number of blocks required to store this relation.   [2]
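The arithmetic can be laid out as below, assuming an unspanned organisation (records never cross a block boundary) and no per-record or per-block header overhead; the variable names are illustrative.

```python
import math

block_size = 4096                      # bytes per block
id_bytes, time_bytes = 116, 8          # Id hash string, Datetime
float_bytes, n_float_fields = 4, 4     # Longitude, Latitude, Temperature, Humidity
n_tuples = 30000

record_size = id_bytes + time_bytes + float_bytes * n_float_fields
# Unspanned organisation: a record never straddles two blocks.
blocking_factor = block_size // record_size
n_blocks = math.ceil(n_tuples / blocking_factor)

print(record_size, blocking_factor, n_blocks)  # 140 29 1035
```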

(ii)  You are told that you will need to frequently add new records and you will not often read and fetch a record. Describe in detail the file organisation that you would expect to exhibit the best performance characteristics. Explain your answer by comparing the cost of reasonable alternatives.   [3]

(b) Consider the following three relations:

Student(Id, FirstName, LastName, DateOfBirth), or S, where:

the primary key (Id) is a 32-bit integer,

FirstName and LastName are both 96-byte strings, and

DateOfBirth is a 32-bit integer.

Course(Id, Description, Credits), or C, where:

Id, the primary key of this relation, is a 32-bit integer,

Description is a 195-byte string, and


Credits is an 8-bit integer.

Transcript(StudentId, CourseId, Mark), or T, where:

StudentId is a foreign key to the primary key (Id) in the Student relation,

CourseId is a foreign key to the primary key (Id) in the Course relation above,

Mark is an 8-byte double-precision floating-point number, and

the primary key consists of the combination of StudentId and CourseId.

Assume these relations are also organised in 4096-byte blocks, and that:

•  Relation Course (C) has r_C = 32 records and n_C = 2 blocks, organised in a heap file,

•  Relation Transcript (T) has r_T = 51200 records and n_T = 200 blocks, organised in a sequential file, ordered by StudentId,

•  Relation Student (S) has r_S = 2000 records and n_S = 100 blocks, stored in a heap file, and has a 4-level secondary index on StudentId.

Further assume that the memory of the database system can accommodate n_B = 23 blocks for processing and that the blocking factor for the join-results block is bfr_RS = 10 records per block.

Last, assume we execute the following equi-join query:

SELECT * FROM Transcript AS T, Student AS S, Course AS C

WHERE T.StudentId = S.Id AND T.CourseId = C.Id

As this is a 3-way join, assume that you need to join T with C first, with each block of intermediate results stored only in RAM (in one of the nB blocks), then joined with S.

(i)  Describe the join strategy that would be the most efficient in this case and estimate its total expected cost (in number of block accesses). Show your work.       [8]


(ii)  Compare the Naive Nested Loop Join and the Index-based Nested-Loop Join. Which one is faster? Explain why.      [2]
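The two strategies in (ii) can be compared with the usual textbook cost formulas. The sketch below is not a model answer to (i): it looks only at a two-way join between Transcript and Student, ignores the cost of writing results, and assumes one block access per index level plus one for the data block. The function names are made up for the example.

```python
import math

def nested_loop_cost(n_outer, n_inner, n_buffers):
    """Block nested-loop join: read the outer once, in chunks of (n_buffers - 2) blocks,
    and scan the whole inner relation once per chunk."""
    chunks = math.ceil(n_outer / (n_buffers - 2))
    return n_outer + chunks * n_inner

def index_nested_loop_cost(n_outer, r_outer, index_levels):
    """Index-based join: read the outer once, then for each outer record walk the
    index and fetch one matching data block."""
    return n_outer + r_outer * (index_levels + 1)

n_T, r_T = 200, 51200      # Transcript blocks and records
n_S = 100                  # Student blocks
n_B = 23                   # memory buffers
levels = 4                 # levels of the secondary index on Student

cost_nl = nested_loop_cost(n_S, n_T, n_B)             # Student as outer
cost_inl = index_nested_loop_cost(n_T, r_T, levels)   # probe Student's index
print(cost_nl, cost_inl)  # 1100 256200
```

Under these assumptions the index-based join is not automatically the cheaper option: its cost grows with the number of outer records, so it pays off mainly when the outer relation is small or highly selective.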




