
INTRODUCTION TO DATA SCIENCE AND SYSTEMS (M)

COMPSCI 5089

Monday 26 April 2021

1. Computational linear algebra and optimisation

You have been asked to help design the subcomponents of a music streaming service. The service has access to 101,750 music tracks (i.e. the audio files). Each music track can be summarised based on its audio content using so-called audio features, resulting in a 15-dimensional vector, x ∈ R^(1×15), for each track. The meaning and importance of the individual dimensions in the vector are unknown. The vectors for the individual tracks are collected in a matrix X as row vectors.

Aside from the audio file itself, the service has access to the title and artist for each track, the genre(s) associated with each track (e.g. jazz) and finally the popularity of each track as a scalar y ∈ R.

(a) The team wants to develop a function called "What is this track called?" where users can upload an audio file in order to identify the name of the track and the artist. To this end we are interested in computing Euclidean distances between the music tracks based on their vector representations.

(i)  Certain aspects of X are summarised in Table 1. Explain why it is a good idea to normalise the data in X before computing the similarity between the tracks and suggest a suitable normalisation approach. Justify your approach and make reference to specific elements in Table 1.          [3]

(ii)  Design a simple search routine which can find the closest match between the uploaded track and a track in the existing dataset. Write the procedure using equations or NumPy code (1-3 lines). Determine how many individual distances you will need to compute and discuss any potential scalability issues.          [3]
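As an illustration of the kind of routine part (a) describes, here is a minimal sketch of a brute-force nearest-neighbour search. It assumes z-score standardisation (one possible normalisation among several) and uses synthetic stand-in data; the function name `nearest_track` and the data are hypothetical and not part of the question.

```python
import numpy as np

def nearest_track(X, q):
    """Return the index of the row of X closest to q in Euclidean distance.

    Both X and q are standardised with the column means and standard
    deviations of X (z-scoring). This is an assumed normalisation.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    Z = (X - mu) / sigma                     # (n, d) standardised dataset
    zq = (q - mu) / sigma                    # (d,) standardised query
    dists = np.linalg.norm(Z - zq, axis=1)   # n distances, one per stored track
    return int(np.argmin(dists))

# Synthetic stand-in data (hypothetical; the real X would have 101,750 rows)
rng = np.random.default_rng(0)
scales = np.array([0.1] + [1.0] * 14)        # one column on a much smaller scale
X = rng.normal(size=(1000, 15)) * scales + 5.0
query = X[42] + 1e-6                         # slightly perturbed copy of track 42
best = nearest_track(X, query)
```

A linear scan like this computes one distance per stored track for every query, so its cost grows linearly with the size of the catalogue.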

(b) A subcomponent of the system relies on a mapping from tracks to popularity. This can be formulated as a matrix problem, X w^T - y = 0, where X is a matrix containing the music features for the tracks, w is a 15-dimensional vector and y is a vector containing the popularity scores for each track. The team is interested in the most efficient and robust method for finding w using the squared error as the loss function.

(i)  Specify the dimensions of the matrix X and determine if w and y are considered row or column vectors, respectively.            [1]

(ii)  Determine a method for solving the matrix equation with respect to w. Justify your approach.          [2]
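For orientation, the squared-error formulation in part (b) can be sketched with NumPy's built-in least-squares solver. This is one possible approach under assumed synthetic data (the sizes, `w_true` and the noise level are hypothetical); in the code `w` is a 1-D array, so the transpose in X w^T is implicit.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 15                         # hypothetical small n; the paper has n = 101,750
X = rng.normal(size=(n, d))            # (n, 15) feature matrix, one track per row
w_true = rng.normal(size=d)            # assumed ground-truth weights (illustration only)
y = X @ w_true + 0.01 * rng.normal(size=n)   # (n,) noisy popularity scores

# Minimise ||X w - y||^2 without forming an explicit matrix inverse.
w_hat, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
```

Besides the solution, `np.linalg.lstsq` returns the rank and singular values of X, which give a quick check on how well conditioned the problem is.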

Dim.        μ       σ²      Min      Max
1          0.1     0.01     0.09     0.12
2         -1.5     1.0     -4.0      1.6
3         78.1     1.0     75.1     80.5
4          0.1     1.0     -2.4      3.0
5          1.1     1.0     -1.9      3.8
6          0.8     1.0     -1.5      3.2
7          0.0     1.0     -2.9      2.3
8       -159.3     1.0   -162.3   -156.9
9          0.2     1.0     -3.7      2.5
10         0.0     1.0     -2.5      2.6
11         0.0     1.0     -2.4      2.3
12         0.1    10.1    -24.7     31.0
13         0.0     1.0     -2.6      2.0
14        -0.5     1.0     -3.6      2.1
15         0.3     1.0     -2.6      2.5

Table 1: Basic statistics for each dimension in X.

Figure 1: Eigenspectrum (unordered)

(c) The user interface team has requested that you provide a procedure for projecting the music tracks to 2D or 3D based on the vector representation so they can visualise the music tracks on a computer screen. You must use a linear map due to computational constraints.

(i)  Outline a procedure for finding the 3D coordinates so that the projection preserves most of the variance and can be implemented using only basic Python and NumPy by a junior data scientist. You should not provide the code, but explain the individual steps in the procedure using text or equations and only recommend the suitable NumPy commands. You must specify the dimensions of all vectors and matrices required to compute the projection.          [4]

(ii)  The eigenspectrum of the covariance matrix of X is shown in Figure 1. Discuss what the eigenspectrum says about the vectors representing the audio files and how this could be leveraged to make the system more efficient. Discuss if the team’s idea of a 2D or 3D interface is justified.                     [3]
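To make the shapes involved in part (c) concrete, the following is a minimal sketch of a variance-preserving linear projection via an eigendecomposition of the covariance matrix (principal component analysis). The data are synthetic and the variable names are hypothetical; it is an illustration, not a model answer.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for X: 2000 tracks, 15 features with decreasing spread
X = rng.normal(size=(2000, 15)) * np.linspace(3.0, 0.5, 15)

Xc = X - X.mean(axis=0)                  # (n, 15) centred data
C = np.cov(Xc, rowvar=False)             # (15, 15) covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigh: for symmetric matrices, ascending order
order = np.argsort(evals)[::-1]          # indices of eigenvalues, largest first
W = evecs[:, order[:3]]                  # (15, 3) top-3 principal directions
Y = Xc @ W                               # (n, 3) coordinates for visualisation
explained = evals[order[:3]].sum() / evals.sum()   # fraction of variance retained
```

The ratio `explained` is the quantity an eigenspectrum plot summarises: if a few eigenvalues dominate, a low-dimensional view loses little variance.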

(d) Your team is contemplating a new subcomponent which would enable users to generate a new music track. The team has already developed a function, r(x), which makes it possible to map from the vector representation, x, to the audio file.

The aim is to create a new track based on a genre profile, which is a 5-dimensional vector, g ∈ R^5. Your team has provided a non-linear function, f : x → g, that maps from the track vector to the 5D genre profile. Provide a solution in the form of an optimisation problem and determine a suitable method to solve the stated problem. Justify your choice of method and explain under which circumstances it is guaranteed to converge to a sensible solution in this scenario. You will need to make assumptions, which must be clearly stated.          [4]
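One way to picture the setting in part (d) is as minimising the squared distance between f(x) and a target genre profile g. The sketch below uses plain gradient descent on a toy stand-in for f; the function `f(x) = tanh(A x)`, the matrix `A`, the step size and the iteration count are all assumptions made for illustration, since the paper does not specify f.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 15)) / np.sqrt(15)    # hypothetical weights of the toy f

def f(x):
    """Toy differentiable map from a 15-D track vector to a 5-D genre profile."""
    return np.tanh(A @ x)

def loss_and_grad(x, g):
    """Squared error ||f(x) - g||^2 and its gradient with respect to x."""
    r = f(x) - g                                       # (5,) residual
    J = (1.0 - np.tanh(A @ x) ** 2)[:, None] * A       # (5, 15) Jacobian of f
    return float(r @ r), 2.0 * J.T @ r

g = f(rng.normal(size=15))        # a target profile that is reachable by construction
x = np.zeros(15)                  # starting point
initial_loss, _ = loss_and_grad(x, g)
for _ in range(2000):
    _, grad = loss_and_grad(x, g)
    x = x - 0.05 * grad           # fixed step size (assumed small enough)
final_loss, _ = loss_and_grad(x, g)
```

Because the objective is generally non-convex for a non-linear f, a run like this only finds a local minimum that depends on the starting point and the step size.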

2. Probabilities & Bayes rule

Consider a scenario where you are in charge of analysing the data and modelling a pandemic. We consider a given disease (let's call it 'VIRUS'), which has an unknown prevalence r in the population (we will assume that r ∈ [0, 1] is the proportion of the population that has the disease). We will write the probability that a person is diseased as p(D) = r.

(a) Your lab has developed a fast testing procedure to detect this disease. In order to evaluate the accuracy and reliability of this test, you have conducted trials on 132 subjects, and compared the results of your test with those of a perfectly accurate (and presumably more expensive) diagnostic. The results of those trials are collated in the following table:

            positive    negative
diseased       28           3
healthy        12          89

(i)  Using Bayes' formula and the trial data in the table, provide an estimate of the probabilities:

p(D | T+), that a subject who tested positive is truly diseased; and

p(D | T-), that a subject who tested negative is actually diseased.          [4]

(ii)  Taking into consideration the test accuracy and reliability as evidenced in the trials, would this test be appropriate for the following situations:

1.  regular testing of people working with vulnerable populations;

2.  deciding on whether to administer a treatment with severe side effects; or

3.  applying it to the whole population to find all diseased individuals. Justify your answers.          [3]

(iii)  You administer a test with probabilities p(D | T+) = 0.7 and p(D | T-) = 0.01 to a sample of 1000 subjects drawn randomly from the population. The test returns 980 negatives and 20 positives.

From this data, calculate an estimate of the prevalence p(D) explaining your reasoning. [4]
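The arithmetic behind part (a) can be sketched as follows. The counts come from the trial table; the variable names and the helper `prevalence` are hypothetical, and the prevalence estimate relies on the law of total probability with the sample's positive and negative rates standing in for p(T+) and p(T-).

```python
# Trial counts from the table: rows are true status, columns are test result.
tp, fn = 28, 3            # diseased: tested positive, tested negative
fp, tn = 12, 89           # healthy:  tested positive, tested negative
n = tp + fn + fp + tn     # 132 subjects in total

# Bayes' rule: p(D | T+) = p(T+ | D) p(D) / p(T+)
p_d = (tp + fn) / n                   # prevalence within the trial sample
p_pos = (tp + fp) / n                 # probability of a positive test
p_pos_given_d = tp / (tp + fn)        # p(T+ | D)
p_neg_given_d = fn / (tp + fn)        # p(T- | D)
p_d_given_pos = p_pos_given_d * p_d / p_pos          # equals 28/40
p_d_given_neg = p_neg_given_d * p_d / (1 - p_pos)    # equals 3/92

def prevalence(n_pos, n_neg, p_d_pos, p_d_neg):
    """Estimate p(D) = p(D | T+) p(T+) + p(D | T-) p(T-) from test counts."""
    total = n_pos + n_neg
    return p_d_pos * n_pos / total + p_d_neg * n_neg / total

# Numbers quoted in part (iii): 20 positives and 980 negatives out of 1000
r_hat = prevalence(20, 980, 0.7, 0.01)
```

With the trial counts, the Bayes' rule expression reduces to the column ratios 28/40 and 3/92, which is a useful sanity check on the formula.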

(b) Let us consider that you are experimenting with a vaccine against the disease. You have 1000 subjects in group A who take the vaccine and 1000 in group B who take a placebo. Let us assume that you test the subjects in both groups daily, and after one month you obtain the following results: 2 subjects from group A tested positive at some point during the month, and 40 subjects from group B. In this part we will assume that we are using a test with the following statistics:

the probability of having the disease if tested positive is p(D | T+) = 0.7;

the probability of having the disease if tested negative is p(D | T-) = 0.01.

(i)  Accounting for the limitations of the test, how many subjects in groups A and B may have caught the disease during this month?          [5]

(ii)  The efficacy of a vaccine is typically calculated as

Use your results from above to calculate the efficacy of the vaccine. Discuss what would happen if your test were less accurate: what would happen if p(D | T+) were lower? If p(D | T-) were higher?          [4]
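A rough sketch of the bookkeeping in part (b) is given below. It makes two loud assumptions. First, "tested positive at some point during the month" is treated as a single test outcome per subject, which is a simplification. Second, the efficacy formula is not reproduced in this text, so the code uses a commonly quoted definition, 1 minus the ratio of attack rates in the vaccinated and placebo groups, which may differ from the one intended in the paper. The function name `expected_diseased` is hypothetical.

```python
def expected_diseased(n_pos, n_neg, p_d_pos=0.7, p_d_neg=0.01):
    """Expected number of diseased subjects given counts of test outcomes."""
    return p_d_pos * n_pos + p_d_neg * n_neg

group_size = 1000
cases_a = expected_diseased(2, 998)     # vaccine group: 0.7*2 + 0.01*998
cases_b = expected_diseased(40, 960)    # placebo group: 0.7*40 + 0.01*960

# ASSUMPTION: efficacy = 1 - (attack rate vaccinated / attack rate placebo)
attack_rate_a = cases_a / group_size
attack_rate_b = cases_b / group_size
efficacy = 1 - attack_rate_a / attack_rate_b

# The same assumed formula applied to the raw positive counts, for comparison
naive_efficacy = 1 - (2 / group_size) / (40 / group_size)
```

Comparing `efficacy` with `naive_efficacy` shows how the false-negative rate alone can change the estimate, which is the effect the discussion part of the question asks about.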

3. Database systems

Consider a relation Student(ID, Name, StudyPlan) – abbreviated as S – where the primary key (ID) is a 64-bit integer, the Name attribute is a 40-byte (fixed length) string, and StudyPlan is a 16-bit integer. Further consider a relation Marks(ID, CourseID, AssessmentID, Mark) – abbreviated as M – with ID being a foreign key to Student's ID, CourseID and AssessmentID being 16-bit integers, Mark being a 64-bit float, and the first three attributes making up the relation's (composite) primary key.

Assume that both relations are stored in files on disk organised in 512-byte blocks, with each block having a 10-byte header. Assume that S has r_S = 1,000 tuples and that M has r_M = 100,000 tuples. Last, assume that Student is stored in a heap file, and Marks is stored in a sequential file sorted by its primary key. Note that the database system adopts fixed-length records, i.e., each file record corresponds to one tuple of the relation and vice versa.

(a) Compute the blocking factors and the number of blocks required to store these relations. Show your work.          [2]
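The blocking-factor arithmetic can be sketched as below. It assumes unspanned records (a record never straddles two blocks) and no per-record overhead beyond the attribute sizes given above; the helper `blocking` and the constant names are hypothetical.

```python
import math

BLOCK_SIZE = 512      # bytes per disk block
HEADER = 10           # bytes of block header

def blocking(record_bytes, n_tuples):
    """Return (blocking factor, number of blocks) for fixed-length, unspanned records."""
    bfr = (BLOCK_SIZE - HEADER) // record_bytes    # whole records per block
    blocks = math.ceil(n_tuples / bfr)             # blocks needed for all tuples
    return bfr, blocks

student_record = 8 + 40 + 2       # ID (64-bit) + Name (40 bytes) + StudyPlan (16-bit)
marks_record = 8 + 2 + 2 + 8      # ID + CourseID + AssessmentID + Mark (64-bit float)

bfr_s, blocks_s = blocking(student_record, 1_000)
bfr_m, blocks_m = blocking(marks_record, 100_000)
```

Under these assumptions the usable space per block is 502 bytes, so the floor division discards the leftover bytes that cannot hold a whole record.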

(b) Consider the following query:

SELECT S.Name, M.ID, M.Mark FROM Student AS S, Marks AS M WHERE S.ID = M.ID AND S.ID >= 10000 AND S.ID <= 10199;

Assume that the memory of the database system can accommodate n_B = 22 blocks for processing, that all student IDs in the query range exist in the database, and that all students have the same number of marks on average.

Consider the following two approaches: (a) S is scanned in the outer loop (9 marks total) and (b) M is scanned in the outer loop (9 marks total). For each approach, explain the query processing algorithm (3 marks for each approach), taking care to include the file organisation in your reasoning, and estimate the total expected cost (in number of block accesses) for these two strategies (6 marks for each approach). Disregard the cost associated with the writing of the result set. Show your work. [18]


