# Computing Assignment Help | The Context of Data Linkage

Question 1 (1 pt)

Select all that are correct statements in the context of data linkage.

Group of answer choices

The Pair Completeness score is likely to decrease if the sizes of all blocks are large.

For any blocking function, blocking reduces the original O(n^2) complexity of pairwise comparison to linear complexity.

Assuming each record is allocated to exactly one block and that all blocks are equally sized, a blocking method that produces more blocks will have a higher reduction ratio.
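
As a rough illustration of the last statement, the sketch below computes the reduction ratio, taken here as 1 minus the fraction of candidate pairs that remain after blocking. It assumes a single dataset of n records being deduplicated and split into equally sized, non-overlapping blocks. The function names and the choice of 1000 records are assumptions made for illustration and are not part of the question.

```python
def total_pairs(n: int) -> int:
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2


def reduction_ratio(n_records: int, n_blocks: int) -> float:
    """Reduction ratio when n_records are split into n_blocks equal blocks.

    Assumes every record falls into exactly one block and that
    n_records is divisible by n_blocks (hypothetical simplification).
    """
    if n_records % n_blocks != 0:
        raise ValueError("n_records must be divisible by n_blocks")
    block_size = n_records // n_blocks
    # Only pairs inside the same block are compared after blocking.
    candidate_pairs = n_blocks * total_pairs(block_size)
    return 1 - candidate_pairs / total_pairs(n_records)


if __name__ == "__main__":
    for blocks in (1, 10, 100):
        print(blocks, round(reduction_ratio(1000, blocks), 4))
```

Under these assumptions the comparisons inside each block are still quadratic in the block size, so how far the total drops depends on how the blocking function splits the records.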

Question 2 (4 pts)

Consider the following XML file:

<?xml version="1.0"?>

<subject code="COMP20008">

<URL> https://handbook.unimelb.edu.au/subjects/comp20008 </url>

<name> Elements of Data Processing </name>

</subject>

<semester>1</semester>

<year/>

(a) Modify the XML so that it is well formed.

(b) Explain why the data format is said to be semi-structured.
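
One way to check an attempt at part (a) is to run the document through a parser, since a conforming XML parser rejects input that is not well formed. The sketch below uses Python's standard `xml.etree.ElementTree`. The `REPAIRED_XML` string is only one possible fix (matching the `url` tags and moving `semester` and `year` under the single `subject` root), and the helper name `is_well_formed` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The XML from the question, with plain ASCII quotes.
ORIGINAL_XML = """<?xml version="1.0"?>
<subject code="COMP20008">
<URL> https://handbook.unimelb.edu.au/subjects/comp20008 </url>
<name> Elements of Data Processing </name>
</subject>
<semester>1</semester>
<year/>
"""

# One possible repair (an assumption, not the only valid answer):
# matching tag case and a single root element.
REPAIRED_XML = """<?xml version="1.0"?>
<subject code="COMP20008">
  <url> https://handbook.unimelb.edu.au/subjects/comp20008 </url>
  <name> Elements of Data Processing </name>
  <semester>1</semester>
  <year/>
</subject>
"""


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text.strip())
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("original well formed:", is_well_formed(ORIGINAL_XML))
    print("repaired well formed:", is_well_formed(REPAIRED_XML))
```

The check only covers well-formedness. It says nothing about whether the document is valid against a schema or DTD.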

Question 3 (4 pts)

Consider the following temperature data from various weather stations in Victoria:

16, 12, 15, 18, 13, 43, 10

The values are comma separated.

(a) Will the value 43 be classified as an outlier on a Tukey box plot? Show how you arrive at your conclusion.
(b) Suggest an imputation method for the data and justify your choice.
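
The arithmetic behind part (a) can be sketched as follows. A Tukey box plot flags values outside the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Quartile conventions differ between textbooks and libraries. This sketch assumes the default "exclusive" method of Python's `statistics.quantiles`, which for these seven values gives Q1 = 12 and Q3 = 18. The function name `tukey_fences` is an assumption made for illustration.

```python
from statistics import quantiles

temps = [16, 12, 15, 18, 13, 43, 10]


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for the given values."""
    # quantiles(..., n=4) returns [Q1, median, Q3]; the default
    # method is "exclusive" (an assumed quartile convention).
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = tukey_fences(temps)
outliers = [t for t in temps if t < lower or t > upper]

print("fences:", lower, upper)   # fences: 3.0 27.0
print("outliers:", outliers)     # outliers: [43]
```

With the "inclusive" convention the quartiles become 12.5 and 17, which moves the upper fence to 23.75, so 43 still lies outside it.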

Question 4 (2 pts)

Consider the following two plots:

Plot (1) is a VAT (Visual Assessment of cluster Tendency) plot.

Plot (2) is a scatter plot of the first 2 Principal Components of the data.

The data scientist states that the two plots are created from the same dataset. Do you believe the statement? Justify your answer.

Question 5 (3 pts)

Consider a dataset with 10000 rows and 500 features. Give three reasons why we might want to apply PCA while analysing the dataset.

Question 6 (8 pts)

a) Explain, with examples, what supervised and unsupervised learning are and what the key differences between them are. (4 points)
b) Assume you need to build a model from medical data that predicts whether or not a patient suffers from a particular illness. How would you decide whether to use supervised or unsupervised learning? (4 points)

Question 7 (4 pts)

Assume you use k-NN clustering on a data set. Describe a method for choosing the best value for k.