# Part 2: Short answer questions

## Question 11 (6 marks)

(a) Explain the receiver operating characteristic (ROC) curve in classification. (2 marks)

(b) Explain the following concepts in hypothesis testing: significance level, p-value, t-statistic, and
confidence interval. (2 marks)

(c) Describe the situations in which you would prefer to use Student’s t-test, Welch’s t-test, and the
Wilcoxon rank-sum test, respectively, to conduct hypothesis testing. (2 marks)

## Question 12 (5 marks)

(a) Describe the following concepts in association rule mining: support, confidence, lift, and leverage. (2
marks)

(b) Answer the following three questions related to the figure below: i) What does each small circle
represent? ii) What could the small circles residing on the same “straight line” share? iii) What causes
the existence of multiple different “straight lines”? (3 marks)

## Question 13 (7 marks)

(a) Explain what the concept “residual” means in linear regression. (2 marks)

(b) After training a linear regression model, you plot the residuals shown in the figure below. Describe what
issue you can observe from this figure and how you would deal with this issue. (2 marks)

(c) You are building a decision tree in which the output variable is “Play” and one of the attributes is
“Weather”. The variable “Play” can take the value of “Yes” or “No” while the attribute “Weather” can
take the value of “Rainy”, “Overcast”, or “Sunny”. Based on the probabilities provided in the following
two tables, show how you compute the information gain for the attribute “Weather”. (3 marks)

## Question 14 (6 marks)

(a) Assume that you are working on a classification problem as a data scientist. The dataset is found to
contain many correlated variables, most of which are categorical. Which of the following classifiers
would be most suited for modelling this dataset: logistic regression, decision tree, or naïve Bayes
classifier? Explain your answer. (3 marks)

(b) A company would like to monitor what is being said about its products on social media. The company is
interested in 1) whether people mention its products and 2) what is being said, good or bad. Describe
your plan as a data scientist for this task. (3 marks)

## Question 15 (6 marks)

(a) Image representation plays a key role in image analysis. Name an image representation that is inspired
by the bag-of-words model in text analysis and describe how this image representation is obtained for
an image. (3 marks)

(b) Describe deep convolutional neural networks for image classification. (3 marks)