# Part 2: Short answer questions

Question 11 (6 marks)

(a) Explain the receiver operating characteristic (ROC) curve in classification. (2 marks)

(b) Explain the following concepts in hypothesis testing: significance level, p-value, t-statistic, and confidence interval. (2 marks)

(c) Describe the situations in which you would prefer to use Student’s t-test, Welch’s t-test, and the Wilcoxon rank-sum test, respectively, to conduct hypothesis testing. (2 marks)

Question 12 (5 marks)

(a) Describe the following concepts in association rule mining: support, confidence, lift, and leverage. (2 marks)

(b) Answer the following three questions related to the figure below: i) What does each small circle represent? ii) What could the small circles residing on the same “straight line” share? iii) What causes the existence of multiple different “straight lines”? (3 marks)

Question 13 (7 marks)

(a) Explain what the term “residual” means in linear regression. (2 marks)

(b) After training a linear regression model, you plot the residuals shown in the figure below. Describe the issue you can observe in this figure and how you would deal with it. (2 marks)

(c) You are building a decision tree in which the output variable is “Play” and one of the attributes is “Weather”. The variable “Play” can take the value of “Yes” or “No” while the attribute “Weather” can take the value of “Rainy”, “Overcast”, or “Sunny”. Based on the probabilities provided in the following two tables, show how you compute the information gain for the attribute “Weather”. (3 marks)

Question 14 (6 marks)

(a) Assume that you are working on a classification problem as a data scientist. The dataset is found to contain many correlated variables, most of which are categorical. Which of the following classifiers would be best suited to modelling this dataset: logistic regression, decision tree, or naïve Bayes classifier? Explain your answer. (3 marks)

(b) A company would like to monitor what is being said about its products on social media. The company is interested in 1) whether people mention its products and 2) whether what is being said is good or bad. Describe your plan, as a data scientist, for this task. (3 marks)

Question 15 (6 marks)

(a) Image representation plays a key role in image analysis. Name an image representation that is inspired by the bag-of-words model in text analysis and describe how this image representation is obtained for an image. (3 marks)

(b) Describe deep convolutional neural networks for image classification. (3 marks)