Python Assignment Help | Big Data Analysis Homework 2, CSE 482

This Python big data assignment mainly requires students to clean a data set and then, in the second half, to analyze it and make predictions using algorithms such as decision trees or regression.

CSE 482: Big Data Analysis (Spring 2019) Homework 2

Due date: February 20, 2019 (before midnight)

Submit your homework using the D2L system. Write your solutions in the notebook below, and make sure you submit the notebook along with its HTML version.

1. Consider the dataset shown in the table below.

Note that x1 through x9 are integer-valued counts sorted in ascending order (i.e., x1 corresponds to the lowest cell count while x9 has the highest cell count). Suppose we apply the following methods (equal interval width, equal frequency, and entropy-based) to discretize the blood cell count attribute into 3 bins. The bins obtained are listed below:

- Equal Width: 
    - Bin 1: x1, x2  
    - Bin 2: x3, x4, x5, x6, x7, x8
    - Bin 3: x9

- Equal Frequency: 
    - Bin 1: x1, x2, x3 
    - Bin 2: x4, x5, x6 
    - Bin 3: x7, x8, x9

- Entropy-based discretization with smoking status as class attribute: 
    - Bin 1: x1, x2
    - Bin 2: x3, x4, x5
    - Bin 3: x6, x7, x8, x9

Explain the effect of applying each transformation below on the discretization methods listed above. Specifically, state whether the elements assigned to the bins can change to a different bin if you apply discretization on the transformed attribute values.

(a) Centering the attribute: $x \to x - m$

(b) Standardizing the attribute: $x \to \frac{x - m}{s}$

(c) Applying logarithmic transform: $x \to \log(x)$

where x corresponds to one of the original blood count values (x1 to x9), m denotes the mean (average) value of the 9 numbers, and s denotes the standard deviation of the 9 numbers. Note: you do not need to know the exact values of x1 to x9 in order to answer this question.


(a) Centering the attribute

 i. Equal-width:

 ii. Equal frequency:

 iii. Entropy-based:

(b) Standardizing the attribute

 i. Equal-width:

 ii. Equal frequency:

 iii. Entropy-based:

(c) Applying log transform:

 i. Equal-width:

 ii. Equal frequency:

 iii. Entropy-based:

Note: You do not have to list the elements assigned to each bin after the transformation and discretization. You only need to state whether it is possible for some of the elements to change their bin membership. For example, suppose x3 was originally assigned to bin #2 by the equal-width method, and the transformation converts its value to x3′. If equal-width discretization of the transformed values assigns x3′ to bin #1, your answer for the equal-width method should be “Yes, it is affected by the transformation because …”. If x3′ instead remains in bin #2, your answer should be “No, it is not affected by the transformation because …”.
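To make the setup concrete, the following is a minimal sketch of how equal-width and equal-frequency binning could be compared before and after each transformation. The nine values are made up for illustration (they are not the values from the assignment table), `x.std()` uses the sample standard deviation, and entropy-based discretization is left out because it needs the class labels. It only shows what happens on this one hypothetical data set; the reasoning asked for in the question still has to be argued in general.

```python
import numpy as np
import pandas as pd

# Hypothetical blood cell counts, sorted ascending (NOT the assignment's values).
x = pd.Series([1, 2, 12, 13, 14, 15, 16, 17, 30], dtype=float)

# The three transformations from the question, plus the untransformed values.
transforms = {
    "original": x,
    "centered": x - x.mean(),
    "standardized": (x - x.mean()) / x.std(),  # sample std (ddof=1), an assumption
    "log": np.log(x),
}

# Bin index (0, 1, or 2) of every value under each method and transformation.
equal_width = {
    name: pd.cut(values, bins=3, labels=False).tolist()
    for name, values in transforms.items()
}
equal_freq = {
    name: pd.qcut(values, q=3, labels=False).tolist()
    for name, values in transforms.items()
}

for name in transforms:
    print(f"{name:>12}  width={equal_width[name]}  freq={equal_freq[name]}")
```

Comparing the printed bin indices row by row shows which combinations keep the same bin membership for this particular sample.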

2. Consider the following 2-week bike rental dataset:

The data set contains missing values for the number-of-rentals attribute (denoted as ? in the table). Compare the following three approaches for handling the missing values:

Approach 1: Discard the missing values.

Approach 2: Replace the missing value with the global mean (i.e., average number of rentals for all the non-missing days).

Approach 3: Replace the missing value with the stratified mean. For example, if the missing value is on a weekday, replace it by the average number of rentals for all non-missing weekdays.
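The three approaches can be sketched in pandas as follows. The data frame, its column names (`day`, `day_type`, `rentals`), and all the numbers are hypothetical stand-ins, since the actual table is not reproduced here; the missing days are deliberately different from the ones in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-week data set: 5 weekdays followed by 2 weekend days, twice.
rentals = pd.DataFrame({
    "day": range(1, 15),
    "day_type": (["weekday"] * 5 + ["weekend"] * 2) * 2,
    "rentals": [20, 22, np.nan, 18, 20, 40, np.nan,
                21, 19, 20, 20, 20, 38, 42],
})

# Approach 1: discard the rows with missing values.
dropped = rentals.dropna(subset=["rentals"])

# Approach 2: fill with the global mean of the non-missing days
# (Series.mean skips NaN by default).
global_mean = rentals["rentals"].mean()
global_filled = rentals["rentals"].fillna(global_mean)

# Approach 3: fill with the mean of the same stratum (weekday or weekend).
stratum_mean = rentals.groupby("day_type")["rentals"].transform("mean")
stratified_filled = rentals["rentals"].fillna(stratum_mean)

print("Approach 1 average:", dropped["rentals"].mean())
print("Approach 2 average:", global_filled.mean())
print("Approach 3 average:", stratified_filled.mean())
```

Printing the three averages side by side makes it easy to see how each imputation choice shifts the overall mean on this made-up sample.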

(a) What are the imputed values for day 1 and day 12 using approaches 2 and 3 described above?


(b) Suppose we are interested in calculating the average number of rentals for all days (weekdays and weekends). Which approach, 2 or 3, will give the same average number of rentals for all days as approach 1?


(c) Which of the three approaches is the best way to deal with the missing value problem shown above? State your reasons clearly.


(d) Give a scenario in which approach 1 would be the best way to deal with the missing value problem.


3. In this exercise, you need to write a Python function that will implement the reservoir sampling approach described in class. Your Python function should take three input arguments: name of the input file, sample size (n), and seed (random_state) for the random number generator. The function should return the sample data as a data frame object. For this question, you can use the wiki_edit.txt file from lecture 4 as the input file. Set the sample size to be 10 and the random seed to be 1.


In [ ]:
import pandas as pd

def reservoir_sampling(inputFile, n, random_state):
    """Perform reservoir sampling on the given input file.

    Returns a DataFrame with n records sampled uniformly at random
    from the input file."""
    with open(inputFile, 'r') as f:
        pass  # your reservoir sampling code goes here

In [ ]:
sample = reservoir_sampling('wiki_edit.txt', 10, 1)
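For orientation, here is a minimal sketch of one common form of reservoir sampling (often called Algorithm R). It is an assumption that this matches the variant described in class. The function name `reservoir_sample_lines` and the `record` column are hypothetical, each line of the file is treated as one record, and a throwaway temporary file stands in for wiki_edit.txt.

```python
import os
import random
import tempfile

import pandas as pd


def reservoir_sample_lines(input_file, n, random_state):
    """Keep a uniform random sample of n lines while reading the file once."""
    rng = random.Random(random_state)  # private generator seeded for repeatability
    reservoir = []
    with open(input_file, "r") as f:
        for i, line in enumerate(f):
            record = line.rstrip("\n")
            if i < n:
                # The first n records fill the reservoir directly.
                reservoir.append(record)
            else:
                # Record i (0-based) enters with probability n / (i + 1),
                # replacing a uniformly chosen slot.
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n:
                    reservoir[j] = record
    # "record" is a hypothetical column name holding the raw line.
    return pd.DataFrame({"record": reservoir})


# Demo on a temporary 100-line file standing in for wiki_edit.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(f"row {k}" for k in range(100)) + "\n")
demo_sample = reservoir_sample_lines(tmp.name, 10, 1)
os.remove(tmp.name)
print(demo_sample)
```

Because the file is read line by line and only n records are held at any time, memory use does not grow with the size of the input file.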


5. Decision tree construction.

(a) Draw a decision tree that would perfectly classify the dataset shown below. The dataset has 2 predictor attributes, denoted as x1 and x2, and its examples belong to 3 classes (denoted as A, B, and C). You can draw the tree using any software you want (e.g., PowerPoint), save it as a jpeg/bmp/png image, and attach it to the notebook.

Solution: Attach your decision tree figure here.

(b) Consider the following training data for predicting whether there will be traffic congestion on a given segment of an interstate highway. Each data point corresponds to a particular time of day and is labeled as either positive (if the highway segment was congested) or negative (if it was not congested). Suppose you are interested in building a decision tree classifier on the training data. Compute the overall Gini index for each predictor attribute (construction and weather condition) and select the best attribute to partition the training data.
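The overall Gini index of a split is the weighted average of the Gini index of each partition, with weights proportional to partition size. The sketch below illustrates that calculation. The class counts are invented for illustration because the training table is not reproduced here, and the helper names `gini` and `weighted_gini` are hypothetical.

```python
def gini(counts):
    """Gini index of one node, given its class counts, e.g. (positive, negative)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(partitions):
    """Overall Gini index of a split: size-weighted average over its partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)


# Hypothetical (positive, negative) counts for each value of each attribute.
splits = {
    "construction": [(8, 2), (2, 8)],
    "weather": [(5, 5), (5, 5)],
}

scores = {name: weighted_gini(parts) for name, parts in splits.items()}
best = min(scores, key=scores.get)  # the lower the Gini index, the purer the split
print(scores, "best:", best)
```

The attribute with the lowest overall Gini index yields the purest partitions and would be chosen for the split.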


6. Classifier Evaluation

(a) Consider the decision tree shown below for classifying a dataset that contains examples that belong to 2 classes (positive or negative). The distribution of training examples that were assigned to each leaf node is also shown in the diagram. Draw a 2 × 2 confusion matrix of the tree.


(b) Calculate the training error rate of the decision tree.


(c) Calculate the F1-measure of the tree for the training data.
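As a rough guide to the arithmetic in parts (a) to (c), the sketch below builds a confusion matrix from per-leaf class counts and derives the training error rate and F1-measure from it. The leaf counts are invented because the tree diagram is not reproduced here. It also assumes each leaf predicts its majority class, with ties going to the positive class.

```python
# Hypothetical (positive, negative) training counts at each leaf of the tree.
leaves = [(30, 5), (10, 20), (4, 31)]

tp = fn = fp = tn = 0
for pos, neg in leaves:
    if pos >= neg:
        # Leaf predicts positive (tie broken toward positive, an assumption).
        tp += pos
        fp += neg
    else:
        # Leaf predicts negative.
        fn += pos
        tn += neg

total = tp + fn + fp + tn
error_rate = (fp + fn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Confusion matrix: rows are actual classes, columns are predicted classes.
print("            pred +  pred -")
print(f"actual +   {tp:6d}  {fn:6d}")
print(f"actual -   {fp:6d}  {tn:6d}")
print(f"error rate = {error_rate:.3f}, F1 = {f1:.3f}")
```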


