CS 5710: Data Mining Program #6

This assignment is based on the MicroArray Quality Control (MAQC-II) study that I participated
in as a postdoctoral researcher. We will be using endpoints D, E, J, K, L, and M described in Table
1 here: https://cs.appstate.edu/~rmp/cs5710/maqc-ii.pdf
1 The Data
Some of the key challenges of these data are that there are relatively few samples (hundreds)
compared to the number of features (tens of thousands), providing an example of the "curse of
dimensionality." Another challenge is that the class labels, although binary, are not balanced:
depending on the endpoint, there may be 90% from one class and only 10% from the other. You
can find one exploration of these data in a paper I coauthored:
https://www.nature.com/articles/tpj201056
The training sets we will be using are available here:
https://cs.appstate.edu/~rmp/cs5710/endpoint_d_train.csv
https://cs.appstate.edu/~rmp/cs5710/endpoint_e_train.csv
https://cs.appstate.edu/~rmp/cs5710/endpoint_j_train.csv
https://cs.appstate.edu/~rmp/cs5710/endpoint_k_train.csv
https://cs.appstate.edu/~rmp/cs5710/endpoint_l_train.csv
https://cs.appstate.edu/~rmp/cs5710/endpoint_m_train.csv
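The two challenges described above can be illustrated with a small sketch. The layout of the training CSV files is not described here, so the example below does not load them; it uses synthetic data whose shape (200 samples, 10,000 features) and roughly 10% positive rate are assumptions chosen only to mimic the situation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one endpoint (assumed sizes, not the real data):
# a few hundred samples, tens of thousands of features, ~10% positives.
n_samples, n_features = 200, 10_000
X = rng.normal(size=(n_samples, n_features))
y = (rng.random(n_samples) < 0.10).astype(int)

# Far more features than samples: the "curse of dimensionality" regime.
features_per_sample = n_features / n_samples

# Class proportions. A classifier that always guesses the majority class
# already reaches this plain accuracy without learning anything.
class_counts = np.bincount(y, minlength=2)
majority_fraction = class_counts.max() / n_samples

print(f"shape: {X.shape}, features per sample: {features_per_sample:.0f}")
print(f"class counts: {class_counts}, majority fraction: {majority_fraction:.2f}")
```

Running the same two checks on each real training set is a quick way to see how imbalanced a given endpoint is before choosing a model.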
2 The Code
You will submit one Python file named gene_expression.py that contains six functions. Each
function returns a Pipeline that is ready to be trained using its "fit" method:
# gene_expression.py
def endpoint_d():
    # return a Scikit-Learn Pipeline object
    return pipeline

def endpoint_e():
    # return a Scikit-Learn Pipeline object
    return pipeline

def endpoint_j():
    # return a Scikit-Learn Pipeline object
    return pipeline

def endpoint_k():
    # return a Scikit-Learn Pipeline object
    return pipeline

def endpoint_l():
    # return a Scikit-Learn Pipeline object
    return pipeline

def endpoint_m():
    # return a Scikit-Learn Pipeline object
    return pipeline
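As a minimal sketch of what one such function might return, the example below builds an unfitted Pipeline from standard Scikit-Learn components. The function name example_endpoint, the choice of steps (scaling, univariate feature selection, logistic regression), and every hyperparameter value are illustrative assumptions, not the expected solution; the synthetic data merely stands in for a training set.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def example_endpoint():
    # Hypothetical example: the steps and settings are assumptions,
    # not a recommended or required model for any endpoint.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        # Keep a small subset of features to cope with p >> n.
        ("select", SelectKBest(score_func=f_classif, k=50)),
        # class_weight="balanced" reweights samples to offset label imbalance.
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    return pipeline


# Synthetic, imbalanced stand-in data (not the assignment data).
X, y = make_classification(
    n_samples=120, n_features=200, n_informative=10,
    weights=[0.9, 0.1], random_state=0,
)

model = example_endpoint()   # returned unfitted, ready for fit
model.fit(X, y)
predictions = model.predict(X)
```

Because feature selection sits inside the Pipeline, it is refit on each training fold during cross-validation, which avoids selecting features using held-out samples.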
3 Model Selection and Hyperparameter Tuning
Use the training data to select your model and tune its hyperparameters. Review chapters 2, 3, and
4 for guiding principles. In addition, Scikit-Learn has a useful "User Guide" for model selection and
evaluation: https://scikit-learn.org/stable/model_selection.html
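One way to tune hyperparameters on the training data alone is a cross-validated grid search. The sketch below is an assumption-laden illustration: the pipeline, the grid over the regularization strength C, the five stratified folds, and the synthetic data are all placeholders for whatever model and data are actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the assignment data).
X, y = make_classification(
    n_samples=150, n_features=300, n_informative=10,
    weights=[0.85, 0.15], random_state=0,
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=2000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.01, 0.1, 1.0]}

# Stratified folds keep the class ratio similar in every split,
# which matters when one class is rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipeline, param_grid, scoring="balanced_accuracy", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The tuned values can then be hard-coded into the Pipeline that the corresponding endpoint function returns.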
We will use "balanced accuracy" to evaluate the models, which is the average of sensitivity and
specificity.
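To make the metric concrete, the following sketch computes sensitivity and specificity for a small made-up set of labels and predictions, averages them, and compares the result with Scikit-Learn's balanced_accuracy_score. The label arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

# Made-up example: 8 negatives, 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Sensitivity is recall on the positive class: 1 of 2 positives found.
sensitivity = recall_score(y_true, y_pred, pos_label=1)   # 0.5
# Specificity is recall on the negative class: 7 of 8 negatives found.
specificity = recall_score(y_true, y_pred, pos_label=0)   # 0.875

manual = (sensitivity + specificity) / 2                  # 0.6875
library = balanced_accuracy_score(y_true, y_pred)         # 0.6875

# Plain accuracy is 8/10 = 0.8, which hides the weak positive-class recall.
plain_accuracy = float(np.mean(y_true == y_pred))
```

On imbalanced data the gap between plain accuracy and balanced accuracy shows how much the majority class is masking errors on the minority class.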
You have been provided a training set. Web-CAT will report your performance on a validation set.
Your grade will be determined by your performance on the hidden test set relative to your peers.
4 Submitting on Web-CAT
Web-CAT will use your classifier with a variety of different hyperparameters to see if it performs
the same as the Scikit-Learn GaussianMixture model and produces the correct attribute values.
1. Log in to Web-CAT here and upload the gene_expression.py file to the appropriate
assignment: http://webcatvm.cs.appstate.edu:8080/Web-CAT