Python Machine Learning Assignment Help | Final Project of MSDM5054

This assignment is mainly a machine learning classification project in Python.

Data info: The goal is to classify the types of postings based on their content. The dataset is a tiny
version of the 20newsgroups data, with binary occurrence data for 100 keywords across 16242 postings.
The file “wordlist.txt” lists the 100 keywords. The file “documents.txt” is essentially a 16242×100
occurrence matrix where each row corresponds to one posting and each column corresponds to one
keyword. The occurrence matrix has binary entries: the (i,j)-th entry is 1 if and only if the i-th posting
contains the j-th keyword. Since the occurrence matrix is extremely sparse, “documents.txt” stores it in a
sparse representation: each line of “documents.txt” represents one nonzero entry of the occurrence
matrix. For instance, the first line of “documents.txt” is “1 23 1”, which means
that entry (1,23) of the occurrence matrix is 1, i.e., the 1st posting contains the 23rd keyword.
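
The sparse format described above can be turned back into a matrix along the following lines. This is only a sketch: it assumes the file is whitespace-separated with 1-based "posting keyword value" triplets, as the “1 23 1” example suggests, and the function name load_occurrence_matrix is made up for illustration.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix


def load_occurrence_matrix(source, n_docs=16242, n_words=100):
    """Build the binary occurrence matrix from 'posting keyword value' triplets.

    Assumes 1-based indices in the file (hypothetical helper, not part of the assignment).
    """
    triplets = np.loadtxt(source, dtype=int, ndmin=2)
    rows = triplets[:, 0] - 1  # posting index, shifted to 0-based
    cols = triplets[:, 1] - 1  # keyword index, shifted to 0-based
    vals = triplets[:, 2]
    return coo_matrix((vals, (rows, cols)), shape=(n_docs, n_words)).tocsr()


# Tiny in-memory stand-in for "documents.txt" (made-up lines except the first).
sample = io.StringIO("1 23 1\n1 5 1\n3 2 1\n")
X_small = load_occurrence_matrix(sample, n_docs=3, n_words=25)
print(X_small[0, 22])  # entry (1,23) in the 1-based notation of the text
```

For the real data one would pass the path "documents.txt" instead of the in-memory sample and keep the default shape of 16242×100.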

The file “newsgroup.txt” has 16242 lines, where the i-th line gives the group label of the i-th posting. There
are 4 different groups: “comp.”, “rec.”, “sci.” and “talk.”. The goal is to predict the
type of a posting, i.e., which of the 4 groups it belongs to, based on the words in the posting.

1. Build a random forest for this dataset and report the 5-fold cross-validation (CV)
misclassification error. Note that you need to tune the model yourself, i.e., decide how many predictors
are chosen in each tree and how many trees are used. There is no benchmark. Stop tuning when
you feel it is appropriate. Report the best CV error, the corresponding confusion matrix and the tuning
parameters. What are the ten most important keywords based on variable importance?

2. Build a boosted tree model for this dataset and report the 5-fold cross-validation
misclassification error. Similarly, report the best CV error, the corresponding confusion matrix and the
tuning parameters. Note that the R example in the textbook only considers binary classification,
but the library ‘gbm’ can deal with the multi-class case by setting ‘distribution=multinomial’.

3. Compare the results from random forest and boosting trees.

4. Build a multi-class LDA classifier. Report the 5-fold CV misclassification error and the confusion
matrix.

5. Build a multi-class QDA classifier. Report the 5-fold CV misclassification error and the confusion
matrix.

6. Compare the performance of all the methods above and give your comments.
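
The tasks above share one pattern: fit a classifier, estimate its 5-fold CV misclassification error, and tabulate a confusion matrix. A minimal sketch of that pattern in Python with scikit-learn follows. It is not the assignment's reference solution: the data are synthetic stand-ins, every hyperparameter value is a placeholder, GradientBoostingClassifier is swapped in for the R ‘gbm’ library mentioned in task 2, and the helper cv_report is a hypothetical name.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def cv_report(model, X, y, seed=0):
    """Return (5-fold CV misclassification error, confusion matrix)."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    pred = cross_val_predict(model, X, y, cv=cv)  # out-of-fold predictions
    return float(np.mean(pred != y)), confusion_matrix(y, pred)


# Synthetic stand-in for the real data (assumption, for illustration only):
# 4 groups, 12 binary keywords, 3 keywords more frequent within each group.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 100)
probs = np.full((4, 12), 0.1)
for k in range(4):
    probs[k, 3 * k : 3 * k + 3] = 0.7
X = (rng.random((400, 12)) < probs[y]).astype(float)

# Placeholder settings, not tuned values.
models = {
    "random forest": RandomForestClassifier(
        n_estimators=50, max_features=3, random_state=0
    ),
    "boosting": GradientBoostingClassifier(
        n_estimators=30, learning_rate=0.1, max_depth=2, random_state=0
    ),
    "LDA": LinearDiscriminantAnalysis(),
    # reg_param regularizes the per-class covariance estimates, which can
    # otherwise be singular on sparse binary features.
    "QDA": QuadraticDiscriminantAnalysis(reg_param=0.1),
}
results = {name: cv_report(model, X, y) for name, model in models.items()}
for name, (err, cm) in results.items():
    print(f"{name}: CV error = {err:.3f}")

# Tuning sketch for the forest: small grid over number of trees and
# number of predictors considered per split (max_features).
grid = [(n, m) for n in (25, 50) for m in (2, 3, 4)]
rf_errors = {
    (n, m): cv_report(
        RandomForestClassifier(n_estimators=n, max_features=m, random_state=0), X, y
    )[0]
    for n, m in grid
}
best_rf = min(rf_errors, key=rf_errors.get)

# Variable importance from a forest refit on all data with the best setting.
forest = RandomForestClassifier(
    n_estimators=best_rf[0], max_features=best_rf[1], random_state=0
).fit(X, y)
top10 = np.argsort(forest.feature_importances_)[::-1][:10]  # column indices
```

On the real data, the indices in top10 would be mapped to words through “wordlist.txt”. LDA and QDA in scikit-learn expect a dense array, so a sparse occurrence matrix would need to be converted first; at 16242×100 that is inexpensive.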

Part II. Spectral Clustering on 20newsgroup Data

1. Apply PCA on the binary occurrence matrix and apply K-means clustering. Basically, take the top
4 left singular vectors of the occurrence matrix (of size 16242×100) and apply K-means on the
rows of these singular vectors with K=4. Report the mis-clustering error rate.

2. Now take the top 5 left singular vectors of the occurrence matrix and apply K-means on the rows
of these singular vectors with K=4. Report the mis-clustering error rate.

3. Compare with the performance of the methods from Part I.
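
The clustering steps of Part II can be sketched as below. Two assumptions are worth flagging: the singular vectors are taken from the raw (uncentered) occurrence matrix, following the wording of task 1, and the mis-clustering error is computed as the fraction of postings that disagree with the true groups under the best one-to-one matching of clusters to groups, since the assignment does not spell out a definition. The function name and the synthetic data are illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans


def spectral_kmeans_error(X, y, n_vectors=4, n_clusters=4, seed=0):
    """K-means on the top left singular vectors; returns the mis-clustering rate.

    X must be a dense array (call .toarray() on a sparse matrix first).
    """
    # Thin SVD: singular values come back in descending order, so the first
    # columns of U are the top left singular vectors.
    U, _, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    embedding = U[:, :n_vectors]
    clusters = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)

    # Contingency table: rows are true groups, columns are clusters.
    _, y_idx = np.unique(y, return_inverse=True)
    table = np.zeros((y_idx.max() + 1, n_clusters), dtype=int)
    np.add.at(table, (y_idx, clusters), 1)

    # Hungarian algorithm: match clusters to groups to maximize agreement.
    rows, cols = linear_sum_assignment(-table)
    return float(1.0 - table[rows, cols].sum() / len(y_idx))


# Synthetic stand-in (assumption): 4 groups, each with 5 characteristic keywords.
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(4), 50)
probs = np.full((4, 20), 0.05)
for k in range(4):
    probs[k, 5 * k : 5 * k + 5] = 0.9
X_demo = (rng.random((200, 20)) < probs[y_demo]).astype(float)

err4 = spectral_kmeans_error(X_demo, y_demo, n_vectors=4)
err5 = spectral_kmeans_error(X_demo, y_demo, n_vectors=5)
print(f"top 4 vectors: {err4:.3f}, top 5 vectors: {err5:.3f}")
```

The matching step is needed because K-means cluster labels are arbitrary, so comparing them to the group labels directly would overstate the error.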

 

