Data info: the goal is to classify postings by type based on their content. The dataset is a tiny
version of the 20newsgroups data, with binary occurrence data for 100 keywords across 16242 postings.
The file “wordlist.txt” lists the 100 keywords. The file “documents.txt” encodes a 16242×100
occurrence matrix in which each row corresponds to one posting and each column corresponds to one
keyword. The matrix has binary entries: the (i,j)-th entry is 1 if and only if the i-th posting
contains the j-th keyword. Because the occurrence matrix is extremely sparse, “documents.txt” stores
it in a sparse representation: each line represents one nonzero entry of the matrix. For instance, the
first line of “documents.txt” is “1 23 1”, which means that entry (1,23) of the occurrence matrix is 1,
i.e., the 1st posting contains the 23rd keyword.
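The triplet format described above can be read without ever building the full 16242×100 matrix. The following is a minimal Python sketch, assuming each line holds three whitespace-separated integers (row, column, value) with 1-based indices; the function names and the second and third sample lines are made up for illustration, and only “1 23 1” comes from the description.

```python
def load_occurrence_rows(lines):
    """Map each posting id to the set of keyword ids it contains.

    Assumes every line is 'row col value' with 1-based integer indices.
    """
    rows = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        i, j, value = (int(p) for p in parts)
        if value != 0:
            rows.setdefault(i, set()).add(j)
    return rows


def to_dense_row(keyword_ids, n_words=100):
    """Expand one posting's keyword set into a 0/1 list of length n_words."""
    return [1 if j in keyword_ids else 0 for j in range(1, n_words + 1)]


# "1 23 1" is the first line quoted in the text; the other two are invented.
sample_lines = ["1 23 1", "1 45 1", "2 7 1"]
rows = load_occurrence_rows(sample_lines)
dense_first = to_dense_row(rows[1])
```

For the real file, the same function could be called on an open file handle, e.g. `load_occurrence_rows(open("documents.txt"))`, and dense rows expanded only when a classifier needs them.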
The file “newsgroup.txt” has 16242 lines, where the i-th line gives the group label of the i-th posting.
There are 4 different groups: “comp.”, “rec.”, “sci.” and “talk.”. The goal is to predict the
type of a posting, i.e., which of the 4 groups it belongs to, based on the words it contains.
1. Build a random forest for this dataset and report the 5-fold cross-validation
misclassification error. Note that you need to tune the model yourself, i.e., choose how many
predictors are tried at each split and how many trees are used. There is no benchmark; stop tuning
when you feel it is appropriate. Report the best CV error, the corresponding confusion matrix, and the
tuning parameters. What are the ten most important keywords based on variable importance?
2. Build a boosted tree model for this dataset and report the 5-fold cross-validation
misclassification error. Similarly, report the best CV error, the corresponding confusion matrix, and
the tuning parameters. Note that the R example in the textbook only considers binary classification,
but the library ‘gbm’ can handle the multi-class case by setting ‘distribution="multinomial"’.
3. Compare the results from the random forest and the boosted trees.
4. Build a multi-class LDA classifier. Report the 5-fold CV misclassification error and the confusion
matrix.
5. Build a multi-class QDA classifier. Report the 5-fold CV misclassification error and the confusion
matrix.
6. Compare the performance of all the methods above and comment on the results.
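The 5-fold CV misclassification error and the confusion matrix requested in the items above can be computed in the same way for every classifier. The following is a minimal Python sketch of that bookkeeping only, not of any of the models: `fit_majority` and `predict_majority` are a hypothetical placeholder baseline standing in for the random forest, boosting, LDA, or QDA fit, and the toy data is invented.

```python
import random


def kfold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validate(X, y, fit, predict, labels, k=5, seed=0):
    """Return (misclassification error, confusion matrix) pooled over k folds.

    confusion[a][b] counts postings with true label labels[a]
    that were predicted as labels[b].
    """
    pos = {lab: p for p, lab in enumerate(labels)}
    confusion = [[0] * len(labels) for _ in labels]
    for fold in kfold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            confusion[pos[y[i]]][pos[predict(model, X[i])]] += 1
    correct = sum(confusion[a][a] for a in range(len(labels)))
    return 1 - correct / len(y), confusion


# Placeholder baseline (an assumption for illustration): always predict
# the most frequent training label. Swap in a real classifier here.
def fit_majority(X_train, y_train):
    return max(sorted(set(y_train)), key=y_train.count)


def predict_majority(model, x):
    return model


labels = ["comp", "rec", "sci", "talk"]
X_toy = [[i % 2, (i // 2) % 2] for i in range(10)]  # invented features
y_toy = ["comp"] * 4 + ["rec"] * 3 + ["sci"] * 2 + ["talk"]
cv_error, confusion = cross_validate(X_toy, y_toy, fit_majority,
                                     predict_majority, labels)
```

Pooling predictions from all five held-out folds into one confusion matrix gives an error rate over all postings; averaging the five per-fold error rates is a common alternative and differs only slightly when the folds have equal size.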
Part II. Spectral Clustering on 20newsgroups Data
1. Apply PCA to the binary occurrence matrix, then apply K-means clustering. Specifically, take the top
4 left singular vectors of the occurrence matrix (of size 16242×100) and apply K-means with K=4 to the
rows of the resulting 16242×4 matrix. Report the mis-clustering error rate.
2. Now take the top 5 left singular vectors of the occurrence matrix and apply K-means with K=4 to the
rows of the resulting 16242×5 matrix. Report the mis-clustering error rate.
3. Compare with the performance from Part I.
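K-means returns arbitrary cluster ids, so the mis-clustering error rate has to be computed after matching clusters to groups. The following is a minimal Python sketch under one common convention, which is an assumption here since the text does not define the metric: try every one-to-one assignment of clusters to group labels and keep the one with the fewest errors. The function name and the toy labels are invented; the SVD and K-means steps themselves are not shown.

```python
from itertools import permutations


def misclustering_error(true_labels, cluster_ids):
    """Smallest error rate over all one-to-one cluster-to-group relabelings."""
    groups = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(groups):
        raise ValueError("more clusters than groups: no one-to-one relabeling")
    best_correct = 0
    for assigned in permutations(groups, len(clusters)):
        relabel = dict(zip(clusters, assigned))
        correct = sum(relabel[c] == t
                      for t, c in zip(true_labels, cluster_ids))
        best_correct = max(best_correct, correct)
    return 1 - best_correct / len(true_labels)


# Invented toy example: six postings, K-means ids in arbitrary order.
true_toy = ["comp", "comp", "rec", "rec", "sci", "talk"]
cluster_toy = [2, 2, 0, 0, 3, 0]
toy_error = misclustering_error(true_toy, cluster_toy)
```

With 4 groups there are only 24 relabelings to try, so the exhaustive search is cheap even for all 16242 postings.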