CS178: Machine Learning & Data Mining
Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload
every possible classifier and check their leaderboard quality.
Your submission must be a file containing two columns separated by a comma.
The first column should be the instance number (a positive integer), and the
second column is the score for that instance (probability that it equals class +1).
The first line of the file should be “ID,Prob1”, the names of the two columns. We
have released a sample submission file, containing random
predictions, named Y_random.txt.
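The file format described above can be produced with a few lines of Python. This is only a minimal sketch: the function name `write_submission`, the output path, and the `first_id` argument are our own illustrative choices, and we assume instance numbers are consecutive and start at 1. Check Y_random.txt to confirm the numbering before you upload.

```python
import csv

def write_submission(path, probs, first_id=1):
    """Write P(class = +1) scores in the two-column "ID,Prob1" format.

    first_id is an assumption (consecutive IDs starting at 1); match it to
    the numbering used in the released sample file.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["ID", "Prob1"])            # required header line
        for offset, p in enumerate(probs):
            if not 0.0 <= p <= 1.0:                 # scores are probabilities
                raise ValueError(f"score {p} is not in [0, 1]")
            writer.writerow([first_id + offset, f"{p:.6f}"])
```

For example, `write_submission("Y_submit.txt", [0.25, 0.9])` writes the header followed by one "ID,score" row per test instance.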
Forming a Project Team
Students will work in teams of three to complete the project. We
encourage you to start looking for teammates now; one option is to use the
“Search for Teammates!” page on Piazza
(http://piazza.com/uci/spring2020/cs178). In exceptional circumstances, if you are
not able to form a team of three students, smaller teams are allowed. However,
the same grading standards are applied to all teams, so smaller teams should
expect a larger workload.
Once you’ve identified your teammates, on the Team tab in Kaggle
(https://www.kaggle.com/c/uci-cs178-spr20/team), merge with your teammates to
form an integrated team. (We know that merging may make your individual HW4
score disappear from the leaderboard, and will not penalize you for this when
For grading, your project team is defined when your report is uploaded to
Gradescope. One team member should upload your PDF to the Gradescope site,
and Gradescope will then allow that person to select the other team members.
Use the “View or edit group” option on Gradescope to be sure this is done
correctly. Do not upload multiple copies of the project report; only one team
member should upload.
Each project team will learn several different classifiers for the Kaggle data, as
well as an ensemble “blend” of them, to try to predict class labels as accurately
as possible. We expect you to experiment with at least three (more is good)
different types of classification models. Suggestions include:
1. K-Nearest Neighbors. KNN models for this data will need to overcome two
issues: the large number of training & test examples, and the data dimension.
As noted in class, distance-based methods often do not work well in high
dimensions, so you may need to perform some kind of feature selection
process to decide which features are most important. Also, computing
distances between all pairs of training and test instances may be too slow;
you may need to reduce the number of training examples somehow (for
example by clustering), or use more efficient algorithms to find nearest
neighbors. Finally, the right “distance” for prediction may not be Euclidean in
the original feature scaling (these are raw numbers); you may want to
experiment with scaling features differently.
2. Linear models. Since you have relatively few input features but a large
amount of training data, you will probably need to define non-linear features
for top performance, for example using polynomials or radial basis functions.
3. Kernel methods. libSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) is one
efficient implementation of SVM training algorithms. But like KNN classifiers,
SVMs (with non-linear kernels) can be challenging to learn from large
datasets, and some data pre-processing or subsampling may be required.
4. Random forests. You will explore decision tree classifiers for this data on
homework 4, and random forests would be a natural way to improve
5. Boosted learners. Use AdaBoost, gradient boosting, or another boosting
algorithm to train a boosted ensemble of some base learner (perceptrons,
shallow decision trees, Gaussian naive Bayes models, etc.).
6. Neural networks. The key to learning a good NN model on these data will be
to ensure that your training algorithm does not become trapped in poor local
optima. You should monitor its performance across backpropagation
iterations on training/validation data, and verify that predictive performance
improves to reasonable values. Start with few layers (2-3) and moderate
numbers of hidden nodes (100-1000) per layer, and verify improvements over
baseline linear models.
7. Other. You tell us! Apply another class of learners, or a variant or
combination of methods like the above. You can use existing libraries or
modify course code. The only requirement is that you understand the model
you are applying, and can clearly explain its properties in the project report.
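As one illustration of the KNN suggestions above (rescaling features, and avoiding one huge distance matrix), here is a minimal NumPy sketch. It is not the course code: the function names `standardize` and `knn_predict_proba`, the default `k`, and the batch size are our own assumptions, and the positive class is assumed to be labeled 1 in `y_train`.

```python
import numpy as np

def standardize(X_train, X_other):
    """Rescale to zero mean / unit variance using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                 # leave constant features unscaled
    return (X_train - mu) / sigma, (X_other - mu) / sigma

def knn_predict_proba(X_train, y_train, X_test, k=5, batch=1000):
    """Score each test point by the fraction of its k nearest neighbors in class 1."""
    probs = np.empty(X_test.shape[0])
    train_sq = (X_train ** 2).sum(axis=1)
    # Process test points in batches so the distance matrix stays small.
    for start in range(0, X_test.shape[0], batch):
        Xb = X_test[start:start + batch]
        # Squared Euclidean distance: ||a||^2 - 2 a.b + ||b||^2
        d2 = (Xb ** 2).sum(axis=1)[:, None] - 2.0 * (Xb @ X_train.T) + train_sq[None, :]
        # Indices of the k smallest distances per row (unordered within the k).
        nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
        probs[start:start + batch] = (y_train[nearest] == 1).mean(axis=1)
    return probs
```

To reduce the training set, one simple option is a random subsample, for example `keep = np.random.default_rng(0).choice(X_train.shape[0], size=m, replace=False)` for some `m` of your choosing, followed by `X_train[keep], y_train[keep]`. Clustering or a tree-based neighbor search are alternatives worth comparing on validation data.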
For each learner, you should do enough work to make sure that it achieves
“reasonable” performance, with accuracy similar to (or better than) baselines
like logistic regression or decision trees. Then, take your best learned models,
and combine them using a blending or stacking technique. This could be done
via a simple average/vote, or a weighted vote based on another learning
algorithm. Feel free to experiment and see what performance gains are possible.
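A simple or weighted average of model outputs, as described above, might look like the following sketch. The function name `blend` and the way weights are supplied are our own assumptions. In a stacking approach, the weights would instead be learned by another model trained on held-out predictions.

```python
import numpy as np

def blend(prob_columns, weights=None):
    """Weighted average of per-model probabilities (equal weights by default).

    prob_columns: one sequence of P(class = +1) scores per model, all of the
    same length and in the same instance order.
    """
    P = np.column_stack(prob_columns)        # shape (n_instances, n_models)
    if weights is None:
        weights = np.ones(P.shape[1])
    w = np.asarray(weights, dtype=float)
    if w.shape != (P.shape[1],) or np.any(w < 0) or w.sum() == 0:
        raise ValueError("need one non-negative weight per model, not all zero")
    return P @ (w / w.sum())                 # normalized, so outputs stay in [0, 1]
```

For example, `blend([p_knn, p_forest], weights=[1, 3])` would trust the second model three times as much as the first (`p_knn` and `p_forest` are hypothetical prediction arrays). Choose weights using validation data rather than the leaderboard, since uploads are limited.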