The goal of this project is to develop one (or more) machine learning systems that operate on the given real-world dataset(s). In doing so, you will apply tools and techniques that we have covered in class. You may also confront and solve issues that occur in practice and that have not been covered in class. Discussion sessions and past homework problems will provide you with some tips and pointers for this; also internet searches and piazza discussions will likely be helpful.
Projects can be individual (one student working on the project) or a team (2 students working together on one project).
You will have significant freedom in designing what you will do for your project.
You must cover various topics that are listed below (as “Required Elements”); the methods you use, and the degree of depth you go into for each topic, are up to you. And you are encouraged to do more than just the required elements (or to dive deeper than required on some of them).
Everyone will choose their project topic based on the three topic datasets listed below.
Collaboration and comparing notes on piazza may be helpful, and looking for pertinent information on the internet can be useful. However each student or team is required to do their own work, coding, and write up their own results.
There are 3 topic datasets to choose from:
(i) Power consumption of Tetouan City
Problem type: regression
(ii) Student performance (in Portuguese schools)
Problem type: classification or regression
(iii) Algerian forest fires
Problem type: classification
These datasets are described in the appendix, below.
Note: for each topic dataset, you are required to use the training and test sets
provided on D2L. Some of the features have been preprocessed (e.g., normalization, transformation, deletion, noise removal or addition). Additionally, we want everyone to use the same training set and test set, so that everyone’s system can be compared by the same criteria.
Which topic datasets can you use?
Individual projects should choose any one dataset. If it is the student performance dataset, then pick either the classification or regression version.
Team projects should choose any one dataset. If it is the student performance dataset, then do both classification and regression versions. If it is the forest fire dataset or the power consumption dataset, then you are expected to explore the topic in more depth than a typical individual project. For example, this could mean doing more feature engineering, trying more (nonlinear) feature expansion/reduction, or trying more models.
Computer languages and available code
You must use Python as your primary language for the project. You may use Python and its built-in functions, NumPy, scikit-learn, and matplotlib. You may find LibSVM or SVMLight useful, and may use them as well. Additionally, you may use imblearn (imbalanced-learn) functions for undersampling or oversampling of your data if that would be useful for your project; you may use pandas only for reading, writing, and parsing csv files; and you may use a function or class for RBF network implementation (e.g., scipy.interpolate.Rbf). Within these guidelines, you may use any library functions or methods, and you may write your own code, functions, and methods.
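As an illustration of the allowed RBF tooling, here is a minimal sketch using scipy.interpolate.Rbf on toy 1-D data; the data values and the Gaussian kernel choice are hypothetical, not taken from any project dataset.

```python
import numpy as np
from scipy.interpolate import Rbf

# Toy 1-D regression data (hypothetical, for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 4.0, 9.0])

# Fit a Gaussian RBF interpolant; with the default smooth=0 it
# passes exactly through the training points.
rbf = Rbf(x, y, function="gaussian")
pred = rbf(np.array([1.5]))  # predict at a new input
```

Note that `Rbf` is a legacy scipy interface; check its documentation for the kernel (`function`) and smoothing options before relying on it.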
Please note that for library routines (functions, methods, classes) you use, it is your responsibility to know what the routine does, what its parameters mean, and that you are setting the parameters and options correctly for what you want the routine to do.
Use of C/C++ is generally discouraged (for your own time commitments and because we didn’t cover it in relation to ML). However, there could be some valid reasons to use C/C++ for some portions of your code, e.g. for faster runtime. If you want to use it, we recommend you check with the TAs or instructor first.
Be sure to state in your project report what languages and toolboxes/libraries you used; what you coded yourself specifically for this class project; and of course any code you use from other sources must be credited as such.
- The items below give the minimal set of elements you are required to include in your project, for each dataset you report on. Note that you are welcome and encouraged to do more than the minimal required elements (for example, where you are required to use one method, you are welcome to try more than one method and compare the results). Doing more work will increase your workload score, might increase your interpretation score, and might improve your final system’s performance.
- EE 559 content
o The majority of your work must use algorithms or methods from EE 559 (covered in any part of the semester).
o You may also (optionally) try algorithms and methods that were not covered in EE 559 for comparison; and describe the method and results in your report.
- Consider preprocessing: use if appropriate [Discussion 7, 8, 9]
◦ Tip: you might find it easier to let Python (using pandas) handle csv parsing.
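A minimal sketch of that tip, using an in-memory string in place of an actual D2L csv file; the column names here are hypothetical placeholders.

```python
import io
import pandas as pd

# Stand-in for a D2L training file; in practice pass the file path,
# e.g. pd.read_csv("train.csv"). Column names are hypothetical.
csv_text = "feature1,feature2,label\n1.0,2.0,0\n3.0,4.0,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Split into a feature matrix X and target vector y for scikit-learn.
X = df[["feature1", "feature2"]].to_numpy()
y = df["label"].to_numpy()
```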
◦ Normalization or standardization. It is generally good practice to consider these.
It is often beneficial if different features have significantly different ranges of values. Normalization or standardization doesn’t have to be all features or none; for example, binary variables are typically not standardized.
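One way to apply this selectively is sketched below with scikit-learn’s StandardScaler, standardizing only the continuous columns of a toy array and leaving a binary column untouched; the data and column layout are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two continuous features plus one binary feature (last column).
X_train = np.array([[10.0, 0.1, 1.0], [20.0, 0.2, 0.0], [30.0, 0.3, 1.0]])
X_test = np.array([[25.0, 0.25, 0.0]])

# Fit the scaler on the TRAINING data only, then apply the same
# transform to the test data; leave the binary column as-is.
scaler = StandardScaler().fit(X_train[:, :2])
X_train_std = np.hstack([scaler.transform(X_train[:, :2]), X_train[:, 2:]])
X_test_std = np.hstack([scaler.transform(X_test[:, :2]), X_test[:, 2:]])
```

Fitting on the training set only (and reusing that scaler on validation/test data) avoids leaking test-set statistics into training.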
◦ For classification problems, if the dataset is significantly unbalanced, then some methods for dealing with that should be included. If it’s moderately unbalanced, then it might be worth trying some methods to see if they improve the performance. Some approaches to this are done by preprocessing of the data.
◦ Representation of categorical (ordinal or cardinal) input data should be considered. Non-binary categorical-valued features usually should be changed to a different representation.
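A common such change of representation is one-hot encoding, sketched below with scikit-learn on a hypothetical 3-valued categorical feature (the category names are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature (e.g., a region name).
regions = np.array([["north"], ["south"], ["east"], ["north"]])

# One-hot encoding replaces a k-valued categorical feature with
# k binary columns; handle_unknown="ignore" maps unseen test-set
# categories to all-zero rows instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(regions).toarray()
```

For ordinal features (where the category order is meaningful), an integer encoding that preserves the order may be more appropriate than one-hot columns.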
- Consider feature-space dimensionality adjustment: use if appropriate
◦ You can use a method to try reducing and/or expanding the dimensionality, and to choose a good dimensionality. Use the d.o.f. and constraints as an initial guide on what range of dimensionality to try.
◦ In addition to feature-reduction methods we covered in EE 559, feel free to try others (some others are mentioned in the forest fires dataset description).
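As one concrete option, a feature-reduction step with PCA might look like the sketch below; the data is random toy data, and the choice of 3 components is arbitrary — in your project you would sweep over candidate dimensionalities and pick via (cross-)validation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 6))  # toy training data with 6 features

# Fit PCA on training data only, then transform; try several values
# of n_components and compare validation performance to choose one.
pca = PCA(n_components=3).fit(X_train)
X_reduced = pca.transform(X_train)
```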
- Cross validation or validation
◦ Generally it’s best to use cross validation for choosing parameter values, comparing different models or classifiers, and/or for dimensionality adjustment.
If you have lots of data and you’re limited by computation time, you might instead use validation without cross-validation.
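A minimal cross-validation sketch for one candidate parameter value is shown below; the synthetic dataset and the choice of k-nearest-neighbors with k=3 are placeholders for your own training set and candidate model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a training set; use your own X_train, y_train.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# 5-fold cross-validation for one candidate parameter value (k=3);
# repeat over a grid of parameter values and keep the best mean score.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
cv_mean, cv_std = scores.mean(), scores.std()
```

Reporting both the mean and the standard deviation of the fold scores (as required below under Performance evaluation) follows directly from the `scores` array.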
- Training and prediction
◦ Individual projects should try at least 3 different classification/regression techniques that we have covered (or will cover) in class. Team projects should cover at least 4 classification/regression techniques, at least 3 of which are covered in EE 559. Beyond this, feel free to optionally try other methods.
Note that the required trivial and baseline systems don’t count toward the 3 or 4 required classifiers or regressors, unless substantial additional work is done to optimize one of them so that it becomes one of your chosen systems.
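One way to organize such a comparison of several techniques is sketched below; the synthetic data and the three particular models (perceptron, k-nearest-neighbors, RBF-kernel SVM) are illustrative choices, not a prescribed set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a training set.
X, y = make_classification(n_samples=120, n_features=4, random_state=1)

# Compare candidate techniques by mean cross-validation accuracy;
# in a real project each model's parameters would also be tuned.
models = {
    "perceptron": Perceptron(),
    "knn": KNeighborsClassifier(),
    "svm_rbf": SVC(kernel="rbf"),
}
cv_means = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
```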
- Proper dataset (and subset) usage
◦ Final test set (as given), training set, validation sets, cross validation.
- Interpretation and analysis
◦ Explain the reasoning behind your approach. For example, how did you decide whether to do normalization or standardization? And if you did use it, what difference did it make in system performance? Can you explain why?
◦ Analyze and interpret intermediate results and final results. Especially, if some results seem surprising or weren’t what you expected. Can you explain (or hypothesize reasons for) what you observe?
◦ (Optional) If you hypothesize a reason, what could you run to verify or refute the hypothesis? Try running it and see.
◦ (Optional) Consider what would be helpful if one were to collect new data, or collect additional data, to make the prediction problem give better results. Or,what else could be done to potentially improve the performance. Suggest this in your report and justify why it could be helpful.
- Reference systems and comparison
◦ At least one trivial system and one baseline system are required. Each dataset description states what to use for these systems.
◦ Run, and then compare with, the baseline system(s). The goal is to see how much your systems can improve over the baseline system’s performance.
◦ Also run, and compare with, the trivial system. The trivial system doesn’t look at the input values for each prediction; comparing with it helps you assess whether your systems have learned anything at all.
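The exact trivial and baseline systems to use are specified in each dataset description in the appendix. As a generic illustration of a trivial classifier that ignores its inputs, scikit-learn’s DummyClassifier can predict the majority training class for every sample (the toy data below is made up):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy training set: inputs are irrelevant to a trivial system.
X_train = np.zeros((10, 2))
y_train = np.array([0] * 7 + [1] * 3)  # class 0 is the majority

# "most_frequent" always predicts the majority class, regardless of input.
trivial = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
preds = trivial.predict(np.zeros((4, 2)))
```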
- Performance evaluation
o Report on the cross-validation (or validation) performance (mean and standard deviation); it is recommended to use one or more of the required performance measures stated in your dataset’s description (in the Appendix, below). If you use other measure(s), justify in your report why.
o Report the test-set performance of your final (best-model) system.
▪ For your final-system test-set performance, you may use the best parameters found from model selection, and re-train your final system using all the training data to get the final weight vector(s).
▪ Report on all required performance measures listed in the dataset description.
o You may also report other measures if you like. Use appropriate measures for your dataset and problem.
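The retrain-then-evaluate step described above can be sketched as follows; the synthetic split stands in for the given D2L training and test sets, and the model, its parameter, and the accuracy/F1 measures are illustrative (use the measures your dataset description requires).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the provided training and test sets.
X, y = make_classification(n_samples=150, n_features=5, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Re-train the chosen model (with parameters selected by cross-validation)
# on ALL the training data, then evaluate once on the held-out test set.
final_model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = final_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred)
```

The test set is touched only once, at the very end; all model selection happens on the training data via (cross-)validation.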
- Written final report and code
◦ Submit by uploading your report and code in the formats specified below, to D2L.