This assignment is concerned with some kinds of tasks that occur in practical data mining
situations. In this assignment you are asked to apply a number of algorithms to a number
of data sets and write a report on your findings. Your assignment will be assessed on
demonstrated understanding of concepts, algorithms, methodology, analysis of results
and conclusions. Marks are awarded for meeting requirements as closely as possible (see
rubrics). Please make sure your answers are labelled correctly with the corresponding
part and sub-question numbers, to make it easier for the marker to follow. Please stick
to the required page limits (a penalty may apply).
2 Learning Outcomes
This assessment relates to the following learning outcomes of the course.
CLO 1: Demonstrate advanced knowledge of data mining concepts and techniques.
CLO 2: Apply the techniques of clustering, classification, association finding,
feature selection and visualisation on real-world data.
CLO 4: Apply data mining software and toolkits in a range of applications.
CLO 5: Set up a data mining process for an application, including data preparation,
modelling and evaluation.
CLO 6: Demonstrate knowledge of ethical considerations involved in data mining.
3 Assignment Details
3.1 Part 1: Classification (12 marks)
This part of the assignment is concerned with the file:
The data was supplied by the Garavan Institute and J. Ross Quinlan, NSW, Australia. The
main goal here is to achieve the highest classification accuracy with the lowest amount of
overfitting.
1. Run the following classifiers, with the default parameters, on this data: ZeroR, OneR,
J48, IBk and construct a table of the training and cross-validation errors. You can get the
training error by selecting "Use training set" as the test option. What do you conclude
from these results? Provide your explanation.
Run No   Classifier   Parameters   Training Error   Cross-valid Error   Over-fitting
1        ZeroR        None         30.0%            30.0%               None
...      ...          ...          ...              ...                 ...
2. Using the J48 classifier, can you find a combination of the C and M parameter values that
minimizes the amount of overfitting? Include the results of your best five runs, including
the parameter values, in your table of results. What is your conclusion?
3. Reset J48 parameters to their default values. What is the effect of lowering the number of
examples in the training set? Provide your explanation. Include your runs in your table
of results.
4. Using the IBk classifier, can you find the value of k that minimizes the amount of
overfitting? Provide your explanation. Include your runs in your table of results.
5. Try two other classifiers. Aside from ZeroR, which classifiers are best and worst in terms
of predictive accuracy? Include 5 runs in your table of results. Provide your analysis of
these results.
6. Compare the accuracy of ZeroR, OneR and J48. What do you conclude? Give your
explanation on these results.
7. What golden nuggets did you find, if any?
8. [OPTIONAL for COSC2110] Use an attribute selection algorithm to get a reduced
attribute set. How does the accuracy on the reduced set compare with the accuracy on the
full set? Provide your explanation.
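As background for the baseline comparisons in questions 1 and 6, the following is a minimal Python sketch of the ideas behind ZeroR (always predict the majority class) and OneR (a one-attribute rule). It is not Weka's implementation: it assumes nominal attributes only, ignores Weka's handling of numeric attributes and missing values, and uses a small made-up data set rather than the assignment data. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR idea: always predict the most frequent class in the training labels."""
    return Counter(labels).most_common(1)[0][0]

def one_r(rows, labels):
    """OneR idea: choose the single attribute whose value -> majority-class rule
    makes the fewest training errors. Returns (attribute index, rule, default class)."""
    default = zero_r(labels)
    best = None
    for a in range(len(rows[0])):
        # Count class frequencies for each value of attribute a.
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[a]][y] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    _, attr, rule = best
    return attr, rule, default

def error_rate(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    return sum(p != y for p, y in zip(predictions, labels)) / len(labels)

# Toy, made-up data (NOT the assignment data set): two nominal attributes.
rows = [("yes", "low"), ("yes", "high"), ("no", "low"), ("no", "high"), ("no", "low")]
labels = ["sick", "sick", "healthy", "healthy", "healthy"]

majority = zero_r(labels)
attr, rule, default = one_r(rows, labels)
zero_r_error = error_rate([majority] * len(labels), labels)
one_r_error = error_rate([rule.get(r[attr], default) for r in rows], labels)
print(f"ZeroR training error: {zero_r_error:.1%}")
print(f"OneR training error:  {one_r_error:.1%} (attribute {attr})")
```

Because ZeroR ignores every attribute, its error is a floor that any useful classifier should beat; comparing OneR and J48 against it shows how much is gained from one attribute and from a full tree.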
Report Length: Up to two pages, not including the table of runs.
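One way to keep the table of runs consistent is to compute the over-fitting column from the two error columns. The sketch below assumes over-fitting is measured as cross-validation error minus training error, which is an assumption and not a definition given in the assignment. The first row repeats the ZeroR example from the table in question 1; the second row uses made-up numbers purely to show the arithmetic.

```python
def overfitting_gap(training_error, cv_error):
    """Assumed measure: cross-validation error minus training error, in percentage points."""
    return cv_error - training_error

def format_runs(runs):
    """Render (classifier, parameters, training error %, CV error %) tuples as a text table."""
    row_format = "{:<8}{:<20}{:<14}{:>10}{:>13}{:>14}"
    lines = [row_format.format("Run No", "Classifier", "Parameters",
                               "Training", "Cross-valid", "Over-fitting")]
    for run_no, (name, params, train, cv) in enumerate(runs, start=1):
        gap = overfitting_gap(train, cv)
        gap_text = "None" if gap == 0 else f"{gap:+.1f}"
        lines.append(row_format.format(run_no, name, params,
                                       f"{train:.1f}%", f"{cv:.1f}%", gap_text))
    return "\n".join(lines)

runs = [
    ("ZeroR", "None", 30.0, 30.0),                 # example row from the assignment's table
    ("ExampleClassifier", "made-up", 10.0, 12.5),  # hypothetical numbers, not real results
]
table = format_runs(runs)
print(table)
```

Error values read from the Weka output would replace the placeholder rows; a large positive gap suggests the model fits the training set better than it generalises.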