CS699 A1 – Spring 2021 Project Assignment

There are three options. You must choose one of these options.

The project must be performed by a team of two students. Every team must present their project.

The first thing you need to do is to form a team and let me know your team members by 2/1. Since most of you are attending the class remotely, it may not be easy to find a project partner. If you cannot find a partner and don’t let me know your team members by 2/1, I will randomly assign team members.

Once you form a team, you must choose your project option and let me know by 2/8.

Option 1
The goal of option 1 is to give students an opportunity to perform a classification data mining task.

You choose a real world dataset, define your own data mining goal (a classification), and perform necessary data mining tasks to achieve the goal. It is strongly suggested that you choose a data mining goal that has a potential for practical use. It is also strongly suggested that you find a “fresh” dataset, which, to the best of your knowledge, was rarely used by other people. You should not select a synthetically generated dataset and you should not use a dataset from UCI Machine Learning Repository. You must also avoid using a dataset on the Kaggle website that has been used by many people. You may want to check government (federal, state, or municipal) websites.

Once you build data mining models, you must evaluate the data mining result using appropriate performance measures.

The following specifies minimum requirements. You can choose a larger dataset and you can perform additional tasks not mentioned in the requirements if you want.

  •   The project must be “classification.”
  •   Dataset minimum requirements:
    o At least 20 attributes
    o At least 300 tuples

    If you are interested in a certain dataset but it does not meet the above requirements, indicate that in your proposal. I will review it and may approve it.

  •   Data mining minimum requirements
    o You need to consider at least four attribute selection methods implemented in Weka, plus a set of attributes chosen by yourself. (A sketch of running two such methods through Weka's Java API appears after this list.)
    o You need to build classifier models using at least five different classifier algorithms for each chosen set of attributes. So, you need to build and test a total of at least 25 classifier models.
    o You may try any data preprocessing/preparation/transformation to increase the performance of your classifier models.

  •   Model testing
    o Once you complete data preprocessing, you must split your dataset into a training dataset and a test dataset. You must make sure that the class distribution is preserved in both datasets.
    o You build your models from the training dataset and you test your models on the test dataset.
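For teams working with Weka's Java API rather than the Explorer GUI, the sketch below shows how two of the built-in attribute selection methods might be run programmatically. This is a minimal illustration, not a required approach: the file name project.arff is a placeholder, the class attribute is assumed to be last, and CfsSubsetEval/BestFirst and InfoGainAttributeEval/Ranker are just two of the many methods Weka ships with. For the project itself, attribute selection should be run on the training split only.

    import java.util.Arrays;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttributeSelectionDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder file name; the class attribute is assumed to be last.
            Instances data = DataSource.read("project.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Method 1: CFS subset evaluation searched with best-first.
            AttributeSelection cfs = new AttributeSelection();
            cfs.setEvaluator(new CfsSubsetEval());
            cfs.setSearch(new BestFirst());
            cfs.SelectAttributes(data);
            System.out.println("CFS + BestFirst: "
                    + Arrays.toString(cfs.selectedAttributes()));

            // Method 2: information-gain ranking, keeping the top 10 attributes.
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(10);
            AttributeSelection ig = new AttributeSelection();
            ig.setEvaluator(new InfoGainAttributeEval());
            ig.setSearch(ranker);
            ig.SelectAttributes(data);
            System.out.println("InfoGain + Ranker: "
                    + Arrays.toString(ig.selectedAttributes()));
        }
    }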
The following is a simplified illustration of the above process (originally a diagram):

  •   Initial dataset D (after preprocessing): m attributes, n tuples.
  •   D is split into a training dataset, Project-training (m attributes, r tuples, r = about 66% of n), and a test dataset, Project-test (m attributes, s tuples, s = about 34% of n). The split must be stratified.
  •   An attribute selection method selects k attributes from the training dataset, producing a reduced training dataset with k (< m) attributes and r tuples.
  •   The same k attributes are selected from the test dataset, producing a reduced test dataset with k attributes and s tuples.
  •   A classification algorithm builds a model from the reduced training dataset, and the model is tested on the reduced test dataset.
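A minimal Java sketch of this split-reduce-train-test pipeline follows. All specifics are assumptions: project.arff is a placeholder file name, the attribute indices in selected are hypothetical, and J48 stands in for whichever algorithm you use. StratifiedRemoveFolds with 3 folds is used here to approximate a stratified 66/34 split.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.StratifiedRemoveFolds;
    import weka.filters.unsupervised.attribute.Remove;

    public class SplitTrainTest {

        // Stratified split into 3 folds; invert = true keeps ~2/3 (training),
        // invert = false keeps fold 1, ~1/3 (test). Same seed -> complementary sets.
        static Instances fold(Instances data, boolean train) throws Exception {
            StratifiedRemoveFolds f = new StratifiedRemoveFolds();
            f.setNumFolds(3);
            f.setFold(1);
            f.setInvertSelection(train);
            f.setSeed(1); // fixed seed so the split is reproducible
            f.setInputFormat(data);
            return Filter.useFilter(data, f);
        }

        // Keep only the k selected attributes (plus the class); drop the rest.
        static Instances reduce(Instances data, int[] keep) throws Exception {
            Remove r = new Remove();
            r.setAttributeIndicesArray(keep);
            r.setInvertSelection(true); // invert: the listed indices are kept
            r.setInputFormat(data);
            return Filter.useFilter(data, r);
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("project.arff"); // initial dataset D
            data.setClassIndex(data.numAttributes() - 1);

            Instances train = fold(data, true);  // r tuples, about 66% of n
            Instances test  = fold(data, false); // s tuples, about 34% of n

            // Hypothetical k attributes; in the project these come from an
            // attribute selection method run on the training dataset.
            int[] selected = {0, 2, 5, data.classIndex()};
            Instances redTrain = reduce(train, selected);
            Instances redTest  = reduce(test, selected);
            redTrain.setClassIndex(redTrain.numAttributes() - 1);
            redTest.setClassIndex(redTest.numAttributes() - 1);

            J48 model = new J48();           // stand-in for one of your algorithms
            model.buildClassifier(redTrain); // build on the reduced training set
            Evaluation eval = new Evaluation(redTrain);
            eval.evaluateModel(model, redTest); // test on the reduced test set
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString()); // TP/FP rates, ROC area
            System.out.println(eval.toMatrixString());       // confusion matrix
        }
    }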

  •   Performance comparison
    o Compare the performance of all 25 classifier models you built using the following performance measures: accuracy, TP rates, FP rates, ROC curve (or area under curve), and other measures if you want.
    o Choose one model that you think is the best for your data mining goal. You need to justify why you chose that model. (A sketch of producing one comparison row per model appears below.)
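One way such a comparison table might be produced with Weka's Evaluation class is sketched below. The file names project-training.arff and project-test.arff and the two algorithms shown are placeholders; in the actual project the helper would be called once per model, giving 25 rows.

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CompareModels {

        // One comparison row per model: accuracy plus weighted TP/FP rates and AUC.
        static void report(String name, Classifier c, Instances train, Instances test)
                throws Exception {
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            System.out.printf("%-12s acc=%.2f%% TPR=%.3f FPR=%.3f AUC=%.3f%n",
                    name, eval.pctCorrect(), eval.weightedTruePositiveRate(),
                    eval.weightedFalsePositiveRate(), eval.weightedAreaUnderROC());
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical file names; use one reduced train/test pair per attribute set.
            Instances train = DataSource.read("project-training.arff");
            Instances test = DataSource.read("project-test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            report("J48", new J48(), train, test);
            report("NaiveBayes", new NaiveBayes(), train, test);
            // ...repeat for the remaining algorithms and attribute sets (25 rows total).
        }
    }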

Schedule and Deliverables

(Only one member of each team needs to submit deliverables)

  1. Proposal
    1. Due: 2/16
    2. Include the names of your team members.
    3. Dataset description: You must include the source of your dataset and a detailed description of it. Your description must include the names and meanings of all attributes as well as the number of tuples and the number of attributes.
    4. Clearly state your data mining goal (e.g., I want to predict whether a new customer will buy a computer or not).
    5. Clearly indicate which attribute is the class attribute.
    6. You also need to submit your dataset.
  2. Progress report
    1. Due: 3/8
    2. You must submit a detailed description of what you did by this time.
  3. Final project report due: 3/29

    You must submit all project documentation as described below. This is a hard deadline and there will be a 10% late penalty per day after the deadline.

  4. Project report

    a. A project report should include:
      (1) Cover page
      (2) Statement of your data mining goal
      (3) Detailed description of the dataset
      (4) Detailed description of data mining tool(s) or algorithm(s) you used
      (5) Detailed description of the data mining procedure (the procedure you actually followed), including all data preprocessing you performed. You must show the attributes selected by each attribute selection method and the attributes you chose.
      (6) Data mining result and evaluation:
        1. You must include all performance measures, including confusion matrices, from Weka's output window for all 25 models.
        2. You must present your result using tables, graphs, charts, or another visual format so that readers of your report can easily and effectively understand your result.
        3. Justification for your selection of the best model
      (7) Discussion and conclusion, including what you learned from this project.
    b. In your report, you must clearly state what each team member did for this project.
    c. Your report must be at least 10 pages long (12pt font, single spaced).

  5. When you submit your project report, you also need to submit all datasets, including:
    a. Initial dataset
    b. The dataset after preprocessing
    c. Initial training and test datasets
    d. The training and test datasets that were used for your best model
    e. Other intermediate dataset(s) if needed

  6. Other deliverables may be required based on the nature of your individual project, which will be determined after I have more information about your project.

7. Presentation:

    1. Each team will have 15 – 20 minutes for presentation.
    2. All students must be present in class during the presentations. If you do not attend a presentation (when other teams are presenting), 3 points will be deducted for each missed presentation.

Grading

  •   Project overall and project report: 70 points
  •   Presentation: 20 points
  •   Participation: 10 points

    Project overall and report (70)

  •   Project report is due 3/29. There is no grace period and there will be a late penalty of 10 points per day if you submit late.
  •   Whether the data mining result is practically usable. If your dataset was not used by other people (to the best of your and my knowledge), your project has potential for some practical use, and the performance of your model is reasonably good, then you may get extra credit of up to 10 points.
  •   Technical soundness of your approach. If your approach is not technically sound, up to 10 points will be deducted.
  •   The performance of your best classification model. Note that there is no performance threshold which is used to grade your project. This is because different datasets and different data mining goals can result in different performance. I will use my own judgement considering your dataset and your data mining goal. If the performance of your models is very low (e.g., 60% or lower accuracy), then you must try to increase the performance and/or try to explain why it is so low. If you do not address such a low performance in one way or another, up to 10 points will be deducted.
  •   Whether all necessary components are included in the documentation. Otherwise, up to 15 points will be deducted.
  •   Organization of your documentation. If your documentation is poorly organized, up to 10 points will be deducted.

  •   If your results are not effectively presented using tables, graphs, or charts, up to 10 points will be deducted.
  •   Whether your discussion and conclusion is substantive and technically and logically sound. Otherwise, up to 10 points will be deducted.
  •   Progress report grading:
    •   10 points will be deducted if you do not submit a progress report.
    •   Up to 6 points will be deducted if your progress report does not include a detailed description of what you did.

Presentation (20)

  •   Presentations will be done on 4/12, 4/21, and 4/26
  •   The order of presentation will be determined randomly.
  •   Presentation slides are due as follows:

    o Teams presenting on 4/12: 4/9 (Friday)
    o Teams presenting on 4/21: 4/16 (Friday)
    o Teams presenting on 4/26: 4/23 (Friday)
    o If you submit late, there will be 1 point late penalty per day.

    Your presentation will be graded based on the following criteria.

  •   Whether the presentation accurately represents what you did. Otherwise, up to 3 points will be deducted.
  •   Whether presentation material is well organized in describing what you did. Otherwise, up to 3 points will be deducted.
  •   Whether graphs and/or tables were well utilized to present the result. Otherwise, up to 3 points will be deducted.
  •   Whether questions are properly answered. Otherwise, up to 3 points will be deducted.

    Participation (10)

  •   If a student misses a presentation, 3 points will be deducted for each missed presentation.

Important

It is very important that I should be able to reproduce your data mining model and data mining result based on your documentation. So, the description of your data mining procedure, including all preprocessing you performed, must be detailed and accurate. If I cannot reproduce your model and result, you will lose up to 40 points.

Option 2

Option 2 is an experiment to determine whether a bagging method and a boosting method increase the performance of classifier models. Follow the instructions given below.

  •   Select 20 datasets for classification.
  •   Select 5 classification algorithms.
  •   For each dataset D and each classifier algorithm A, perform the following:
    •   Run A on D, with 10-fold cross-validation chosen as the test method, and collect the following performance measures: TPR, FPR, F-measure, and AUC.
    •   Run Bagging with A on D, with 10-fold cross-validation chosen as the test method, and collect the following performance measures: TPR, FPR, F-measure, and AUC for each class.
    •   Run AdaBoostM1 with A on D, with 10-fold cross-validation chosen as the test method, and collect the following performance measures: TPR, FPR, F-measure, and AUC for each class.
  •   You must repeat the above 100 times (20 datasets x 5 classifier algorithms).
  •   Then, organize your result, present your result (as a table, graph, or any other format), and draw your conclusion. Try to be creative when you present your result so that it is effectively conveyed to readers of your report. Remember that your goal is to determine whether those ensemble methods increase classifier performance. (A sketch of one dataset–algorithm run appears below.)
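The sketch below shows what one dataset-algorithm cell of this experiment might look like in Weka's Java API. It is an illustration under assumptions: dataset01.arff is a placeholder file name and J48 stands in for algorithm A; Bagging and AdaBoostM1 are the Weka meta-classifiers named above, here left at their default settings.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.meta.Bagging;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class EnsembleExperiment {

        // 10-fold cross-validation; print TPR, FPR, F-measure, AUC for each class.
        static void crossValidate(String name, Classifier c, Instances data)
                throws Exception {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // fixed seed
            for (int cls = 0; cls < data.numClasses(); cls++) {
                System.out.printf("%s class=%s TPR=%.3f FPR=%.3f F=%.3f AUC=%.3f%n",
                        name, data.classAttribute().value(cls),
                        eval.truePositiveRate(cls), eval.falsePositiveRate(cls),
                        eval.fMeasure(cls), eval.areaUnderROC(cls));
            }
        }

        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset01.arff"); // one of the 20 datasets
            data.setClassIndex(data.numAttributes() - 1);

            Classifier base = new J48(); // stand-in for one of the 5 algorithms A

            Bagging bagged = new Bagging();
            bagged.setClassifier(new J48());

            AdaBoostM1 boosted = new AdaBoostM1();
            boosted.setClassifier(new J48());

            crossValidate("A alone", base, data);
            crossValidate("Bagging(A)", bagged, data);
            crossValidate("AdaBoostM1(A)", boosted, data);
        }
    }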

    Schedule and Deliverables

    (Only one member of each team needs to submit deliverables)

    1. Proposal
      •   Due: 2/16
      •   Submit all datasets you chose.
      •   Description of all datasets. For each dataset, you must include:
        o The name of the dataset
        o The number of tuples and the number of attributes
        o Names and meanings of all attributes
        o Name of the class attribute and the class distribution
        o Source of the dataset
      •   Names of the classification algorithms you chose
    2. Progress report
      •   Due: 3/8
      •   Submit a detailed description of what you did by this time.
    3. Project report
      •   Due: 3/29
      •   Your project report must include:
        o Cover page
        o If you performed any preprocessing on any dataset, a detailed description of the preprocessing you performed; you also need to submit the final dataset that was created after the preprocessing.
        o Result of the experiment: You need to present your result using tables, graphs, charts, or another visual format so that readers of your report can easily and effectively understand your result.
        o Discussion and conclusion

Grading
  •   Project overall and project report: 70 points
  •   Presentation: 20 points
  •   Participation: 10 points

    Project overall and report (70)

  •   Project report is due 3/29. There is no grace period and there will be a late penalty of 10 points per day if you submit late.
  •   Progress report grading:
    •   10 points will be deducted if you do not submit a progress report.
    •   Up to 6 points will be deducted if your progress report does not include a detailed description of what you did.
  •   If the whole or part of the experiment is not technically sound/correct, up to 20 points will be deducted.
  •   Whether all necessary components are included in the documentation. Otherwise, up to 15 points will be deducted.
  •   Organization of your documentation. If your documentation is poorly organized, up to 10 points will be deducted.
  •   Whether your discussion and conclusion is substantive and technically and logically sound. Otherwise, up to 10 points will be deducted.
  •   If the presentation of your result is considered “excellent,” you will get an extra 10 points.

Presentation (20)

  •   Schedule: Same as Option 1
  •   Your presentation will be graded based on the following criteria.
    •   Whether the presentation accurately represents what you did. Otherwise, up to 3 points will be deducted.
    •   Whether presentation material is well organized in describing what you did. Otherwise, up to 3 points will be deducted.

  •   Whether graphs and/or tables were well utilized to present the result. Otherwise, up to 3 points will be deducted.
  •   Whether questions are properly answered. Otherwise, up to 3 points will be deducted.

    Participation (10)

  •   If a student misses a presentation, 3 points will be deducted for each missed presentation.

Option 3

This option is an experiment to compare an undersampling method and an oversampling method for handling unbalanced datasets. Follow the instructions given below.

  •   Select at least 10 unbalanced datasets for classification. Make sure that the class attribute is a binary attribute and the fraction of the minority class is no more than 20%.
  •   Select 5 classification algorithms.
  •   For each dataset D and each classifier algorithm A, perform the following:
    •   Split D into a training dataset Dtr and a test dataset Dts. Use about 2/3 as the training dataset and 1/3 as the test dataset. Make sure that the class distribution is preserved.
    •   Build a classifier model using algorithm A from the training dataset Dtr, test the model on the test dataset Dts, and collect the following performance measures: TPR, FPR, F-measure, and AUC for each class.
    •   From the training dataset Dtr, create an undersampled dataset Dtr-us.
    •   Build a classifier model using algorithm A from the undersampled training dataset Dtr-us, test the model on the test dataset Dts, and collect the following performance measures: TPR, FPR, F-measure, and AUC for each class.
    •   From the training dataset Dtr, create an oversampled dataset Dtr-os.
    •   Build a classifier model using algorithm A from the oversampled training dataset Dtr-os, test the model on the test dataset Dts, and collect the following performance measures: TPR, FPR, F-measure, and AUC for each class.

  •   You must repeat the above at least 50 times (at least 10 datasets x 5 classifier algorithms).
  •   Then, organize your result, present your result (as a table, graph, or any other format), and draw your conclusion. Try to be creative when you present your result so that it is effectively conveyed to readers of your report. Remember that your goal is to determine whether undersampling or oversampling is better for unbalanced datasets.
  •   You may try other methods to address the issue of unbalanced datasets for classification. (A sketch of one dataset–algorithm run appears below.)
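The sketch below illustrates one way a single dataset-algorithm run might be coded with Weka's built-in filters, assuming placeholder file names d-train.arff and d-test.arff for the stratified split and J48 as a stand-in for A. SpreadSubsample is one Weka option for undersampling and the supervised Resample filter with a uniform-class bias is one option for oversampling; the 200% sample size is an arbitrary assumption. (The SMOTE filter, available as a Weka package, is another oversampling choice.)

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.Resample;
    import weka.filters.supervised.instance.SpreadSubsample;

    public class SamplingExperiment {

        // Undersample the majority class down to a 1:1 class ratio.
        static Instances undersample(Instances train) throws Exception {
            SpreadSubsample us = new SpreadSubsample();
            us.setDistributionSpread(1.0); // max class ratio 1:1
            us.setInputFormat(train);
            return Filter.useFilter(train, us);
        }

        // Oversample (with replacement) toward a uniform class distribution.
        static Instances oversample(Instances train) throws Exception {
            Resample os = new Resample();
            os.setBiasToUniformClass(1.0);  // push toward a 50/50 distribution
            os.setSampleSizePercent(200.0); // hypothetical output size: 2x Dtr
            os.setInputFormat(train);
            return Filter.useFilter(train, os);
        }

        // Build on the given training data, always test on the untouched Dts.
        static void evaluate(String name, Instances train, Instances test)
                throws Exception {
            Classifier c = new J48(); // stand-in for one of the 5 algorithms A
            c.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);
            for (int cls = 0; cls < test.numClasses(); cls++) {
                System.out.printf("%s class=%s TPR=%.3f FPR=%.3f F=%.3f AUC=%.3f%n",
                        name, test.classAttribute().value(cls),
                        eval.truePositiveRate(cls), eval.falsePositiveRate(cls),
                        eval.fMeasure(cls), eval.areaUnderROC(cls));
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical file names for a stratified ~2/3 / ~1/3 split of D.
            Instances dtr = DataSource.read("d-train.arff");
            Instances dts = DataSource.read("d-test.arff");
            dtr.setClassIndex(dtr.numAttributes() - 1);
            dts.setClassIndex(dts.numAttributes() - 1);

            evaluate("Dtr as-is", dtr, dts);
            evaluate("Dtr-us", undersample(dtr), dts);
            evaluate("Dtr-os", oversample(dtr), dts);
        }
    }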

Schedule and Deliverables

(Only one member of each team needs to submit deliverables)

1. Proposal

  •   Due: 2/16
  •   Submit all datasets you chose.
  •   Description of all datasets. For each dataset, you must include:
    o The name of the dataset
    o The number of tuples and the number of attributes
    o Names and meanings of all attributes
    o Name of the class attribute
    o Which class is the minority class and which class is the majority class, and also the ratio of the two
    o Source of the dataset
  •   Names of the classification algorithms you chose

  2. Progress report
    •   Due: 3/8
    •   Submit a detailed description of what you did so far.
  3. Project report
    •   Due: 3/29
    •   Your project report must include:
      •   Cover page
      •   If you performed any preprocessing on any dataset, a detailed description of the preprocessing you performed; you also need to submit the final dataset that was created after the preprocessing.
      •   Result of the experiment: You need to present your result using tables, graphs, charts, or another visual format so that readers of your report can easily and effectively understand your result.
      •   Discussion and conclusion

        Grading

    •   Project overall and project report: 70 points
    •   Presentation: 20 points
    •   Participation: 10 points

      Project overall and report (70)

    •   Project report is due 3/29. There is no grace period and there will be a late penalty of 10 points per day if you submit late.
  •   Progress report grading:
    o 10 points will be deducted if you do not submit a progress report.
    o Up to 6 points will be deducted if your progress report does not include a detailed description of what you did.

  •   If the whole or part of the experiment is not technically sound/correct, up to 20 points will be deducted.
  •   Whether all necessary components are included in the project report. Otherwise, up to 15 points will be deducted.
  •   Organization of your documentation. If your documentation is poorly organized, up to 10 points will be deducted.
  •   Whether your discussion and conclusion is substantive and technically and logically sound. Otherwise, up to 10 points will be deducted.
  •   If the presentation of your result is considered “excellent” you will get extra 10 points.

    Presentation (20)

  •   Schedule: Same as Option 1
  •   Your presentation will be graded based on the following criteria.
    •   Whether the presentation accurately represents what you did. Otherwise, up to 3 points will be deducted.
    •   Whether presentation material is well organized in describing what you did. Otherwise, up to 3 points will be deducted.
    •   Whether graphs and/or tables were well utilized to present the result. Otherwise, up to 3 points will be deducted.
    •   Whether questions are properly answered. Otherwise, up to 3 points will be deducted.

      Participation (10)

  •   If a student misses a presentation, 3 points will be deducted for each missed presentation.