COMP90051 Statistical Machine Learning Project 1
1 Overview
Text generation has become an increasingly popular task with the rise of natural language processing (NLP) techniques and advancements in deep learning. Given a short text prompt written by a human, text generation employs over-parameterised models to generate response text: a likely sequence of text that might follow the prompt, based on enormous training datasets of text from news articles, online libraries of books, and from scraping the web. While text generation has a wide range of applications, including chatbots, language translation, and content creation, it also poses a significant challenge in ensuring content authenticity, accuracy, and authoritativeness. This is where text generation detection comes in, which is the process of identifying whether a given text is machine-generated or human-written. Developing effective text generation detection models is important because it can help prevent the spread of fake news, misinformation, and propaganda.
Your task:
Your task is to predict whether given text input instances have been generated by a human or a machine, given training data (features and labels) and test data (features only). You will participate as part of a group of three students in a Kaggle competition, where you upload your test predictions over the course of the project. Your mark (detailed below) will be based on your test prediction performance and a short report documenting your solution.
You will be provided with training data in two different domains, dataset1 from domain1 and dataset2 from domain2. Each dataset contains both human-generated and machine-generated text data. The data from different domains are collected from different topics. You may choose to use this fact in training. You only need to predict whether an instance is generated by a human or a machine. The performance of your approach will be evaluated through testing on test data from both domain1 and domain2.
We do not require you to have background experience in NLP, as the datasets have been preprocessed into tokens and mapped to indices in 0,…,83582, with special token 0 for unknown tokens. This means that the data is represented numerically. Your goal is to focus on ML aspects of the task. To get you started, a popular baseline approach for this type of problem is called a bag-of-words model (though you are not required to implement this approach).
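As a rough illustration of the bag-of-words idea on index-encoded text, the sketch below counts how often each token index occurs in an instance and ignores word order. The function names and the example sequence are made up for illustration; only the vocabulary range 0,…,83582 comes from the description above.

```python
from collections import Counter

VOCAB_SIZE = 83583  # indices 0..83582, as stated in the task description


def bag_of_words(token_ids):
    """Map a sequence of token indices to a sparse {index: count} dict."""
    counts = Counter(token_ids)
    # Guard against indices outside the documented vocabulary range.
    for idx in counts:
        if not 0 <= idx < VOCAB_SIZE:
            raise ValueError(f"token index {idx} outside 0..{VOCAB_SIZE - 1}")
    return dict(counts)


def to_dense(bow, size=VOCAB_SIZE):
    """Expand a sparse bag-of-words dict into a dense count vector (a list)."""
    vec = [0] * size
    for idx, count in bow.items():
        vec[idx] = count
    return vec


example = [5, 17, 5, 0, 42, 17, 5]  # made-up token indices for illustration
bow = bag_of_words(example)
```

The sparse dictionary form is usually preferable here, since each instance touches only a small fraction of the 83,583 possible indices; the resulting count vectors can then be passed to any standard classifier.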
There are two key considerations to this task. Firstly, the two datasets are drawn from distinct domains, but you will not be told from which domain a test sample originates. Researchers in machine learning have developed methods for this type of setting; for example, you might want to search online for the keywords domain generalisation, domain adaptation, multitask learning, and ensemble learning. Secondly, there is a label imbalance in the training data from domain2: you have fewer human-generated samples. The test set has a balanced label distribution for both domains, so you may want to consider how to achieve good classification accuracy on both classes in such a situation. For example, you might search online for the keywords imbalanced classification, over/under sampling, and data augmentation. We encourage you to begin with a simple approach; we don’t guarantee that complex approaches will perform better. Ultimately, you are measured on your performance on predicting labels from both domain1 and domain2 in the test data.
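One simple way to address the label imbalance mentioned above is random oversampling, which duplicates randomly chosen minority-class samples until both classes are the same size. The sketch below is a minimal version under the assumption of binary labels 0 and 1; the function name and the seed parameter are illustrative, and this is only one of several possible strategies (class weighting or undersampling are alternatives).

```python
import random


def oversample_minority(samples, labels, seed=0):
    """Duplicate random minority-class samples until both classes are equal in size."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))] if xs else []
        pairs.extend((x, y) for x in xs + extra)

    rng.shuffle(pairs)
    new_samples = [x for x, _ in pairs]
    new_labels = [y for _, y in pairs]
    return new_samples, new_labels
```

If you hold out validation data, oversample only the training portion after splitting; otherwise duplicates of the same sample can end up on both sides of the split and inflate validation accuracy.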
2 Dataset
2.1 Training data
We have two training datasets: one coming from domain1, and another from domain2. You can think of these domains as different data sources or data distributions. Each contains both machine-generated and human-authored samples.
The training data is given in newline delimited JSON format, such that each line contains an instance and each instance is a dictionary with keys:
- text: the sequence of words, after light preprocessing, where each word has been mapped to an index in {0,…,83582}
- label: a binary label where 0 represents machine-generated data and 1 represents human-generated.
Two files are provided:
• domain1.json: 5,000 samples (2,500 of each class).
• domain2.json: 13,000 samples (1,500 human-generated samples, 11,500 AI-generated samples).
2.2 Kaggle Submission Format
The test data consists of 4,000 samples, split evenly between datasets and classes (i.e. 1,000 of each class per domain). You will need to submit your predictions on the 4,000 test instances to Kaggle at least once during the project. To accomplish this, you will place your 4,000 predictions (where 0 and 1 represent machine and human labels, respectively) in a file of a certain format (described next) and upload this to Kaggle.
If your predictions are 0 for the first test instance, 1 for the second test instance, and 1 for the third test instance, then your output file should be as follows in CSV format:
id,class
0,0
1,1
2,1
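A file in this layout can be produced with Python's standard csv module. The sketch below assumes the ids are simply the 0-based positions of the test instances, as in the three-row example above; the function name is illustrative.

```python
import csv
import io


def write_submission(predictions, fileobj):
    """Write 0/1 predictions as rows of `id,class`, with ids counting from 0."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "class"])
    for i, pred in enumerate(predictions):
        writer.writerow([i, int(pred)])


# Demonstration with an in-memory buffer and the three predictions from the text.
buffer = io.StringIO()
write_submission([0, 1, 1], buffer)
submission_text = buffer.getvalue()
```

To write a real file, open it with `open("submission.csv", "w", newline="")` and pass the file object in place of the buffer; the file name here is only an example.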
The test set will be used by the Kaggle platform to generate an accuracy measurement for your performance. You may submit test predictions multiple times per day (if you wish). Section 6 describes rules for who may submit; in short, you may only submit to Kaggle as a team, not individually. During the competition, the accuracy on a 50% subset of the test set will be used to rank you in the public leaderboard. We will use the other 50% of the test set to determine your final accuracy and ranking. The split of the test set during/after the competition is used to discourage you from constructing algorithms that overfit on the leaderboard. The training data, the test set, and a sample submission file “sample.csv” will be available on the Kaggle competition website. In addition to using the competition test data, so as to prevent overfitting, we encourage you to generate your own validation data from the training set and test your algorithms with that validation data.
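One possible way to build such validation data is a stratified hold-out split, which keeps the class proportions of the training set in both parts. The sketch below assumes the newline-delimited JSON layout from Section 2.1 (one dictionary with keys text and label per line); the function names, the 20% validation fraction, and the seed are illustrative choices, not requirements.

```python
import json
import random


def load_jsonl(path):
    """Read newline-delimited JSON: one {"text": [...], "label": 0 or 1} dict per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def stratified_split(instances, val_fraction=0.2, seed=0):
    """Split instances into (train, val), sampling each label group separately."""
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst["label"], []).append(inst)

    train, val = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list order is untouched
        rng.shuffle(group)
        n_val = int(round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

For example, `train, val = stratified_split(load_jsonl("domain2.json"))` would hold out a fifth of each class. Because this preserves the training imbalance of domain2 while the test set is balanced, you may also want to look at per-class accuracy on the validation part rather than overall accuracy alone.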