The goal of this project is to build and critically analyse supervised Machine Learning methods for predicting the sentiment of tweets. That is, given a tweet, your model(s) will predict the sentiment expressed in that tweet. You will be provided with a data set of tweets that have been annotated with positive, negative, and neutral sentiments. The assessment provides you with an opportunity to reflect on concepts in machine learning in the context of an open-ended research problem, and to strengthen your skills in data analysis and problem-solving.
The goal of this assignment is to critically assess the effectiveness of various Machine Learning classification algorithms on the problem of determining a tweet’s sentiment and to express the knowledge that you have gained in a technical report. The technical side of this project will involve applying appropriate machine learning algorithms to the data to solve the task.
The focus of the project will be the report, formatted as a short research paper. In the report, you will demonstrate the knowledge that you have gained, in a manner that is accessible to a reasonably informed reader.
- Report: An anonymous written report, of 1900 (±10%) words (for a group of one person) or 2500 (±10%) words (for a group of two people) including reference list, figure captions and tables. Your name and student ID should not appear anywhere in the report, including the metadata (filename, etc.). Submitted as a single PDF file through Canvas/Turnitin.
- Output: Sentiment predictions for the test instance dataset. Submitted as a single CSV file through Canvas/Turnitin. (You also need to submit your prediction file to the Kaggle in-class competition described in section 6.)
- Code: One or more programs, written in Python, including all the code necessary to reproduce the results in your report (model implementation, label prediction, and evaluation). Your code should be executable and have enough comments to make it understandable. Submitted as a zip file through Canvas/Turnitin.
- Reviews of two reports written by other students, of 200-300 words each (for a group of one person) or 300-400 words each (for a group of two people).
NOTE 1: Stage I submissions will be open one week before the due date. Stage II submissions will be open as soon as the reports are available (24 hours following the Stage I submission deadline).
NOTE 2: If you decide to work in a group of two, ONLY one of you needs to register the group via the provided link. Likewise, all submissions (on Canvas and Kaggle) should be made by ONLY one member of the group.
3 Data Set
You are provided with a labelled training set of tweets, and an unlabelled test set which will be used for final evaluation in the Kaggle in-class competition. In the training set, each row of the data file contains a tweet ID, the tweet text and the sentiment for that tweet. For example:
Tweet_ID, “if i didnt have you i’d never see the sun. #mtvstars lady gaga”, positive
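As a rough illustration of this row layout, the sketch below reads a few rows into parallel lists of texts and labels. The header names (`tweet_id`, `text`, `sentiment`), the numeric IDs and the second row are placeholders invented for the example; check the provided files for the actual column names.

```python
import csv
import io

# In-memory stand-in for the training file. Column names, IDs and the
# second row are hypothetical, chosen only to mirror the described layout.
sample = """tweet_id,text,sentiment
101,"if i didnt have you i'd never see the sun. #mtvstars lady gaga",positive
102,"stuck in traffic, again",negative
"""

# DictReader handles quoted fields, so commas inside a tweet are preserved.
rows = list(csv.DictReader(io.StringIO(sample)))

texts = [row["text"] for row in rows]        # one instance per row
labels = [row["sentiment"] for row in rows]  # absent in the test set

print(len(texts), labels)
```

For the real files, replacing `io.StringIO(sample)` with an opened file handle should give the same structure.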
The test dataset has a similar format, except that the rows do not include a sentiment (label). You are expected to treat each row of the dataset as an instance. To process these instances, you need to convert them into feature vectors. There are many methods for vectorizing textual instances. We have provided you with two examples.
- BoW (Bag of Words)
In the given feature_analysis.ipynb file, you are provided with a basic piece of code that uses CountVectorizer to transform the training tweets into vectors of Term_IDs and their counts. For example, with CountVectorizer the above tweet will be transformed into the following vector:
[(51027, 1), (44650, 1), (40410, 1), (43384, 1), (22275, 1), (13438, 1), (20604, 1), …]
Here 51027 is the Term_ID for the word ‘you’, 44650 is the Term_ID for the word ‘the’, and so on. You can use and edit this basic code to vectorise your test and train datasets. There are many modifications you can use to experiment with different hypotheses you may have. For example, you could investigate how removing very frequent and/or very infrequent words affects the behaviour of your Machine Learning models.
There are many more examples.
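The Bag of Words step and the frequency-based pruning idea can be sketched as follows. This is not the code from feature_analysis.ipynb; it is a minimal example on toy tweets, and the `min_df` and `max_df` thresholds are arbitrary values chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the provided data; only the first tweet is from the handout.
train_tweets = [
    "if i didnt have you i'd never see the sun. #mtvstars lady gaga",
    "never see the rain again",
    "the sun is out and you are here",
]
test_tweets = ["you never see the moon"]

# Baseline: learn the vocabulary from the training tweets only.
bow = CountVectorizer()
X_train = bow.fit_transform(train_tweets)  # sparse matrix, one row per tweet
X_test = bow.transform(test_tweets)        # reuse the SAME vocabulary; unseen words are ignored

# Variant: drop terms that appear in fewer than 2 tweets (min_df)
# or in more than 90% of the tweets (max_df).
pruned = CountVectorizer(min_df=2, max_df=0.9)
X_pruned = pruned.fit_transform(train_tweets)

# Show the first tweet as (Term_ID, count) pairs, as in the handout's example.
first_row = X_train[0].toarray().ravel()
pairs = [(term_id, int(count)) for term_id, count in enumerate(first_row) if count]
print(pairs)
print(sorted(pruned.vocabulary_))
```

Fitting on the training tweets and only transforming the test tweets keeps both sets in the same feature space, which any downstream classifier will require.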
- TFIDF (Term Frequency-Inverse Document Frequency)
You are also provided with a basic piece of code that uses TfidfVectorizer to transform the tweets into vectors of values that measure the importance of each term using the following formula:
$$w_{d,t} = tf_{d,t} \times \log\frac{N}{df_t}$$
where $tf_{d,t}$ is the frequency of term t in document d, $df_t$ is the number of documents containing t, and N is the total number of documents in the collection. You can learn more about TFIDF in (Schutze, 2008).
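To make the roles of these three quantities concrete, the sketch below computes the weights by hand on three toy documents. It assumes the common textbook weighting tf × log(N / df) with a natural logarithm; the toy documents and the helper name `tfidf_weights` are illustrative only. Note that scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalises each vector by default, so its values will not match this hand computation exactly.

```python
import math
from collections import Counter

# Toy collection of tokenised documents (hypothetical, for illustration only).
docs = [
    "never see the sun".split(),
    "see the rain".split(),
    "the sun the sun".split(),
]
N = len(docs)  # total number of documents

# df[t]: number of documents that contain term t at least once.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    """Return {term: tf * ln(N / df)} for one tokenised document."""
    tf = Counter(doc)  # tf[t]: frequency of term t in this document
    return {term: tf[term] * math.log(N / df[term]) for term in tf}

weights = [tfidf_weights(doc) for doc in docs]

for doc, w in zip(docs, weights):
    print(" ".join(doc), {t: round(v, 3) for t, v in w.items()})
```

A term that occurs in every document, such as "the" here, receives weight zero under this scheme, while a term that is frequent in one document but rare in the collection receives a high weight.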
Using TFIDF the above example tweet will be transformed to the following vector:
[(51027, 0.17), (44650, 0.09), (40410, 0.23), (43384, 0.29), (22275, 0.22), (13438, 0.46), …]
Similar to the Bag of Words method, you can use and edit this basic code to vectorise your test and train datasets. Like above, there are many modifications you can use to experiment with different hypotheses you may have about how changing these features can change the behaviour of your Machine Learning models.
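A comparable sketch for TFIDF is shown below. Again, this is not the provided notebook code; the toy tweets and the particular settings in the variant (`ngram_range`, `sublinear_tf`) are assumptions picked only to show where such modifications would go.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the provided data; only the first tweet is from the handout.
train_tweets = [
    "if i didnt have you i'd never see the sun. #mtvstars lady gaga",
    "never see the rain again",
    "the sun is out and you are here",
]
test_tweets = ["you never see the moon"]

# Baseline: default settings (smoothed idf, each row scaled to unit L2 norm).
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_tweets)
X_test = tfidf.transform(test_tweets)  # same vocabulary and idf values as training

# Variant: add word bigrams and dampen raw term counts with 1 + log(tf).
tfidf_bigram = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
X_bigram = tfidf_bigram.fit_transform(train_tweets)

# Show the first tweet as (Term_ID, weight) pairs, as in the handout's example.
first_row = X_train[0].toarray().ravel()
pairs = [(term_id, round(float(w), 2)) for term_id, w in enumerate(first_row) if w]
print(pairs)
print(X_train.shape, X_bigram.shape)
```

Comparing a classifier trained on `X_train` with one trained on `X_bigram` is one way to test a hypothesis about how a feature change affects model behaviour.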
There are many other text vectorization methods that you can use (e.g. word2vec, BERT, etc.). You are welcome and encouraged to use as many vectorization methods as you choose. But please keep in mind that we are more