For Assignment 1, we will use a new corpus, the “A Million News Headlines” corpus, covering the news headlines published by the Australian news source ABC (Australian Broadcasting Corporation, http://www.abc.net.au) over a period of 19 years. The data can be accessed from the following Kaggle page: https://www.kaggle.com/datasets/therohk/million-headlines.
You can also learn more about this dataset, and even find some coding examples, on the same page. Please use this data to complete the following tasks:
- Train word embeddings on this corpus using word2vec, and perform a sentiment analysis based on the word embeddings and a “positivity” vector. We construct this vector in the same way as Luca Bellodi (2022):
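$$
\overrightarrow{\text{positivity}} = \overrightarrow{\text{success}} + \overrightarrow{\text{good}} + \overrightarrow{\text{happy}} + \overrightarrow{\text{perfect}} + \overrightarrow{\text{important}} + \overrightarrow{\text{worth}} + \overrightarrow{\text{rich}} - \overrightarrow{\text{failure}} - \overrightarrow{\text{bad}} - \overrightarrow{\text{sad}} - \overrightarrow{\text{terrible}} - \overrightarrow{\text{bad}} - \overrightarrow{\text{regret}} - \overrightarrow{\text{poor}}
$$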
- Use whatever pre-processing steps you see fit;
- Decide on the number of dimensions, the number of iterations, and which model you would like to train;
- Choose a reasonable distance (or similarity) measure;
- Find a reasonable way to aggregate the sentiment scores for each word to the document level.
- Plot the article-level sentiment scores by year-month.
- Try to construct sentiment scores toward different countries or international organizations, such as “US”, “UK”, “Russia”, “Iran”, “NATO”, and “UN”.
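
The tasks above could be approached roughly as in the following Python sketch. It is only one possible design, not a required solution. It assumes gensim (version 4 or later) for training, skip-gram with 100 dimensions and 5 epochs as placeholder settings, cosine similarity as the similarity measure, and a simple mean over in-vocabulary tokens as the document-level aggregate. All function names are hypothetical, and the `YYYYMMDD` date format is an assumption about how the Kaggle file stores publication dates. The seed words are copied from the formula as given, including the repeated "bad".

```python
import math
from collections import defaultdict

# Seed words copied from the positivity formula as given
# (note that "bad" appears twice there).
POSITIVE = ["success", "good", "happy", "perfect", "important", "worth", "rich"]
NEGATIVE = ["failure", "bad", "sad", "terrible", "bad", "regret", "poor"]


def train_embeddings(tokenized_headlines, vector_size=100, epochs=5):
    """Train skip-gram word2vec with gensim >= 4 (third-party, imported lazily).

    The hyperparameters are placeholders; choosing them is part of the task.
    """
    from gensim.models import Word2Vec
    model = Word2Vec(sentences=tokenized_headlines, vector_size=vector_size,
                     window=5, min_count=5, sg=1, epochs=epochs)
    return model.wv  # supports `word in wv` and `wv[word]`


def positivity_vector(wv):
    """Sum the positive seed vectors and subtract the negative ones."""
    total = None
    for sign, words in ((1.0, POSITIVE), (-1.0, NEGATIVE)):
        for word in words:
            if word not in wv:
                continue  # seed word missing from the vocabulary
            vec = [sign * float(x) for x in wv[word]]
            total = vec if total is None else [t + v for t, v in zip(total, vec)]
    if total is None:
        raise ValueError("no seed word found in the vocabulary")
    return total


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm = (math.sqrt(sum(float(x) ** 2 for x in a))
            * math.sqrt(sum(float(y) ** 2 for y in b)))
    return dot / norm if norm else 0.0


def headline_score(tokens, wv, positivity):
    """Mean cosine similarity of in-vocabulary tokens to the positivity vector."""
    sims = [cosine(wv[t], positivity) for t in tokens if t in wv]
    return sum(sims) / len(sims) if sims else None


def monthly_means(dated_headlines, wv, positivity):
    """Average headline scores by year-month.

    `dated_headlines` yields (date, tokens) pairs, with date as 'YYYYMMDD'.
    """
    buckets = defaultdict(list)
    for date, tokens in dated_headlines:
        score = headline_score(tokens, wv, positivity)
        if score is not None:
            buckets[f"{date[:4]}-{date[4:6]}"].append(score)
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}
```

The dictionary returned by `monthly_means` can be passed to any plotting library for the year-month plot. For the country-level scores, one simple option is to call `monthly_means` only on headlines whose tokens include a target term such as "russia" or "nato"; which surface forms to match (for example "us" versus "usa") is a pre-processing decision left to you.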