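25 November 2023

Python assignment help | 301110 Applications of Big Data Assignment

This Python assignment uses big data analysis to classify the sentiment of movie reviews, covering the pipeline from feature extraction through to classification.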

301110 Applications of Big Data
Assignment
3 Task 1. Feature extraction (15 points)
Use the MapReduce model to convert all text data into matrices. Convert ratings
to vectors. These will be used for classification in Task 2. Use TF-IDF to vectorise
the text files. See previous practical classes and lecture materials for TF-IDF.
One step further though is to represent each text file (review) as a very long
and sparse vector as follows. Assume wordlist is the final list of distinct
words contained in all reviews and its length is N. Then each review will be a
vector of length N, with each position associated with a word in wordlist and
the value being either 0, if the corresponding word is absent from the review, or the
word's TF-IDF. For example, if wordlist = ['word1', 'word2', 'word3', 'word4'] and
review 1 contains word1 and word4, then the vector representation of review 1
is [0.1, 0, 0, 0.4], assuming the TF-IDF of word1 and word4 in review 1 is 0.1 and 0.4
respectively. Note that TF is calculated from one single document while IDF is
obtained from all documents in the collection.
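The vector representation described above can be sketched in plain Python. This is only an illustration on an in-memory toy corpus, not the required MapReduce implementation. It assumes one particular TF-IDF variant (TF = count / review length, IDF = log(number of reviews / document frequency)); the function name `tfidf_vectors` and the sample reviews are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(reviews):
    """Return (wordlist, vectors) for a list of tokenised reviews."""
    # Document frequency: number of reviews that contain each word.
    df = Counter(word for tokens in reviews for word in set(tokens))
    wordlist = sorted(df)
    n_docs = len(reviews)
    vectors = []
    for tokens in reviews:
        counts = Counter(tokens)
        row = []
        for word in wordlist:
            if counts[word] == 0:
                row.append(0.0)  # word absent from this review
            else:
                tf = counts[word] / len(tokens)        # from this review only
                idf = math.log(n_docs / df[word])      # from the whole collection
                row.append(tf * idf)
        vectors.append(row)
    return wordlist, vectors

# Hypothetical toy corpus: three already-tokenised reviews.
reviews = [["word1", "word4"], ["word2", "word3", "word4"], ["word1", "word2"]]
wordlist, vectors = tfidf_vectors(reviews)
```

Each row of `vectors` has one entry per word in `wordlist`, with zeros at the positions of absent words.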
3.1 Requirements:
3.1.1 Req. 1
A MapReduce model is a must. Implement it using Hadoop streaming. All data
are available on SCEM HDFS. The recommendation is to work on the tiny version
of the data to make the code work. You may try your code on the full version.
However, applying it to the full version is not required.
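A Hadoop streaming job is a pair of programs that read lines from standard input and write tab-separated key/value lines to standard output. The sketch below shows one possible first stage, counting how often each word occurs in each review (the raw term counts that TF is computed from). The composite key format `word|doc_id`, the function names, and the `map`/`reduce` command-line switch are assumptions made for this example. The environment variable `mapreduce_map_input_file` is how recent Hadoop versions are expected to expose the current input file to streaming tasks, but this should be checked on the cluster.

```python
import os
import sys
from itertools import groupby

def map_lines(lines, doc_id):
    """Mapper: emit 'word|doc_id<TAB>1' for every token in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}|{doc_id}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts of each key; input must be sorted by key."""
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(record[1]) for record in group)
        yield f"{key}\t{total}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        # Assumption: Hadoop streaming sets this variable to the input file path.
        path = os.environ.get("mapreduce_map_input_file", "unknown")
        doc_id = os.path.basename(path)
        sys.stdout.writelines(out + "\n" for out in map_lines(sys.stdin, doc_id))
    elif role == "reduce":
        sys.stdout.writelines(out + "\n" for out in reduce_lines(sys.stdin))
```

Because both stages are ordinary stream filters, the logic can be tried locally on the tiny data set by piping a review through the mapper, `sort`, and the reducer before submitting it as a streaming job.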
3.1.2 Req. 2
Generate two matrices, training_data and test_data, and two vectors,
training_targets and test_targets. training_data should have N rows and D columns,
with each row corresponding to one review in the training set (here N is the total
number of reviews in the training set and D is the total number of words, i.e. the
length of wordlist). N and D vary depending on which version of the data you use.
training_targets should have N elements, each of which is the rating of the
corresponding review. test_data and test_targets are similarly defined.
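One way to assemble such a matrix from MapReduce output is sketched below. It assumes the TF-IDF job has already produced `(doc_id, word, tfidf)` triples; that record format, the function name `build_matrix`, and the sample values are hypothetical. A dense list of lists is used for readability only. On the full data set a sparse structure (for example from `scipy.sparse`) would probably be needed.

```python
def build_matrix(triples, doc_ids, wordlist):
    """Place (doc_id, word, tfidf) triples into a dense N x D list of lists."""
    row_of = {doc: i for i, doc in enumerate(doc_ids)}
    col_of = {word: j for j, word in enumerate(wordlist)}
    matrix = [[0.0] * len(wordlist) for _ in doc_ids]
    for doc, word, value in triples:
        # Words outside wordlist (e.g. test-set words unseen in training) are skipped.
        if word in col_of:
            matrix[row_of[doc]][col_of[word]] = value
    return matrix

# Hypothetical reducer output for two reviews.
triples = [("a.txt", "good", 0.5), ("b.txt", "bad", 0.2), ("b.txt", "rare", 0.9)]
training_data = build_matrix(triples, ["a.txt", "b.txt"], ["good", "bad"])
```

Using the same `wordlist` for the training and test matrices keeps the columns aligned, so that a classifier trained on one can be applied to the other.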
Notes:
1. If feature extraction is too difficult for you, you can use the pre-computed
bag of words features included in this data set. Refer to the appendix and
README file for details. If pre-computed features are used, a 60% penalty
will be incurred for this task, i.e. the maximum marks you can get from
this task is 6 if you do so.
2. Using a MapReduce model to extract TF-IDF is mandatory. If not used, a
20% penalty for this task will be incurred. There is no constraint on how to
form the training and test matrices and vectors. There are many versions
of TF-IDF. There is no preference for which version to use.
3. You can use data frame (using pandas package) instead of matrices and
vectors to store training and test data and targets.
Marking scheme for task 1:
• Text file reading (1pt): read the text files for TF-IDF extraction.
• Rating scores extraction (3pts): parse the names of the text files to extract ratings.
• TF-IDF extraction (8pts): use the MapReduce class to extract TF-IDF for each
text file.
• Forming matrices and target vectors (or data frames) (3pts): collect TF-IDFs
to form training and test data for task 2.
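The rating extraction step in the marking scheme can be sketched as follows. The file naming convention is not stated in this text, so the pattern `<id>_<rating>.txt` used here is purely an assumption to be checked against the data set's README; the function name and sample path are also hypothetical.

```python
import os
import re

# Assumed naming scheme: "<review id>_<rating>.txt", e.g. "12_9.txt".
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")

def rating_from_filename(path):
    """Return the integer rating encoded in a review file name."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(2))

rating = rating_from_filename("train/pos/12_9.txt")  # hypothetical path
```

Raising an error on names that do not match makes a wrong assumption about the naming scheme visible early, instead of silently producing bad targets.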
4 Task 2. Classification (15 points)
Construct a classification model for review sentiment prediction, meaning that:
given a customer movie review (taken from the test set), your program should be
able to predict whether it is positive or negative.
There is no limitation on how many classifiers and what specific model you
should use. You can simply pick one that works for you for this task, either
from those covered in lecture and practical class materials or any other
classifiers from any Python packages. A good starting point is the scikit-learn (i.e.
sklearn) package.
A few things you need to address in your python program are listed as requirements below.
4.1 Requirements:
4.1.1 Req. 1
Data pre-processing. In task 1, you extracted the ratings vectors for training
and test. These are raw ratings. As we are interested in sentiment prediction,
i.e. predicting whether the review is positive or negative, you need to convert all
ratings >5 to the positive class and all ratings <=5 to the negative class. Choose a
coding scheme, e.g. 0 for positive, 1 for negative.
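A minimal sketch of this conversion, using the example coding scheme (0 for positive, 1 for negative). The function name and the sample ratings are illustrative only.

```python
POSITIVE, NEGATIVE = 0, 1  # example coding scheme: 0 = positive, 1 = negative

def to_sentiment(ratings):
    """Map raw ratings to class labels: >5 is positive, <=5 is negative."""
    return [POSITIVE if rating > 5 else NEGATIVE for rating in ratings]

labels = to_sentiment([1, 5, 6, 10])
```

The same function should be applied to both the training and the test targets so that the two label vectors use one coding scheme.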
4.1.2 Req. 2
Normalisation. Apply at least one normalisation scheme and compare the performance of the classifier(s) with and without normalisation.
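As one possible normalisation scheme, each review vector can be scaled to unit Euclidean (L2) length. The sketch below is a plain-Python illustration with a made-up function name; scikit-learn's `sklearn.preprocessing` module offers ready-made alternatives, and the assignment does not prescribe any particular scheme.

```python
import math

def l2_normalise(matrix):
    """Scale every row to unit Euclidean length; all-zero rows are left unchanged."""
    normalised = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        normalised.append([value / norm for value in row] if norm > 0 else list(row))
    return normalised

scaled = l2_normalise([[3.0, 4.0], [0.0, 0.0]])
```

For the comparison, the same classifier can be trained once on the raw matrix and once on the normalised matrix, and the two cross-validation accuracies reported side by side.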
4.1.3 Req. 3
Training and model selection. Use cross validation to select the best parameters
for your classifier. There may be many parameters to tune in some classifiers
(such as the random forest classifier, RFC). You can focus on the most important
one(s) such as max_depth and n_estimators in RFC. Refer to the scikit-learn
package documentation for details.
Hint: you can start with a small subset of the training set to test a few parameters
to get a feel for what range the parameters should be in to make the model perform
well in terms of prediction accuracy. Then turn on large scale cross validation on
the whole training set.
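A small sketch of parameter selection by cross validation with scikit-learn's `GridSearchCV`. The data here is synthetic (generated with `make_classification`) and merely stands in for the TF-IDF matrix and sentiment labels; the grid values, fold count, and split size are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and the 0/1 sentiment labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid over the two parameters named in the text.
param_grid = {"max_depth": [3, None], "n_estimators": [10, 50]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(X_train, y_train)  # cross validation uses the training data only

best_params = search.best_params_
# The best model is refitted on all training data and scored once on held-out data.
test_accuracy = search.score(X_test, y_test)
```

Following the hint, the grid can first be run on a small subset of the training rows to find a sensible range, then widened or refined on the whole training set.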
4.1.4 Req. 4
Test on test data. After model selection, apply the best model, i.e. the model with
the parameters that produce the best cross validation scores, to test data, make
a prediction for each review, and record the prediction accuracy.
Note:
1. Always train your classifier(s) ONLY on training data, including cross validation. After model selection, apply the best model on test data to evaluate
the performance.
2. Good performance, i.e. higher accuracy on test data, is not essential for
this task. However, if your classifier has accuracy lower than about 60%, it
usually means that there are some mistakes somewhere in your code. So
try to score as high an accuracy as possible.
3. You are encouraged to try many classifiers. If the coding is right, this
should not be too difficult.
Marking scheme for task 2:
• Data pre-processing (1pt): convert ratings to positive and negative coding
scheme.
• Normalisation and comparison (3pts): apply normalisation and compare
performance difference with and without it.
