CSE 6242 / CX 4242: Data and Visual Analytics
HW 4: PageRank Algorithm, Random Forest, Scikit-learn
Data analytics and machine learning both revolve around using computational models to capture relationships
between variables and outcomes. In this assignment, you will code and fit a range of well-known models from
scratch and learn to use a popular Python library for machine learning.
In Q1, you will implement the famous PageRank algorithm from scratch. PageRank can be thought of as a
model for a system in which a person is surfing the web by choosing uniformly at random a link to click on at
each successive webpage they visit. Assuming this is how we surf the web, what is the probability that we
are on a particular webpage at any given moment? The PageRank algorithm assigns values to each webpage
according to this probability distribution.
In Q2, you will implement Random Forests, a very common and widely successful classification model, from
scratch. Random Forest classifiers also describe probability distributions—the conditional probability of a
sample belonging to a particular class given some or all of its features.
Finally, in Q3, you will use the Python scikit-learn library to specify and fit a variety of supervised and
unsupervised machine learning models.
Q1 [20 pts] Implementation of the PageRank Algorithm
Note: You must use Python 3.7.x for this question.
In this question, you will implement the PageRank algorithm in Python for a large graph network dataset.
The PageRank algorithm was first proposed to rank web pages in search results. The basic assumption is
that more “important” web pages are referenced more often by other pages and thus are ranked higher. The
algorithm works by considering the number and “importance” of links pointing to a page, to estimate how
important that page is. PageRank outputs a probability distribution over all web pages, representing the
likelihood that a person randomly surfing the web (randomly clicking on links) would arrive at those pages.
As mentioned in the lectures, the PageRank values are the entries in the dominant eigenvector of the modified adjacency matrix in which each column's values add up to 1 (i.e., "column normalized"). This eigenvector can be calculated by the power iteration method, which iterates through the graph's edges multiple times to update the nodes' PageRank values ("pr_values" in pagerank.py) in each iteration:

    PR(v) = (1 - d) / n + d * sum of PR(u) / out_degree(u) over all nodes u linking to v

where d is the damping factor and n is the number of nodes in the graph.
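To make the iteration concrete, here is a minimal power-iteration sketch over an edge list. The function and variable names are illustrative only (they are not the skeleton's API), and for simplicity it assumes every node has at least one outgoing edge:

```python
def pagerank(edges, num_nodes, damping=0.85, max_iters=20):
    """Power iteration over an edge list of (source, target) pairs.

    Assumes every node has at least one outgoing edge (no dangling nodes).
    """
    out_degree = [0] * num_nodes
    for u, _ in edges:
        out_degree[u] += 1

    # Start from the uniform distribution.
    pr_values = [1.0 / num_nodes] * num_nodes
    for _ in range(max_iters):
        # Teleportation term, identical for every node.
        new_pr = [(1.0 - damping) / num_nodes] * num_nodes
        # Each node distributes its current value evenly over its out-links.
        for u, v in edges:
            new_pr[v] += damping * pr_values[u] / out_degree[u]
        pr_values = new_pr
    return pr_values
```

In each pass, every edge (u, v) contributes a share of u's current value to v, which is exactly the "iterate through the graph's edges multiple times" step described above.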
Q2 [50 pts] Random Forest Classifier
Q2.1 – Random Forest Setup [45 pts]
Note: You must use Python 3.7.x for this question.
You will implement a random forest classifier in Python. The performance of the classifier will be evaluated
via the out-of-bag (OOB) error estimate, using the provided dataset.
Note: You may only use the modules and libraries provided at the top of the .py files included in the skeleton for Q2, plus modules from the Python Standard Library. Python wrapper libraries (or modules) must NOT be used for this assignment. Pandas must NOT be used: while we understand it is a useful library to learn, completing this question is not critically dependent on its functionality. In addition, to make grading more manageable and to enable our TAs to provide better, more consistent support to our students, we have decided to restrict the libraries accordingly.
The dataset you will use is pima-indians-diabetes.csv, a comma-separated values (CSV) file in the Q2 folder. The dataset was derived from the National Institute of Diabetes and Digestive and Kidney Diseases. You must not modify the dataset. Each row describes one person (a data point, or data record) using 9 columns. The first 8 are attributes. The 9th is the label, and you must not treat it as an attribute.
You will perform binary classification on the dataset to determine if a person has diabetes.
Essential Decision Trees. To complete this question, you will develop a good understanding of how decision trees work. We recommend that you review the lecture on decision trees. Specifically, review how to construct decision trees using Entropy and Information Gain to select the splitting attribute and the split point for the selected attribute. These slides from CMU (also mentioned in the lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain.
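As a concrete illustration of these two quantities, here is a small self-contained sketch of entropy and information gain for a binary split (the helper names are our own, not part of the skeleton):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy reduction from splitting the parent into left/right partitions."""
    n = len(parent_labels)
    weighted_child_entropy = ((len(left_labels) / n) * entropy(left_labels)
                              + (len(right_labels) / n) * entropy(right_labels))
    return entropy(parent_labels) - weighted_child_entropy
```

For example, a perfectly balanced parent has entropy 1.0, and a split that separates the classes completely achieves an information gain of 1.0; your tree builder would pick the attribute and split point maximizing this gain.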
Random Forests. To refresh your memory about random forests, see Chapter 15 in the Elements of
Statistical Learning book and the lecture on random forests. Here is a blog post that introduces random
forests in a fun way, in layman’s terms.
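The OOB estimate mentioned earlier rests on bootstrap sampling: each tree is trained on a sample of rows drawn with replacement, and the rows a tree never saw (about 36.8% on average) form its out-of-bag set, usable as a built-in validation set. A minimal sketch (the function name is illustrative, not the skeleton's API):

```python
import random

def bootstrap_sample(n, rng=random):
    """Draw n row indices with replacement; the never-drawn rows are out-of-bag."""
    chosen = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(chosen)
    return chosen, oob
```

Each tree then trains on `chosen` and is evaluated only on its own `oob` rows; aggregating those per-row votes across trees yields the OOB error estimate.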
Q3 [30 points] Using Scikit-Learn
Note: You must use Python 3.7.x and Scikit-Learn v0.22 for this question.
Scikit-learn is a popular Python library for machine learning. You will use it to train some classifiers to predict diabetes in the Pima Indian tribe. The dataset is provided in the Q3 folder as pima-indians-diabetes.csv.
Note: Your code must take no more than 15 minutes to execute all cells.
For this problem you will be utilizing a Jupyter notebook and submitting a Python script file.
Note: Do not add any additional print statements to the notebook. You may add them for debugging, but please make sure to remove any print statements that are not required before submitting.
Q3.1 – Data Import [2 pts]
In this step, you will import the pima-indians-diabetes dataset and allocate the data to two separate arrays. After importing the dataset, you will split the data into training and test sets using the scikit-learn function train_test_split. You will use scikit-learn's built-in machine learning algorithms to report the accuracy on the training and test sets separately. Please refer to the hyperlinks provided below for each algorithm for more details, such as the concepts behind these classifiers and how to implement them.
Q3.2 – Linear Regression Classifier [4 pts]
Q3.2.1 – Classification
Train the Linear Regression classifier on the dataset. You will provide the accuracy for both the test
and train sets. Make sure that you round your predictions to a binary value of 0 or 1. See the Jupyter
notebook for more information.
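One possible sketch of this train-then-round approach follows; the function and variable names are our assumptions rather than the notebook's required API, and the clipping step is an extra safeguard to keep rounded predictions strictly binary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

def linreg_accuracy(x_train, x_test, y_train, y_test):
    """Fit linear regression, round predictions to 0/1, report both accuracies."""
    reg = LinearRegression().fit(x_train, y_train)
    # Regression predictions are continuous, so round (and clip) to 0 or 1.
    pred_train = np.clip(np.round(reg.predict(x_train)), 0, 1)
    pred_test = np.clip(np.round(reg.predict(x_test)), 0, 1)
    return (accuracy_score(y_train, pred_train),
            accuracy_score(y_test, pred_test))
```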
Q3.3 – Random Forest Classifier [10 pts]
Q3.3.1 – Classification
Train the Random Forest classifier on the dataset. You will provide the accuracy for both the test and train sets. You are not required to round your predictions.
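A minimal sketch of this step is below. The random_state value is an arbitrary assumption for reproducibility; use whatever the notebook specifies. No rounding is needed because predict() already returns class labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def rf_accuracy(x_train, x_test, y_train, y_test, seed=614):
    """Fit a random forest and report train/test classification accuracy."""
    rf = RandomForestClassifier(random_state=seed).fit(x_train, y_train)
    # predict() returns discrete class labels, so no rounding is required.
    return (accuracy_score(y_train, rf.predict(x_train)),
            accuracy_score(y_test, rf.predict(x_test)))
```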
Q3.3.2 – Feature Importance
You have performed a simple classification task using the random forest algorithm, and you have also implemented the algorithm from scratch in Q2 above. The concept of information gain can also be used to evaluate the importance of a feature. In this section, you will determine the feature importance as evaluated by the random forest classifier. Sort the features in descending order of feature importance score, and print the sorted features' numbers.
Hint: There is a function available in sklearn to achieve this. Also, take a look at the argsort() function in NumPy: argsort() returns the indices that would sort an array in ascending order. You will use the random forest classifier that you trained initially in Q3.3.1, without any kind of hyperparameter tuning, for reporting these features.
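A small sketch of the descending sort is below. A fitted sklearn forest exposes its scores via the feature_importances_ attribute; the helper name here is our own:

```python
import numpy as np

def sorted_feature_indices(importances):
    """Return feature indices sorted by importance, most important first."""
    # np.argsort sorts ascending, so reverse the result for descending order.
    return np.argsort(importances)[::-1]
```

For example, with the (assumed) classifier name rfc from Q3.3.1, you would print sorted_feature_indices(rfc.feature_importances_).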