This assignment implements the PageRank algorithm from scratch in Python.

CSE 6242 / CX 4242: Data and Visual Analytics

HW 4: PageRank Algorithm, Random Forest, Scikit-learn

Homework Overview

Data analytics and machine learning both revolve around using computational models to capture relationships between variables and outcomes. In this assignment, you will code and fit a range of well-known models from scratch and learn to use a popular Python library for machine learning.

In Q1, you will implement the famous PageRank algorithm from scratch. PageRank can be thought of as a model for a system in which a person is surfing the web by choosing uniformly at random a link to click on at each successive webpage they visit. Assuming this is how we surf the web, what is the probability that we are on a particular webpage at any given moment? The PageRank algorithm assigns values to each webpage according to this probability distribution.

In Q2, you will implement Random Forests, a very common and widely successful classification model, from scratch. Random Forest classifiers also describe probability distributions: the conditional probability of a sample belonging to a particular class given some or all of its features.

Finally, in Q3, you will use the Python scikit-learn library to specify and fit a variety of supervised and unsupervised machine learning models.

Q1 [20 pts] Implementation of the PageRank Algorithm

Note: You must use Python 3.7.x for this question.

In this question, you will implement the PageRank algorithm in Python for a large graph network dataset.

The PageRank algorithm was first proposed to rank web pages in search results. The basic assumption is that more "important" web pages are referenced more often by other pages and thus are ranked higher. The algorithm works by considering the number and "importance" of links pointing to a page to estimate how important that page is. PageRank outputs a probability distribution over all web pages, representing the likelihood that a person randomly surfing the web (randomly clicking on links) would arrive at those pages.

As mentioned in the lectures, the PageRank values are the entries in the dominant eigenvector of the modified adjacency matrix in which each column's values add up to 1 (i.e., "column normalized"), and this eigenvector can be calculated by the power iteration method, which iterates through the graph's edges multiple times to update the nodes' PageRank values ("pr_values" in pagerank.py) in each iteration.
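The per-iteration update can be sketched as follows, assuming the standard damped PageRank formulation. The damping factor, convergence tolerance, and graph representation here are illustrative assumptions and may differ from what the pagerank.py skeleton expects:

```python
# Sketch of damped PageRank via power iteration (assumed formulation,
# not the exact pagerank.py interface). Graph: node -> list of out-links.
def pagerank(graph, damping=0.85, max_iters=100, tol=1e-6):
    nodes = list(graph)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}            # uniform starting values
    out_deg = {v: len(graph[v]) for v in nodes}
    for _ in range(max_iters):
        # every node gets the teleport share (1 - d) / n each iteration
        new_pr = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            if out_deg[u] == 0:
                # dangling node: spread its rank uniformly over all nodes
                for v in nodes:
                    new_pr[v] += damping * pr[u] / n
            else:
                share = damping * pr[u] / out_deg[u]
                for v in graph[u]:
                    new_pr[v] += share
        if max(abs(new_pr[v] - pr[v]) for v in nodes) < tol:
            pr = new_pr
            break
        pr = new_pr
    return pr
```

The values stay a probability distribution across iterations, so they sum to 1, and nodes with more (or more important) in-links end up with higher rank.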

Q2 [50 pts] Random Forest Classifier

Q2.1 – Random Forest Setup [45 pts]

Note: You must use Python 3.7.x for this question.

You will implement a random forest classifier in Python. The performance of the classifier will be evaluated via the out-of-bag (OOB) error estimate, using the provided dataset.
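As a reminder of how the OOB estimate works: each tree trains on a bootstrap sample, the records never drawn for that tree are its out-of-bag records, and the OOB error is the error rate when each record is voted on only by trees that did not see it during training. A minimal sketch of the sampling step (the helper name is illustrative, not part of the Q2 skeleton):

```python
import random

# Illustrative helper (not part of the Q2 skeleton): draw one bootstrap
# sample of n indices with replacement, and return the out-of-bag set,
# i.e. the indices that were never drawn; the corresponding tree can be
# evaluated on those records without having trained on them.
def bootstrap_indices(n, rng):
    chosen = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(chosen)
    return chosen, oob

# On average about 1 - 1/e (roughly 36.8%) of the records end up
# out-of-bag for any single tree.
```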

Note: You may only use the modules and libraries provided at the top of the .py files included in the skeleton for Q2 and modules from the Python Standard Library. Python wrappers (or modules) must NOT be used for this assignment. Pandas must NOT be used; while we understand that it is a useful library to learn, completing this question does not critically depend on its functionality. In addition, to make grading more manageable and to enable our TAs to provide better, more consistent support to our students, we have decided to restrict the libraries accordingly.

The dataset you will use is pima-indians-diabetes.csv, a comma-separated (csv) file in the Q2 folder. The dataset was derived from the National Institute of Diabetes and Digestive and Kidney Diseases. You must not modify the dataset. Each row describes one person (a data point, or data record) using 9 columns. The first 8 are attributes. The 9th is the label, and you must not treat it as an attribute. You will perform binary classification on the dataset to determine whether a person has diabetes.

Essential Reading

Decision Trees. To complete this question, you will develop a good understanding of how decision trees work. We recommend that you review the lecture on decision trees. Specifically, review how to construct decision trees using Entropy and Information Gain to select the splitting attribute and the split point for the selected attribute. These slides from CMU (also mentioned in the lecture) provide an excellent example of how to construct a decision tree using Entropy and Information Gain.
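The entropy and information-gain computations described above can be sketched as follows; the helper names are illustrative, not part of the Q2 skeleton:

```python
import math

# Illustrative helpers (names are assumptions, not the skeleton's API).
def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted child entropies."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

A split that separates the classes perfectly recovers the full parent entropy as gain; a split that leaves the class mix unchanged has zero gain.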

Random Forests. To refresh your memory about random forests, see Chapter 15 in the Elements of Statistical Learning book and the lecture on random forests. Here is a blog post that introduces random forests in a fun way, in layman's terms.

Q3 [30 pts] Using Scikit-Learn

Note: You must use Python 3.7.x and Scikit-Learn v0.22 for this question.

Scikit-learn is a popular Python library for machine learning. You will use it to train some classifiers to predict diabetes in the Pima Indian tribe. The dataset is provided in the Q3 folder as pima-indians-diabetes.csv.

Note: Your code must take no more than 15 minutes to execute all cells.

—————————————————————————————————————————-

For this problem, you will use a Jupyter notebook and submit a Python script file.

Note: Do not add any additional print statements to the notebook. You may add them for debugging, but make sure to remove any print statements that are not required before submitting.

—————————————————————————————————————————-

Q3.1 – Data Import [2 pts]

In this step, you will import the pima-indians-diabetes dataset and allocate the data to two separate arrays. After importing the dataset, you will split the data into a training and a test set using the scikit-learn function train_test_split. You will use scikit-learn's built-in machine learning algorithms to compute the accuracy on the training and test sets separately. Please refer to the hyperlinks provided below for each algorithm for more details, such as the concepts behind these classifiers and how to implement them.
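The import-and-split step can be sketched as follows. A synthetic array stands in for the csv data so the example is self-contained, and the test_size and random_state values are illustrative assumptions, not the graded settings:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data; in the assignment, x_data and
# y_data come from pima-indians-diabetes.csv (first 8 columns are the
# attributes, the 9th column is the label).
rng = np.random.default_rng(0)
x_data = rng.random((100, 8))            # attribute columns
y_data = rng.integers(0, 2, size=100)    # binary label column

# test_size and random_state here are illustrative assumptions
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=614, shuffle=True)
print(x_train.shape, x_test.shape)   # (70, 8) (30, 8)
```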

Q3.2 – Linear Regression Classifier [4 pts]

Q3.2.1 – Classification

Train the Linear Regression classifier on the dataset. You will provide the accuracy for both the test and train sets. Make sure that you round your predictions to a binary value of 0 or 1. See the Jupyter notebook for more information.
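A minimal sketch of the rounding step, using synthetic data in place of the diabetes arrays (the toy label rule is an assumption for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training arrays (illustrative assumption).
rng = np.random.default_rng(0)
x_train = rng.random((80, 8))
y_train = (x_train[:, 0] > 0.5).astype(int)   # toy binary label

# LinearRegression predicts continuous values, so round (and clip) the
# predictions to {0, 1} before computing classification accuracy.
reg = LinearRegression().fit(x_train, y_train)
pred = np.rint(reg.predict(x_train)).clip(0, 1)
train_acc = (pred == y_train).mean()
```

The same rounding would be applied to the test-set predictions before computing the test accuracy.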

Q3.3 – Random Forest Classifier [10 pts]

Q3.3.1 – Classification

Train the Random Forest classifier on the dataset. You will provide the accuracy for both the test and train sets. You are not required to round your predictions.
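A minimal sketch under the same synthetic-data assumption; RandomForestClassifier.predict() already returns class labels, which is why no rounding is needed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training arrays (illustrative assumption).
rng = np.random.default_rng(0)
x_train = rng.random((80, 8))
y_train = (x_train[:, 0] > 0.5).astype(int)   # toy binary label

# random_state here is an illustrative assumption, not the graded value.
rf = RandomForestClassifier(random_state=614).fit(x_train, y_train)
train_acc = rf.score(x_train, y_train)   # predict() yields 0/1 directly
```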

Q3.3.2 – Feature Importance

You have performed a simple classification task using the random forest algorithm, and you have also implemented the algorithm from scratch in Q2 above. The concept of information gain (entropy reduction) can also be used to evaluate the importance of a feature. In this section, you will determine the feature importance evaluated by the random forest classifier. Sort the features in descending order of feature importance score, and print the sorted feature numbers.

Hint: There is a function available in sklearn to achieve this. Also, take a look at the argsort() function in NumPy. argsort() returns the indices of the elements in ascending order. You will use the random forest classifier that you trained in Q3.3.1, without any kind of hyperparameter tuning, for reporting these features.
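A sketch of the descending sort, using toy data in which only one feature is informative (the data and random_state are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data where only feature 2 determines the label (an assumption for
# illustration; the assignment uses the trained Q3.3.1 classifier).
rng = np.random.default_rng(0)
X = rng.random((200, 8))
y = (X[:, 2] > 0.5).astype(int)

rf = RandomForestClassifier(random_state=614).fit(X, y)

# argsort() is ascending, so reverse with [::-1] to get descending order
order = np.argsort(rf.feature_importances_)[::-1]
print(order)   # feature 2 should come first for this toy data
```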
