Program #3, CS 5710: Data Mining (Updated: March 26, 2021)

Read Section 4.4 of the textbook and the CategoricalNB user guide to complete the following assignment. This program focuses on the Naïve Bayes classifier. You will replicate a reduced set of the functionality of the Scikit-Learn CategoricalNB classifier. Your classifier will use an 'alpha' hyperparameter to determine how much smoothing to apply (details in the User Guide); otherwise, the default hyperparameters will be used. Your implementation will have the same attributes as the Scikit-Learn version after fitting. You will implement all three predict methods, which use those attributes to classify the samples in an X matrix.
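
As a reference point, the CategoricalNB user guide estimates the probability of category $t$ of feature $i$ given class $c$, with alpha smoothing, as

$$P(x_i = t \mid y = c; \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha\, n_i}$$

where $N_{tic}$ is the number of training samples of class $c$ whose feature $i$ equals $t$, $N_c$ is the number of samples of class $c$, and $n_i$ is the number of categories of feature $i$.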

If you haven’t already done so, complete computer setup:

1 Using the Scikit-Learn Naïve Bayes Classifier for Text Classification

You can download a data set of Facebook posts from the Roatan Marine Park labeled as Information, Community, or Action:

1.1 Loading the Data

>>> import pandas as pd
>>>
>>> df = pd.read_csv('coding.csv')
>>> df.columns
Index(['Coding', 'Message'], dtype='object')
>>> df['Coding'].value_counts()
Community      543
Information    339
Action         203
Name: Coding, dtype: int64
>>> df.iloc[0]
Coding                                                Action
Message    We are asking every shop around the BAY ISLAND...
Name: 0, dtype: object
>>> len(df)
1085

1.2 Encoding Labels as Integers

>>> from sklearn.preprocessing import LabelEncoder
>>>
>>> le = LabelEncoder()
>>> y = le.fit_transform(df['Coding'])
>>> y
array([0, 2, 2, ..., 2, 1, 1])

>>> le.inverse_transform(y)
array(['Action', 'Information', 'Information', ..., 'Information',
       'Community', 'Community'], dtype=object)
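
LabelEncoder assigns integer codes in sorted order of the class names, so here Action maps to 0, Community to 1, and Information to 2. You can inspect the mapping directly:

>>> le.classes_
array(['Action', 'Community', 'Information'], dtype=object)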

1.3 Encoding Messages as Binary Vectors

>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> vectorizer = CountVectorizer(binary=True)
>>> x = vectorizer.fit_transform(df['Message']).toarray()
>>> x
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
>>> x.sum(axis=0)
array([4, 1, 1, ..., 3, 1, 1])
>>> x.sum(axis=1)
array([21, 81, 47, ..., 13,  4, 68])
>>> vectorizer.get_feature_names()[-10:]
['yummy', 'yves', 'zack', 'zamorano', 'zing', 'zolitur', 'zone', 'zones', 'zoo', 'zoom']
>>> len(x[0])
5337
>>> len(vectorizer.get_feature_names())
5337
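
With binary=True, the vectorizer records only the presence or absence of each vocabulary word, not its count. A small illustration on a made-up two-message corpus:

>>> toy = CountVectorizer(binary=True)
>>> toy.fit_transform(['the reef the reef', 'save the reef']).toarray()
array([[1, 0, 1],
       [1, 1, 1]])
>>> toy.get_feature_names()
['reef', 'save', 'the']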

1.4 Training the Categorical Naïve Bayes Classifier

>>> from sklearn.naive_bayes import CategoricalNB
>>>
>>> cnb = CategoricalNB(alpha=1.0)
>>> cnb.fit(x, y)

1.5 Checking the Classifier Attributes

Your classifier will be graded based on the values of its attributes:

>>> cnb.alpha
1.0
>>> cnb.n_features_
5337
>>> cnb.n_categories_
array([2, 2, 2, ..., 2, 2, 2])
>>> cnb.classes_
array([0, 1, 2])
>>> cnb.class_count_
array([203., 543., 339.])
>>> cnb.class_log_prior_
array([-1.67612929, -0.69222595, -1.16333516])
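
With the default fit_prior=True, class_log_prior_ is just the log of each class's relative frequency. A quick sanity check using the counts above:

>>> import numpy as np
>>> np.log(cnb.class_count_ / cnb.class_count_.sum())
array([-1.67612929, -0.69222595, -1.16333516])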

>>> # category count shown for the first 10 features only:
>>> cnb.category_count_[:10]
[array([[201.,   2.],
       [541.,   2.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [543.,   0.],
       [338.,   1.]]), array([[203.,   0.],
       [540.,   3.],
       [338.,   1.]]), array([[203.,   0.],
       [543.,   0.],
       [338.,   1.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[202.,   1.],
       [540.,   3.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]])]

>>> # feature log prob shown for the first 10 features only:
>>> cnb.feature_log_prob_[:10]
[array([[-1.47422817e-02, -4.22439769e+00],
       [-5.51979322e-03, -5.20217351e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-1.83654781e-03, -6.30078579e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-7.36651582e-03, -4.91449143e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-1.83654781e-03, -6.30078579e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-9.80400010e-03, -4.62986280e+00],
       [-7.36651582e-03, -4.91449143e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]])]
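
feature_log_prob_ is category_count_ with alpha smoothing applied, per the formula in the introduction. A sketch of the computation for the first feature (a minimal reconstruction, not the internal Scikit-Learn code):

>>> counts = cnb.category_count_[0]
>>> smoothed = counts + cnb.alpha
>>> np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
array([[-1.47422817e-02, -4.22439769e+00],
       [-5.51979322e-03, -5.20217351e+00],
       [-2.93685967e-03, -5.83188248e+00]])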

1.6 Checking the Results of its Prediction Methods

Your classifier will also be evaluated by its prediction methods, shown for the first 10 samples only:

>>> from sklearn.metrics import balanced_accuracy_score
>>> cnb.predict_proba(x[:10])
array([[2.46864675e-08, 9.99931480e-01, 6.84948211e-05],
       [8.71727403e-17, 1.05522867e-15, 1.00000000e+00],
       [1.90157124e-08, 9.03684150e-07, 9.99999077e-01],
       [3.72990939e-10, 9.99997797e-01, 2.20227313e-06],
       [9.03079066e-01, 8.48501366e-02, 1.20707979e-02],
       [1.55654960e-15, 1.55826100e-14, 1.00000000e+00],
       [1.08224470e-13, 9.99999990e-01, 9.79148592e-09],
       [3.01284575e-10, 9.99999962e-01, 3.81734871e-08],
       [5.65355094e-07, 9.99998536e-01, 8.98507626e-07],
       [1.16639745e-10, 9.99999999e-01, 6.89753793e-10]])
>>> cnb.predict_log_proba(x[:10])
array([[-1.75170106e+01, -6.85218551e-05, -9.58875242e+00],
       [-3.69786400e+01, -3.44850189e+01,  0.00000000e+00],
       [-1.77780002e+01, -1.39167859e+01, -9.22700281e-07],
       [-2.17094670e+01, -2.20264855e-06, -1.30260205e+01],
       [-1.01945171e-01, -2.46686868e+00, -4.41696614e+00],
       [-3.40963048e+01, -3.17926208e+01, -2.84217094e-14],
       [-2.98545689e+01, -9.79159154e-09, -1.84417526e+01],
       [-2.19229659e+01, -3.84747807e-08, -1.70811246e+01],
       [-1.43858118e+01, -1.46386380e-06, -1.39225306e+01],
       [-2.28719310e+01, -8.06394951e-10, -2.10946864e+01]])
>>> cnb.predict(x[:10])
array([1, 2, 2, 1, 0, 2, 1, 1, 1, 1])
>>> y_hat = cnb.predict(x)
>>> training_accuracy = balanced_accuracy_score(y, y_hat)
>>> training_accuracy
0.6411176828416015
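
All three prediction methods follow from the same joint log-likelihood: for each sample, add the class log prior to the log probability of each observed feature value, then normalize with the log-sum-exp trick. A minimal sketch that should reproduce the outputs above (jll is an illustrative name, not part of the API):

>>> from scipy.special import logsumexp
>>> jll = cnb.class_log_prior_ + sum(
...     cnb.feature_log_prob_[i][:, x[:10][:, i]].T
...     for i in range(cnb.n_features_))
>>> log_proba = jll - logsumexp(jll, axis=1, keepdims=True)
>>> cnb.classes_[np.argmax(log_proba, axis=1)]
array([1, 2, 2, 1, 0, 2, 1, 1, 1, 1])

predict_log_proba returns log_proba, predict_proba is np.exp(log_proba), and predict takes the argmax as shown.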

2 Implementing the Categorical Naïve Bayes Classifier

In a file named naive_bayes.py, you will implement the CategoricalNB class, including the methods __init__, fit, predict, predict_proba, and predict_log_proba. After calling the fit method, all of the following attributes must be set: alpha, n_features_, n_categories_, classes_, class_count_, class_log_prior_, category_count_, and feature_log_prob_. For more information on what each attribute means, read the CategoricalNB documentation.

# naive_bayes.py
import numpy as np


class CategoricalNB:
    def __init__(self, alpha=1.0):
        # Store the smoothing hyperparameter.
        pass

    def fit(self, x, y):
        # Set alpha, n_features_, n_categories_, classes_, class_count_,
        # class_log_prior_, category_count_, and feature_log_prob_.
        return self

    def predict_proba(self, x):
        pass

    def predict_log_proba(self, x):
        pass

    def predict(self, x):
        pass

3 Submitting on Web-CAT

Web-CAT will generate random data, train your classifier, train the Scikit-Learn CategoricalNB classifier, and check that the attribute values match and that the prediction methods return the same values.
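
Before submitting, you can run a comparison of the same shape locally. This is only an illustrative sketch; the random data, sizes, and tolerances below are assumptions, not the actual Web-CAT tests:

# check_naive_bayes.py
import numpy as np
from sklearn.naive_bayes import CategoricalNB as SkCategoricalNB
from naive_bayes import CategoricalNB

# Random binary features and three classes, mirroring the data above.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(50, 8))
y = rng.integers(0, 3, size=50)

mine = CategoricalNB(alpha=1.0).fit(x, y)
ref = SkCategoricalNB(alpha=1.0).fit(x, y)

assert np.allclose(mine.class_log_prior_, ref.class_log_prior_)
for a, b in zip(mine.feature_log_prob_, ref.feature_log_prob_):
    assert np.allclose(a, b)
assert np.allclose(mine.predict_proba(x), ref.predict_proba(x))
assert (mine.predict(x) == ref.predict(x)).all()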