Program #3, CS 5710: Data Mining (Updated: March 26, 2021)

Read Section 4.4 of the textbook and the CategoricalNB user guide to complete the following assignment. This program focuses on the Naïve Bayes classifier. You will replicate a reduced set of the functionality of the Scikit-Learn CategoricalNB classifier. Your classifier will use an 'alpha' hyperparameter to determine how much smoothing to apply (details in the User Guide); otherwise, the default hyperparameters will be used. Your implementation will have the same attributes as the Scikit-Learn version after fitting. You will implement all three predict methods, which use those attributes to classify the samples in an X matrix.
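
As a reference point, the CategoricalNB user guide estimates the probability of category $t$ of feature $i$ given class $c$, with alpha smoothing, as

$$P(x_i = t \mid y = c; \alpha) = \frac{N_{tic} + \alpha}{N_c + \alpha\, n_i}$$

where $N_{tic}$ is the number of training samples of class $c$ whose feature $i$ equals $t$, $N_c$ is the number of samples of class $c$, and $n_i$ is the number of categories of feature $i$.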

If you haven’t already done so, complete computer setup:

1 Using the Scikit-Learn Naïve Bayes Classifier for Text Classification

You can download a data set of Facebook posts from the Roatan Marine Park labeled as Information, Community, or Action:

1.1 Loading the Data

>>> import pandas as pd
>>>
>>> df = pd.read_csv('coding.csv')
>>> df.columns
Index(['Coding', 'Message'], dtype='object')
>>> df['Coding'].value_counts()
Community      543
Information    339
Action         203
Name: Coding, dtype: int64
>>> df.iloc[0]
Coding                                                Action
Message    We are asking every shop around the BAY ISLAND...
Name: 0, dtype: object
>>> len(df)
1085

1.2 Encoding Labels as Integers

>>> from sklearn.preprocessing import LabelEncoder
>>>
>>> le = LabelEncoder()
>>> y = le.fit_transform(df['Coding'])
>>> y
array([0, 2, 2, ..., 2, 1, 1])

>>> le.inverse_transform(y)
array(['Action', 'Information', 'Information', ..., 'Information',
       'Community', 'Community'], dtype=object)
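
LabelEncoder assigns integer codes in sorted order of the class names, so here Action maps to 0, Community to 1, and Information to 2. You can inspect the mapping directly:

>>> le.classes_
array(['Action', 'Community', 'Information'], dtype=object)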

1.3 Encoding Messages as Binary Vectors

>>> from sklearn.feature_extraction.text import CountVectorizer
>>>
>>> vectorizer = CountVectorizer(binary=True)
>>> x = vectorizer.fit_transform(df['Message']).toarray()
>>> x
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
>>> x.sum(axis=0)
array([4, 1, 1, ..., 3, 1, 1])
>>> x.sum(axis=1)
array([21, 81, 47, ..., 13,  4, 68])
>>> vectorizer.get_feature_names()[-10:]
['yummy', 'yves', 'zack', 'zamorano', 'zing', 'zolitur', 'zone', 'zones', 'zoo', 'zoom']
>>> len(x[0])
5337
>>> len(vectorizer.get_feature_names())
5337
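
With binary=True, the vectorizer records only the presence or absence of each vocabulary word, not its count. A small illustration on a made-up two-message corpus:

>>> toy = CountVectorizer(binary=True)
>>> toy.fit_transform(['the reef the reef', 'save the reef']).toarray()
array([[1, 0, 1],
       [1, 1, 1]])
>>> toy.get_feature_names()
['reef', 'save', 'the']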

1.4 Training the Categorical Naïve Bayes Classifier

>>> from sklearn.naive_bayes import CategoricalNB
>>>
>>> cnb = CategoricalNB(alpha=1.0)
>>> cnb.fit(x, y)

1.5 Checking the Classifier Attributes

Your classifier will be graded based on the values of its attributes:

>>> cnb.alpha
1.0
>>> cnb.n_features_
5337
>>> cnb.n_categories_
array([2, 2, 2, ..., 2, 2, 2])
>>> cnb.classes_
array([0, 1, 2])
>>> cnb.class_count_
array([203., 543., 339.])
>>> cnb.class_log_prior_
array([-1.67612929, -0.69222595, -1.16333516])
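
With the default fit_prior=True, class_log_prior_ is just the log of each class's relative frequency. A quick sanity check using the counts above:

>>> import numpy as np
>>> np.log(cnb.class_count_ / cnb.class_count_.sum())
array([-1.67612929, -0.69222595, -1.16333516])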

>>> # category count shown for the first 10 features only:
>>> cnb.category_count_[:10]
[array([[201.,   2.],
       [541.,   2.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [543.,   0.],
       [338.,   1.]]), array([[203.,   0.],
       [540.,   3.],
       [338.,   1.]]), array([[203.,   0.],
       [543.,   0.],
       [338.,   1.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[202.,   1.],
       [540.,   3.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]]), array([[203.,   0.],
       [542.,   1.],
       [339.,   0.]])]

>>> # feature log prob shown for the first 10 features only:
>>> cnb.feature_log_prob_[:10]
[array([[-1.47422817e-02, -4.22439769e+00],
       [-5.51979322e-03, -5.20217351e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-1.83654781e-03, -6.30078579e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-7.36651582e-03, -4.91449143e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-1.83654781e-03, -6.30078579e+00],
       [-5.88236990e-03, -5.13873530e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-9.80400010e-03, -4.62986280e+00],
       [-7.36651582e-03, -4.91449143e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]]), array([[-4.88998529e-03, -5.32300998e+00],
       [-3.67647473e-03, -5.60763861e+00],
       [-2.93685967e-03, -5.83188248e+00]])]
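
feature_log_prob_ is category_count_ with alpha smoothing applied, per the formula in the introduction. A sketch of the computation for the first feature (a minimal reconstruction, not the internal Scikit-Learn code):

>>> counts = cnb.category_count_[0]
>>> smoothed = counts + cnb.alpha
>>> np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
array([[-1.47422817e-02, -4.22439769e+00],
       [-5.51979322e-03, -5.20217351e+00],
       [-2.93685967e-03, -5.83188248e+00]])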

1.6 Checking the Results of its Prediction Methods

Your classifier will also be evaluated by its prediction methods, shown for the first 10 samples only:

>>> from sklearn.metrics import balanced_accuracy_score
>>> cnb.predict_proba(x[:10])
array([[2.46864675e-08, 9.99931480e-01, 6.84948211e-05],
       [8.71727403e-17, 1.05522867e-15, 1.00000000e+00],
       [1.90157124e-08, 9.03684150e-07, 9.99999077e-01],
       [3.72990939e-10, 9.99997797e-01, 2.20227313e-06],
       [9.03079066e-01, 8.48501366e-02, 1.20707979e-02],
       [1.55654960e-15, 1.55826100e-14, 1.00000000e+00],
       [1.08224470e-13, 9.99999990e-01, 9.79148592e-09],
       [3.01284575e-10, 9.99999962e-01, 3.81734871e-08],
       [5.65355094e-07, 9.99998536e-01, 8.98507626e-07],
       [1.16639745e-10, 9.99999999e-01, 6.89753793e-10]])
>>> cnb.predict_log_proba(x[:10])
array([[-1.75170106e+01, -6.85218551e-05, -9.58875242e+00],
       [-3.69786400e+01, -3.44850189e+01,  0.00000000e+00],
       [-1.77780002e+01, -1.39167859e+01, -9.22700281e-07],
       [-2.17094670e+01, -2.20264855e-06, -1.30260205e+01],
       [-1.01945171e-01, -2.46686868e+00, -4.41696614e+00],
       [-3.40963048e+01, -3.17926208e+01, -2.84217094e-14],
       [-2.98545689e+01, -9.79159154e-09, -1.84417526e+01],
       [-2.19229659e+01, -3.84747807e-08, -1.70811246e+01],
       [-1.43858118e+01, -1.46386380e-06, -1.39225306e+01],
       [-2.28719310e+01, -8.06394951e-10, -2.10946864e+01]])
>>> cnb.predict(x[:10])
array([1, 2, 2, 1, 0, 2, 1, 1, 1, 1])
>>> y_hat = cnb.predict(x)
>>> training_accuracy = balanced_accuracy_score(y, y_hat)
>>> training_accuracy
0.6411176828416015
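
All three prediction methods follow from the same joint log-likelihood: for each sample, add the class log prior to the log probability of each observed feature value, then normalize with the log-sum-exp trick. A minimal sketch that should reproduce the outputs above (jll is an illustrative name, not part of the API):

>>> from scipy.special import logsumexp
>>> jll = cnb.class_log_prior_ + sum(
...     cnb.feature_log_prob_[i][:, x[:10][:, i]].T
...     for i in range(cnb.n_features_))
>>> log_proba = jll - logsumexp(jll, axis=1, keepdims=True)
>>> cnb.classes_[np.argmax(log_proba, axis=1)]
array([1, 2, 2, 1, 0, 2, 1, 1, 1, 1])

predict_log_proba returns log_proba, predict_proba is np.exp(log_proba), and predict takes the argmax as shown.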

2 Implementing the Categorical Naïve Bayes Classifier

In a file named naive_bayes.py, you will implement the CategoricalNB class, including the methods __init__, fit, predict, predict_proba, and predict_log_proba. After calling the fit method, all of the following attributes must be set: alpha, n_features_, n_categories_, classes_, class_count_, class_log_prior_, category_count_, and feature_log_prob_. For more information on what each attribute means, read the CategoricalNB documentation.

# naive_bayes.py
import numpy as np


class CategoricalNB:
    def __init__(self, alpha=1.0):
        # Store the smoothing hyperparameter.
        pass

    def fit(self, x, y):
        # Set alpha, n_features_, n_categories_, classes_, class_count_,
        # class_log_prior_, category_count_, and feature_log_prob_.
        return self

    def predict_proba(self, x):
        pass

    def predict_log_proba(self, x):
        pass

    def predict(self, x):
        pass

3 Submitting on Web-CAT

Web-CAT will generate random data, train your classifier, train the Scikit-Learn CategoricalNB classifier, and check that the attribute values match and that the prediction methods return the same values.
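
Before submitting, you can run a comparison of the same shape locally. This is only an illustrative sketch; the random data, sizes, and tolerances below are assumptions, not the actual Web-CAT tests:

# check_naive_bayes.py
import numpy as np
from sklearn.naive_bayes import CategoricalNB as SkCategoricalNB
from naive_bayes import CategoricalNB

# Random binary features and three classes, mirroring the data above.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(50, 8))
y = rng.integers(0, 3, size=50)

mine = CategoricalNB(alpha=1.0).fit(x, y)
ref = SkCategoricalNB(alpha=1.0).fit(x, y)

assert np.allclose(mine.class_log_prior_, ref.class_log_prior_)
for a, b in zip(mine.feature_log_prob_, ref.feature_log_prob_):
    assert np.allclose(a, b)
assert np.allclose(mine.predict_proba(x), ref.predict_proba(x))
assert (mine.predict(x) == ref.predict(x)).all()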