Naive Bayes In MNIST

Algorithm overview:

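Naive Bayes is a classification method based on Bayes' theorem and the assumption that the features are conditionally independent given the class. Given a training set, it first learns the joint probability distribution of inputs and outputs under that independence assumption; then, for a new input, it uses Bayes' theorem to pick the output with the largest posterior probability. The model is simple to implement and efficient in both training and prediction, which makes it a commonly used method.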
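Written out, the decision rule described above looks roughly like this. The symbols are notation introduced here for illustration: $K$ is the number of classes, $n$ the number of features, $N$ the number of training samples, $N_c$ the number of samples in class $c$, and $N_{cj}$ the number of class-$c$ samples whose $j$-th feature is on. The evidence term $P(X = x)$ is dropped because it is the same for every class. The smoothed estimates follow the Laplace smoothing used in this post's script, where $K = 10$, $n = 784$, and each binarized pixel has two states.

```latex
\hat{y} = \arg\max_{c} \; P(Y = c) \prod_{j=1}^{n} P(X_j = x_j \mid Y = c)

P(Y = c) \approx \frac{N_c + 1}{N + K},
\qquad
P(X_j = 1 \mid Y = c) \approx \frac{N_{cj} + 1}{N_c + 2},
\qquad
P(X_j = 0 \mid Y = c) = 1 - P(X_j = 1 \mid Y = c)
```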

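On MNIST, the features are the individual pixel values of each input image, and the joint probability distribution is estimated from the known inputs and labels. When new data is fed in, the trained model classifies each sample by computing the class with the maximum posterior probability.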
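As a small illustration of "one feature per pixel", the sketch below flattens an image into a feature vector and binarizes it. The threshold of 50 matches the value passed to `binary` in this post's script; the function name `binarize`, the 0/1 encoding, and the toy 2x3 image are assumptions made only for this example. Unlike `binary`, this version does not modify its input.

```python
import numpy as np


def binarize(image, threshold=50):
    """Return a flat 0/1 feature vector without modifying the input."""
    flat = np.asarray(image).reshape(-1)
    return (flat > threshold).astype(np.uint8)


# A hypothetical 2x3 "image" standing in for a 28x28 MNIST digit.
toy_image = np.array([[0, 120, 255],
                      [30, 51, 50]])
features = binarize(toy_image)
print(features)  # [0 1 1 0 1 0]
```

A real MNIST image has 28 x 28 pixels, so the same call would yield a vector of 784 features.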

The code is as follows:

"""
@Author: mfzhu
"""
import numpy as np
from collections import Counter
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def binary(data, threshold):
    """
    :param data: the data to be binarized (modified in place)
    :param threshold: threshold value for binarization
    :return: the data after binarization
    """
    data[data > threshold] = 255
    data[data <= threshold] = 0
    return data


def get_the_prior_probability(label):
    """
    :param label: label data
    :return: a dict containing the prior probability of each label
    """
    prior_pro = Counter(label)
    sample_num = len(label)
    for key in prior_pro.keys():
        # Laplace smoothing: +1 per class, +10 for the ten digit classes
        prior_pro[key] = (prior_pro[key] + 1) / (sample_num + 10)
    return prior_pro


def get_the_conditional_probability(data, label):
    """
    :param data: the binarized data
    :param label: label data
    :return: a dict mapping each label to its per-pixel probability of being nonzero
    """
    per_label_data = {}
    condition_pro = {}
    for i in range(10):
        per_label_data[i] = data[np.where(label == i)]
    for key in per_label_data.keys():
        pro_array = []
        for j in range(784):
            # Laplace smoothing: +1 per pixel state, +2 for the two states
            pro_array.append(
                (np.count_nonzero(per_label_data[key][:, j]) + 1) / (per_label_data[key].shape[0] + 2))
        condition_pro[key] = pro_array
    return condition_pro


def sample_map(input_data, condition_pro, prior_pro):
    """
    :param input_data: a single sample
    :param condition_pro: conditional probability
    :param prior_pro: prior probability
    :return: the label of the sample according to the MAP estimate
    """
    result = {}
    for key in prior_pro.keys():
        pro = prior_pro[key]
        for k in range(len(input_data)):
            if input_data[k] != 0:
                pro *= condition_pro[key][k]
            else:
                pro *= (1 - condition_pro[key][k])
        result[key] = pro
    # pick the label with the largest (unnormalized) posterior
    return max(zip(result.values(), result.keys()))[1]


def data_set_map(data, condition_pro, prior_pro):
    """
    :param data: data set
    :param condition_pro: conditional probability
    :param prior_pro: prior probability
    :return: a list containing the predicted labels of the input data set
    """
    result = []
    for j in range(data.shape[0]):
        result.append(sample_map(data[j, :], condition_pro, prior_pro))
    return result


if __name__ == '__main__':
    raw_path = r'F:\work and learn\ML\dataset\MNIST\train.csv'
    # raw data file path
    test_path = r'F:\work and learn\ML\dataset\MNIST\test.csv'
    # test data file path (not used below)

    print("start read data:")
    time1 = time.time()
    raw_data = np.loadtxt(raw_path, delimiter=',', skiprows=1)
    label = raw_data[:, 0]
    data = raw_data[:, 1:]
    # extract the label data and image data
    data = binary(data, 50)
    # binarize the image data
    data_train, data_test, label_train, label_test = train_test_split(
        data, label, test_size=0.33, random_state=23333)
    # split the labeled data into a training part and a held-out test part
    time2 = time.time()
    print("read data cost:", time2 - time1, " seconds", '\n')

    print("start training:")
    # use the training data and labels to calculate the prior probability
    # and the conditional probability
    prior_pro = get_the_prior_probability(label_train)
    condition_pro = get_the_conditional_probability(data_train, label_train)
    time3 = time.time()
    print("training cost: ", time3 - time2, " seconds", '\n')

    print("start predicting:")
    predict = data_set_map(data_test, condition_pro, prior_pro)
    # accuracy on the held-out split
    train_result = accuracy_score(label_test, predict)
    time4 = time.time()
    print("predict cost: ", time4 - time3, " seconds", '\n')
    print("the accuracy is: ", train_result)
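`sample_map` multiplies 784 probabilities per class in a Python loop, which is slow and can produce very small floating-point values. A common alternative is to sum log-probabilities and let NumPy handle all samples at once. The following is a minimal sketch of that idea, not the script's own method: the names `fit_bernoulli_nb` and `predict_log_space` and the toy data are hypothetical, and details such as tie-breaking and the handling of classes absent from the training labels may differ slightly from the script.

```python
import numpy as np


def fit_bernoulli_nb(data, label, n_classes=10):
    """Estimate smoothed log-priors and per-feature "on" probabilities.

    data: (n_samples, n_features) array; any nonzero value means "pixel on".
    label: (n_samples,) array of class ids in 0..n_classes-1.
    """
    on = (np.asarray(data) != 0).astype(np.float64)
    label = np.asarray(label).astype(int)
    counts = np.bincount(label, minlength=n_classes)
    # Laplace smoothing as in the script: +1 in the numerator,
    # +n_classes for the priors, +2 (two pixel states) for the conditionals.
    log_prior = np.log((counts + 1) / (len(label) + n_classes))
    one_hot = np.eye(n_classes)[label]        # (n_samples, n_classes)
    on_counts = one_hot.T @ on                # (n_classes, n_features)
    p_on = (on_counts + 1) / (counts[:, None] + 2)
    return log_prior, p_on


def predict_log_space(data, log_prior, p_on):
    """Return the MAP class id for every row of data."""
    on = (np.asarray(data) != 0).astype(np.float64)
    # log P(x | c) = sum_j [x_j * log p_cj + (1 - x_j) * log(1 - p_cj)]
    log_like = on @ np.log(p_on).T + (1.0 - on) @ np.log(1.0 - p_on).T
    return np.argmax(log_like + log_prior, axis=1)


# Toy data with two features and two classes (assumed for illustration only).
toy_data = np.array([[255, 0], [255, 0], [0, 255], [0, 255]])
toy_label = np.array([0, 0, 1, 1])
log_prior, p_on = fit_bernoulli_nb(toy_data, toy_label, n_classes=2)
print(predict_log_space(np.array([[255, 0], [0, 255]]), log_prior, p_on))
```

Because the smoothed probabilities lie strictly between 0 and 1, both logarithms are finite, and the argmax over summed logs selects the same class as the argmax over the product whenever the product does not underflow.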

Model training results:

By comparison, Naive Bayes achieves lower accuracy, but its advantages are a simple model and fast training.
