Logistic In MNIST

A brief overview of the algorithm:

The binomial logistic regression model is a classification model expressed by the conditional probability P(Y|X), which takes the form of a parameterized logistic distribution.
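Concretely, for an input vector x the model is defined by

$$P(Y=1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}, \qquad P(Y=0 \mid x) = \frac{1}{1 + \exp(w \cdot x + b)}$$

where w is the weight vector and b is the intercept. In the code below, b is absorbed into w by appending a constant feature 1 to every sample.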

In this experiment, since the binomial form of the model can only perform binary classification, we remap the dataset's labels so that only 0 and 1 remain; the input is still the image vector, as before. The main things to understand are the derivation of the model and how to iterate toward an approximately optimal solution.
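For reference, training maximizes the log-likelihood of the data by gradient ascent. With the intercept absorbed into w, the log-likelihood and its gradient are

$$L(w) = \sum_i \left[ y_i \,(w \cdot x_i) - \log\left(1 + e^{w \cdot x_i}\right) \right], \qquad \nabla L(w) = \sum_i \left( y_i - \sigma(w \cdot x_i) \right) x_i$$

so each iteration performs the update $w \leftarrow w + \eta \sum_i (y_i - \sigma(w \cdot x_i))\, x_i$, where $\sigma$ is the sigmoid function and $\eta$ is the step size. This is exactly the difference * train_data step in the code below.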

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @File : LogisticRegression.py
# @Author: mfzhu
# @Date : 2017/4/15
# @Desc :
import time

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def binary(data, threshold):
    """
    :param data: the data to binarize
    :param threshold: the binarization threshold
    :return: the binarized data
    """
    index_1 = np.array(data > threshold)
    index_2 = np.array(data <= threshold)
    data[index_1] = 1
    data[index_2] = 0
    return data


class LogisticRegression(object):

    def __init__(self):
        self.max_iteration = 10000  # number of weight-update iterations
        self.step = 0.00001  # step size (learning rate) for each update
        self.weights = []  # model weights

    def sigmoid(self, x):
        """
        :param x: input value(s)
        :return: the sigmoid value of x
        """
        # equivalent to exp(x) / (1 + exp(x)); this form avoids
        # overflow when x is large and positive
        return 1.0 / (1.0 + np.exp(-x))

    def _predict(self, feature):
        """
        Internal helper: predict the label of a single sample
        (the bias term 1 must already be appended to the feature).
        :param feature: one sample
        :return: the predicted label, 0 or 1
        """
        wx = np.dot(np.ravel(self.weights), feature)
        predict1 = self.sigmoid(wx)  # P(Y = 1 | x)
        predict0 = 1 - predict1  # P(Y = 0 | x)
        if predict1 > predict0:
            return 1
        else:
            return 0

    def train(self, train_data, label_data):
        """
        :param train_data: training samples
        :param label_data: labels of the training samples
        """
        self.weights = np.ones(train_data.shape[1] + 1, float)
        # one extra weight dimension for the intercept term
        self.weights.shape = (train_data.shape[1] + 1, 1)
        # reshape the weights into a column vector
        train_data = np.column_stack((train_data, np.ones(train_data.shape[0])))
        # append a column of ones to the data for the intercept
        train_data = np.matrix(train_data)
        # convert to a matrix for convenient multiplication
        iteration = 0
        while iteration < self.max_iteration:
            wx_matrix = train_data * self.weights
            # product of the training data and the weights
            prediction = self.sigmoid(wx_matrix)
            # predicted probabilities for every sample
            difference = np.matrix(label_data) - prediction.T
            # difference between the labels and the predictions
            self.weights += self.step * (difference * train_data).T
            # gradient-ascent update of the weights
            iteration += 1

    def predict(self, features):
        """
        :param features: samples to predict
        :return: the predicted labels
        """
        label = []
        for feature in features:
            x = list(feature)
            x.append(1)
            # append the bias term, then reuse the internal helper
            label.append(self._predict(x))
        return label


if __name__ == '__main__':
    time1 = time.time()
    print("start reading the data:")
    raw_path = r'F:\work and learn\ML\dataset\MNIST\train.csv'
    raw_data = np.loadtxt(raw_path, delimiter=',', skiprows=1)
    # load the raw data
    label = raw_data[:, 0]
    data = raw_data[:, 1:]
    # separate the labels from the pixel data
    data = binary(data, 100)
    label[label != 0] = 1
    # binarize the pixels, and binarize the labels (this is a binary classification task)
    train_data, test_data, train_label, test_label = train_test_split(data, label, test_size=0.33, random_state=23323)
    # split the data into a training set and a test set
    time2 = time.time()
    print("read data cost:", time2 - time1, " seconds", '\n')
    print("start training the model:")
    lr = LogisticRegression()
    lr.train(train_data, train_label)
    # instantiate the logistic regression model and train it
    time3 = time.time()
    print("training cost:", time3 - time2, ' seconds', '\n')
    print("start predicting:")
    test_predict = lr.predict(test_data)
    # predict the test-set labels with the trained model
    score = accuracy_score(test_label, test_predict)
    time4 = time.time()
    print("predicting cost:", time4 - time3, ' seconds', '\n')
    print("the accuracy_score is:", score)

Training results of the model:

As we can see, logistic regression achieves a high accuracy, but correspondingly its training time is also relatively long.

[Screenshot: console output of the training run (timings and accuracy)]
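Most of the running time goes into the 10,000 full-batch gradient steps in train, and predict also loops over samples one at a time in pure Python. As a hypothetical speed-up for the prediction side (a sketch assuming the trained lr object from the script above), the same matrix arithmetic used in train can classify all test samples at once:

import numpy as np

def predict_vectorized(lr, features):
    # append the bias column once, mirroring what train() does internally
    X = np.column_stack((features, np.ones(features.shape[0])))
    wx = X @ np.ravel(lr.weights)  # w.x for every sample in one product
    # sigmoid(wx) > 0.5 exactly when wx > 0, so the sigmoid can be skipped
    return (wx > 0).astype(int)

Calling predict_vectorized(lr, test_data) gives the same labels as lr.predict(test_data), since comparing the two class probabilities reduces to checking the sign of w.x.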