A Simple Two-Layer Network

Thoughts

I have to say, modern neural-network frameworks have made backpropagation almost too easy.

Notes

Code reference: cs231n/classifiers/neural_net.py

Summary

Structure of the simple two-layer network
W1: First layer weights; has shape (D, H)
b1: First layer biases; has shape (H,)
W2: Second layer weights; has shape (H, C)
b2: Second layer biases; has shape (C,)

Matrix computations
When you need to subtract each row's maximum from a matrix, or divide each row by its own sum, first apply the operation with axis=1 and then call reshape(-1, 1). Broadcasting then applies the per-row result across the whole matrix.
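A minimal standalone sketch of this pattern (the array values here are made up purely for illustration):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])                # shape (2, 3)

row_max = np.max(A, axis=1)                    # shape (2,) -- 1-D, will not broadcast row-wise as-is
shifted = A - row_max.reshape(-1, 1)           # (2, 3) - (2, 1): broadcasts across each row

row_sum = np.sum(A, axis=1).reshape(-1, 1)     # shape (2, 1)
normalized = A / row_sum                       # each row divided by its own sum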

Forward pass

h_output = np.maximum(0, X.dot(W1) + b1)
scores = h_output.dot(W2) + b2
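To see the shapes line up, a quick standalone check (N, D, H, C are arbitrary toy sizes assumed for illustration):

import numpy as np

N, D, H, C = 5, 4, 10, 3
X = np.random.randn(N, D)
W1, b1 = np.random.randn(D, H), np.zeros(H)
W2, b2 = np.random.randn(H, C), np.zeros(C)

h_output = np.maximum(0, X.dot(W1) + b1)       # (N, D) @ (D, H) + (H,) -> (N, H)
scores = h_output.dot(W2) + b2                 # (N, H) @ (H, C) + (C,) -> (N, C)
assert h_output.shape == (N, H) and scores.shape == (N, C)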

Loss computation

shift_scores = scores - np.max(scores, axis=1).reshape(-1, 1)
softmax_output = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1).reshape(-1, 1)
loss = -np.sum(np.log(softmax_output[range(N), list(y)]))
loss /= N
loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
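A handy sanity check (sketched here with assumed toy sizes): with small random weights and reg = 0, every class gets roughly equal probability, so the loss should come out close to log(C).

import numpy as np

N, C = 5, 3
scores = 0.01 * np.random.randn(N, C)          # near-uniform scores, as at initialization
y = np.random.randint(C, size=N)

shift = scores - np.max(scores, axis=1).reshape(-1, 1)
p = np.exp(shift) / np.sum(np.exp(shift), axis=1).reshape(-1, 1)
loss = -np.sum(np.log(p[range(N), y])) / N
print(loss, np.log(C))                         # the two values should be close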

Backward pass
I found this part a bit hard to follow at first.
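The step that usually causes the confusion is the gradient of the softmax cross-entropy loss with respect to the scores. For a single example with scores $s$, softmax probabilities $p_j = e^{s_j} / \sum_k e^{s_k}$, and true class $y$, the loss is $L = -\log p_y$, and differentiating gives

$$\frac{\partial L}{\partial s_j} = p_j - \mathbb{1}[j = y]$$

This is exactly what the first three lines below compute: copy the softmax output, subtract 1 at each example's true class, and divide by N because the loss is the batch average. The rest is the chain rule through the linear layers and the ReLU, whose local gradient is 1 where its input was positive and 0 elsewhere.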

dscores = softmax_output.copy()
dscores[range(N), list(y)] -= 1
dscores /= N
grads['W2'] = h_output.T.dot(dscores) + reg * W2
grads['b2'] = np.sum(dscores, axis=0)

dh = dscores.dot(W2.T)
dh_ReLU = (h_output > 0) * dh
grads['W1'] = X.T.dot(dh_ReLU) + reg * W1
grads['b1'] = np.sum(dh_ReLU, axis=0)
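One way to convince yourself these formulas are right is to compare them against a numeric gradient. Below is a minimal centered-difference sketch (not part of the assignment code; numeric_gradient and the commented usage are hypothetical names):

import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Estimate df/dx elementwise with centered differences: (f(x+h) - f(x-h)) / (2h)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h
        fp = f(x)                  # f evaluated with x[i] nudged up
        x[i] = old - h
        fm = f(x)                  # f evaluated with x[i] nudged down
        x[i] = old                 # restore
        grad[i] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# hypothetical usage: compare against the analytic gradient for W1
# num = numeric_gradient(lambda W: net.loss(X, y=y, reg=0.05)[0], net.params['W1'])
# print(np.max(np.abs(num - grads['W1'])))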

Gradient descent

self.params['W2'] -= learning_rate * grads['W2']
self.params['b2'] -= learning_rate * grads['b2']
self.params['W1'] -= learning_rate * grads['W1']
self.params['b1'] -= learning_rate * grads['b1']
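These four lines are the plain stochastic gradient descent update, θ ← θ − η·∇θL, applied to each parameter array in turn, with learning_rate playing the role of the step size η.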

The complete two-layer network

import numpy as np


class TwoLayerNet(object):

    def __init__(self, input_size, hidden_size, output_size, std=1e-4):
        """
        W1: First layer weights; has shape (D, H)
        b1: First layer biases; has shape (H,)
        W2: Second layer weights; has shape (H, C)
        b2: Second layer biases; has shape (C,)

        Inputs:
        - input_size: The dimension D of the input data.
        - hidden_size: The number of neurons H in the hidden layer.
        - output_size: The number of classes C.
        """
        self.params = {}
        self.params['W1'] = std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def loss(self, X, y=None, reg=0.0):
        """
        Compute the loss and gradients for a two layer fully connected neural
        network.

        Inputs:
        - X: Input data of shape (N, D). Each X[i] is a training sample.
        - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
          an integer in the range 0 <= y[i] < C. This parameter is optional; if it
          is not passed then we only return scores, and if it is passed then we
          instead return the loss and gradients.
        - reg: Regularization strength.
        """
        # Unpack variables from the params dictionary
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        N, D = X.shape

        # Compute the forward pass
        scores = None
        #############################################################################
        # TODO: Perform the forward pass, computing the class scores for the input. #
        # Store the result in the scores variable, which should be an array of      #
        # shape (N, C).                                                             #
        #############################################################################
        """
        Here we simply compute the hidden-layer output and the output-layer scores.
        Note: np.maximum implements the ReLU nonlinearity.
        """
        h_output = np.maximum(0, X.dot(W1) + b1)
        scores = h_output.dot(W2) + b2
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        # If the targets are not given then jump out, we're done
        if y is None:
            return scores

        # Compute the loss
        loss = None
        #############################################################################
        # TODO: Finish the forward pass, and compute the loss. This should include  #
        # both the data loss and L2 regularization for W1 and W2. Store the result  #
        # in the variable loss, which should be a scalar. Use the Softmax           #
        # classifier loss.                                                          #
        #############################################################################
        """
        scores: has shape (N, C)
        Each row holds one example's score for every class.
        First use np.max(scores, axis=1) to get each example's maximum score.
        Since np.max returns a 1-D array, reshape(-1, 1) turns it into an (N, 1)
        column, which can then be subtracted from the scores by broadcasting.
        In shift_scores, each row's maximum entry becomes 0 and every other entry
        is its (non-positive) gap to that maximum, which keeps np.exp from
        overflowing.

        softmax_output is computed the same way:
        exponentiate the whole array, then divide each row by that row's sum of
        exponentials (which is why axis=1 is followed by reshape(-1, 1)).

        Loss:
        softmax_output[range(N), list(y)] picks out each example's probability
        for its true class.

        Summary:
        when you need to subtract each row's maximum from a matrix, or divide
        each row by its own sum, apply the operation with axis=1 and then
        reshape(-1, 1), so broadcasting updates the whole matrix.
        """
        shift_scores = scores - np.max(scores, axis=1).reshape(-1, 1)
        softmax_output = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1).reshape(-1, 1)
        loss = -np.sum(np.log(softmax_output[range(N), list(y)]))
        loss /= N
        loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        # Backward pass: compute gradients
        grads = {}
        #############################################################################
        # TODO: Compute the backward pass, computing the derivatives of the weights #
        # and biases. Store the results in the grads dictionary. For example,       #
        # grads['W1'] should store the gradient on W1, and be a matrix of same size #
        #############################################################################
        dscores = softmax_output.copy()
        dscores[range(N), list(y)] -= 1
        dscores /= N
        grads['W2'] = h_output.T.dot(dscores) + reg * W2
        grads['b2'] = np.sum(dscores, axis=0)

        dh = dscores.dot(W2.T)
        dh_ReLU = (h_output > 0) * dh
        grads['W1'] = X.T.dot(dh_ReLU) + reg * W1
        grads['b1'] = np.sum(dh_ReLU, axis=0)
        #############################################################################
        #                              END OF YOUR CODE                             #
        #############################################################################

        return loss, grads

    def train(self, X, y, X_val, y_val,
              learning_rate=1e-3, learning_rate_decay=0.95,
              reg=5e-6, num_iters=100,
              batch_size=200, verbose=False):
        """
        Train this neural network using stochastic gradient descent.

        Inputs:
        - X: A numpy array of shape (N, D) giving training data.
        - y: A numpy array of shape (N,) giving training labels; y[i] = c means that
          X[i] has label c, where 0 <= c < C.
        - X_val: A numpy array of shape (N_val, D) giving validation data.
        - y_val: A numpy array of shape (N_val,) giving validation labels.
        - learning_rate: Scalar giving learning rate for optimization.
        - learning_rate_decay: Scalar giving factor used to decay the learning rate
          after each epoch.
        - reg: Scalar giving regularization strength.
        - num_iters: Number of steps to take when optimizing.
        - batch_size: Number of training examples to use per step.
        - verbose: boolean; if true print progress during optimization.
        """
        num_train = X.shape[0]
        iterations_per_epoch = max(num_train // batch_size, 1)

        # Use SGD to optimize the parameters in self.model
        loss_history = []
        train_acc_history = []
        val_acc_history = []

        for it in range(num_iters):
            X_batch = None
            y_batch = None

            #########################################################################
            # TODO: Create a random minibatch of training data and labels, storing  #
            # them in X_batch and y_batch respectively.                             #
            #########################################################################
            idx = np.random.choice(num_train, batch_size, replace=True)
            X_batch = X[idx]
            y_batch = y[idx]
            #########################################################################
            #                             END OF YOUR CODE                          #
            #########################################################################

            # Compute loss and gradients using the current minibatch
            loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
            loss_history.append(loss)

            #########################################################################
            # TODO: Use the gradients in the grads dictionary to update the         #
            # parameters of the network (stored in the dictionary self.params)      #
            # using stochastic gradient descent. You'll need to use the gradients   #
            # stored in the grads dictionary defined above.                         #
            #########################################################################
            self.params['W2'] -= learning_rate * grads['W2']
            self.params['b2'] -= learning_rate * grads['b2']
            self.params['W1'] -= learning_rate * grads['W1']
            self.params['b1'] -= learning_rate * grads['b1']
            #########################################################################
            #                             END OF YOUR CODE                          #
            #########################################################################

            if verbose and it % 100 == 0:
                print('iteration %d / %d: loss %f' % (it, num_iters, loss))

            # Every epoch, check train and val accuracy and decay learning rate.
            if it % iterations_per_epoch == 0:
                # Check accuracy
                train_acc = (self.predict(X_batch) == y_batch).mean()
                val_acc = (self.predict(X_val) == y_val).mean()
                train_acc_history.append(train_acc)
                val_acc_history.append(val_acc)

                # Decay learning rate
                learning_rate *= learning_rate_decay

        return {
            'loss_history': loss_history,
            'train_acc_history': train_acc_history,
            'val_acc_history': val_acc_history,
        }

    def predict(self, X):
        """
        Use the trained weights of this two-layer network to predict labels for
        data points. For each data point we predict scores for each of the C
        classes, and assign each data point to the class with the highest score.

        Inputs:
        - X: A numpy array of shape (N, D) giving N D-dimensional data points to
          classify.

        Returns:
        - y_pred: A numpy array of shape (N,) giving predicted labels for each of
          the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
          to have class c, where 0 <= c < C.
        """
        y_pred = None

        ###########################################################################
        # TODO: Implement this function; it should be VERY simple!                #
        ###########################################################################
        h = np.maximum(0, X.dot(self.params['W1']) + self.params['b1'])
        scores = h.dot(self.params['W2']) + self.params['b2']
        y_pred = np.argmax(scores, axis=1)
        ###########################################################################
        #                            END OF YOUR CODE                             #
        ###########################################################################

        return y_pred
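Finally, a minimal usage sketch on random toy data (the sizes and hyperparameters are arbitrary assumptions, just to exercise the API):

import numpy as np

# hypothetical toy problem: 100 points, 4 features, 3 classes
N, D, H, C = 100, 4, 10, 3
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

net = TwoLayerNet(input_size=D, hidden_size=H, output_size=C)
stats = net.train(X, y, X, y,                  # reusing X, y as a stand-in validation set
                  learning_rate=1e-1, reg=1e-5,
                  num_iters=500, batch_size=50, verbose=False)
print('final loss:', stats['loss_history'][-1])
print('train accuracy:', (net.predict(X) == y).mean())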