diff --git a/assignment-2/submission/17307130133/.keep b/assignment-2/submission/17307130133/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-2/submission/17307130133/README.md b/assignment-2/submission/17307130133/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a2949d3c15d69eb19fb37c1d207072950517f373
--- /dev/null
+++ b/assignment-2/submission/17307130133/README.md
@@ -0,0 +1,128 @@
+# Topic 1
+
+## Implementing the Feedforward Neural Network
+
+### Matmul
+
+Forward:
+$$
+Y=XW
+$$
+Backward:
+$$
+\frac {\partial L}{\partial X}=\frac {\partial L}{\partial Y}\frac {\partial Y}{\partial X}=\frac {\partial L}{\partial Y}\frac {\partial XW}{\partial X}=\frac {\partial L}{\partial Y} W^T\\\\
+\frac {\partial L}{\partial W}=\frac {\partial L}{\partial Y}\frac {\partial Y}{\partial W}=\frac {\partial L}{\partial Y}\frac {\partial XW}{\partial W}=X^T\frac {\partial L}{\partial Y}
+$$
+
+### Relu
+
+Forward (matching the implementation, which uses a strict comparison x > 0):
+$$
+Y_{ij}= \begin{cases} X_{ij} & X_{ij}> 0 \\\\
+0 & otherwise \end{cases}
+$$
+Backward:
+$$
+\frac{\partial L}{\partial X_{ij}}=\frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ij}}
+\\\\
+\frac{\partial Y_{ij}}{\partial X_{ij}}= \begin{cases} 1 & X_{ij}> 0 \\\\
+0 & otherwise \end{cases}\\\\
+\therefore \frac{\partial L}{\partial X_{ij}}=\begin{cases} \frac{\partial L}{\partial Y_{ij}} & X_{ij}> 0 \\\\
+0 & otherwise \end{cases}
+$$
+
+### Log
+
+Forward:
+$$
+Y_{ij}=\ln(X_{ij}+\epsilon)
+$$
+Backward:
+$$
+\frac{\partial L}{\partial X_{ij}}=\frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{ij}}=\frac{\partial L}{\partial Y_{ij}}\frac {1}{X_{ij}+\epsilon}
+$$
+
+### Softmax
+
+Forward:
+$$
+Y_{ij} = \frac {\exp X_{ij}}{\sum_k \exp X_{ik}}
+$$
+For the backward pass, following the textbook:
+
+![softmax](./img/softmax.png)
+
+Softmax is applied to each row independently, so the Jacobian matrix of each row $Y_r$ is
+$$
+J_r = diag(Y_r) - Y_rY_r^T
+$$
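+
+As a quick illustration, here is a minimal NumPy sketch of this row-wise Jacobian rule; it mirrors the Softmax.backward implementation in numpy_fnn.py further down in this diff (the function name and variable names are only illustrative).
+
+```python
+import numpy as np
+
+
+def softmax_backward(out, grad_y):
+    # out: softmax outputs, shape (N, c); grad_y: upstream gradient, shape (N, c).
+    # Build the Jacobian diag(y_r) - y_r y_r^T for each row.
+    jacobians = np.array([np.diag(r) - np.outer(r, r) for r in out])  # (N, c, c)
+    # Multiply each row's upstream gradient by its (symmetric) Jacobian.
+    return np.matmul(grad_y[:, np.newaxis, :], jacobians).squeeze(axis=1)
+```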
+
+### Completing NumpyModel
+
+The forward pass is filled in by following torch_mnist.py, which gives the network structure
+
+input -> matmul_1 -> relu_1 -> matmul_2 -> relu_2 -> matmul_3 -> softmax -> log
+
+Given this structure, the backward pass simply calls the backward methods implemented above in reverse order.
+
+## Experiments
+
+### Implementing mini_batch
+
+When the training set is large, computing the average loss over the whole dataset and then back-propagating makes training slow, while plain stochastic gradient descent may fail to converge to the optimum. Mini-batching is a compromise between the two: the dataset is split into batches, which speeds up training while still converging well.
+
+Following the mini_batch in utils.py, I implemented mini_batch in numpy_mnist.py; its tunable parameters are dataset, shuffle, and batch_size. All experiments below are run with this implementation.
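+
+Its core is just shuffling an index array once per pass and slicing it into chunks of batch_size. A minimal sketch of that idea follows (the helper name is illustrative; the full mini_batch in numpy_mnist.py below additionally converts the torch dataset into NumPy arrays and returns a list of batches):
+
+```python
+import numpy as np
+
+
+def iterate_minibatches(data, label, batch_size=128, shuffle=True):
+    # Shuffle an index array, then walk over it in steps of batch_size.
+    index = np.arange(data.shape[0])
+    if shuffle:
+        np.random.shuffle(index)
+    for i in range(0, data.shape[0], batch_size):
+        idx = index[i:i + batch_size]
+        yield data[idx], label[idx]
+```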
+
+### Training and testing the model in numpy_mnist.py
+
+Experiments were run while varying three factors: batch_size, the learning rate, and whether the data are shuffled.
+
+![batch_size](./img/batch_size.jpg)
+
+In the figure above, the left column uses batch_size = 64 and the right column uses batch_size = 128; the top row uses a learning rate of 0.1 and the bottom row 0.05. Shuffling is enabled in all of these runs. The smaller batch_size converges more slowly within the same number of steps and oscillates more in early training, but its loss actually oscillates less late in training.
+
+![lr](./img/lr.jpg)
+
+Here the left column uses a learning rate of 0.05 and the right column 0.1; within each row the other settings are identical. The smaller learning rate converges more slowly. Unexpectedly, after convergence the loss with the smaller learning rate oscillates more than with the larger one.
+
+![shuffle](./img/shuffle.jpg)
+
+The left plot is trained without shuffling and the right plot with shuffling, all else being equal. Without shuffling the model converges more slowly and its loss oscillates more. This is because, in my mini_batch implementation, training without shuffling always uses the same fixed total_size/batch_size groups of data for every update, which is effectively like training on less varied data.
+
+## Momentum, Adam, and other optimizers
+
+SGD makes the loss oscillate and may leave the final result stuck at a local optimum. To suppress this oscillation, SGD with Momentum (SGDM) adds first-order momentum to gradient descent:
+$$
+m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t
+$$
+In other words, descent speeds up along directions where the gradient keeps pointing the same way, and the updates slow down along dimensions where the gradient direction changes. The usual value of $\beta_1$ is 0.9, so the descent direction is mainly the accumulated past direction, tilted slightly towards the current gradient.
+
+Besides first-order momentum, second-order momentum is also widely used. For frequently updated parameters we have already accumulated plenty of information and do not want a single sample to have too much influence, so their learning rate should be small; for rarely updated parameters we know too little and want to learn more from each occasional sample, so their learning rate should be large. A good way to compute second-order momentum is to accumulate not the full gradient history but only the gradients within a recent time window, which avoids the learning rate decaying too sharply. This is AdaDelta/RMSProp:
+$$
+V_t = \beta_2 V_{t-1}+(1-\beta_2)g_t^2
+$$
+Adam then follows naturally by combining first-order and second-order momentum; its update step is
+$$
+\Delta = \alpha\, m_t/\sqrt{V_t}
+$$
+To avoid a zero denominator, a small smoothing term is usually added to it. The usual value of $\beta_2$ is 0.999, and the Adam learning rate is 0.001.
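+
+In NumPy, the two update rules look roughly as follows; this sketch mirrors optimizeM and optimizeAM in numpy_fnn.py further down in this diff (like that implementation, it omits Adam's bias correction, and the function names are only illustrative).
+
+```python
+import numpy as np
+
+
+def sgdm_step(w, grad, m, lr=0.1, beta_1=0.9):
+    # First-order momentum: exponential moving average of gradients.
+    m = beta_1 * m + (1 - beta_1) * grad
+    return w - lr * m, m
+
+
+def adam_step(w, grad, m, v, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
+    # First- and second-order momentum, then a scaled update step.
+    m = beta_1 * m + (1 - beta_1) * grad
+    v = beta_2 * v + (1 - beta_2) * grad * grad
+    return w - lr * m / (np.sqrt(v) + eps), m, v
+```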
+
+Reference: https://zhuanlan.zhihu.com/p/32230623
+
+![op](./img/op.jpg)
+
+In the figure above, from left to right: no optimizer, momentum, and Adam; the first two plots use a learning rate of 0.1 and the last one 0.001. With momentum, convergence is faster in the early phase and the loss oscillates less later on. With Adam, despite the much smaller learning rate, convergence is even faster, and the loss oscillates less than without an optimizer throughout training.
+
+## Summary
+
+Completed the automated tests (60%)
+
+![tester](./img/tester.png)
+
+Trained and tested the model in numpy_mnist.py (20%)
+
+Implemented the mini_batch function using NumPy only (10%)
+
+Derived the backward-propagation formulas for the operators in numpy_fnn.py (10%)
+
+Implemented momentum, Adam, and other optimization methods and compared them experimentally (10%)
\ No newline at end of file
diff --git a/assignment-2/submission/17307130133/img/.keep b/assignment-2/submission/17307130133/img/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-2/submission/17307130133/img/batch_size.jpg b/assignment-2/submission/17307130133/img/batch_size.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..0ebacf156ea994b2ea3133c1b70b25efa3825d56
Binary files /dev/null and b/assignment-2/submission/17307130133/img/batch_size.jpg differ
diff --git a/assignment-2/submission/17307130133/img/lr.jpg b/assignment-2/submission/17307130133/img/lr.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..822e355d41cc931ae2cc991f2ffa17d4b4a66e0b
Binary files /dev/null and b/assignment-2/submission/17307130133/img/lr.jpg differ
diff --git a/assignment-2/submission/17307130133/img/op.jpg b/assignment-2/submission/17307130133/img/op.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..f027192dc284b4fc444aa33fe4e6a63f401a4b97
Binary files /dev/null and b/assignment-2/submission/17307130133/img/op.jpg differ
diff --git a/assignment-2/submission/17307130133/img/shuffle.jpg b/assignment-2/submission/17307130133/img/shuffle.jpg
new file mode 100644
index 0000000000000000000000000000000000000000..ca6fc568cefd9fcf395bda0616727b8b346f07a0
Binary files /dev/null and b/assignment-2/submission/17307130133/img/shuffle.jpg differ
diff --git a/assignment-2/submission/17307130133/img/softmax.png b/assignment-2/submission/17307130133/img/softmax.png
new file mode 100644
index 0000000000000000000000000000000000000000..e8ae7d1b57865a9c23e15176a13ecebaa105ec03
Binary files /dev/null and b/assignment-2/submission/17307130133/img/softmax.png differ
diff --git a/assignment-2/submission/17307130133/img/tester.png b/assignment-2/submission/17307130133/img/tester.png
new file mode 100644
index 0000000000000000000000000000000000000000..29246264c69d2decc2ed5f34df698fcd07481303
Binary files /dev/null and b/assignment-2/submission/17307130133/img/tester.png differ
diff --git a/assignment-2/submission/17307130133/numpy_fnn.py b/assignment-2/submission/17307130133/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..67cc7bc6beac060ebe52fcee790214b1c9f1dc83
--- /dev/null
+++ b/assignment-2/submission/17307130133/numpy_fnn.py
@@ -0,0 +1,210 @@
+import numpy as np
+
+
+class NumpyOp:
+
+    def __init__(self):
+        self.memory = {}
+        self.epsilon = 1e-12
+
+
+class Matmul(NumpyOp):
+
+    def forward(self, x, W):
+        """
+        x: shape(N, d)
+        W: shape(d, d')
+        """
+        self.memory['x'] = x
+        self.memory['W'] = W
+        h = np.matmul(x, W)
+        return h
+
+    def backward(self, grad_y):
+        """
+        grad_y: shape(N, d')
+        """
+
+        ####################
+        #      code 1      #
+        ####################
+        grad_x = np.matmul(grad_y, self.memory['W'].T)
+        grad_W = np.matmul(self.memory['x'].T, grad_y)
+        return grad_x, grad_W
+
+
+class Relu(NumpyOp):
+
+    def forward(self, x):
+        self.memory['x'] = x
+        return np.where(x > 0, x, np.zeros_like(x))
+
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+
+        ####################
+        #      code 2      #
+        ####################
+        x = self.memory['x']
+        grad_x = grad_y * np.where(x > 0, np.ones_like(x), np.zeros_like(x))
+        return grad_x
+
+
+class Log(NumpyOp):
+
+    def forward(self, x):
+        """
+        x: shape(N, c)
+        """
+
+        out = np.log(x + self.epsilon)
+        self.memory['x'] = x
+
+        return out
+
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+
+        ####################
+        #      code 3      #
+        ####################
+        grad_x = grad_y * np.reciprocal(self.memory['x'] + self.epsilon)
+
+        return grad_x
+
+
+class Softmax(NumpyOp):
+    """
+    softmax over last dimension
+    """
+
+    def forward(self, x):
+        """
+        x: shape(N, c)
+        """
+
+        ####################
+        #      code 4      #
+        ####################
+        out = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
+        self.memory['x'] = x
+        self.memory['out'] = out
+        return out
+
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+
+        ####################
+        #      code 5      #
+        ####################
+        out = self.memory['out']
+        # Per-row Jacobian diag(y) - y y^T, applied to the upstream gradient.
+        Jacobs = np.array([np.diag(r) - np.outer(r, r) for r in out])
+        grad_y = grad_y[:, np.newaxis, :]
+        grad_x = np.matmul(grad_y, Jacobs).squeeze(axis=1)
+        return grad_x
+
+
+class NumpyLoss:
+
+    def __init__(self):
+        self.target = None
+
+    def get_loss(self, pred, target):
+        self.target = target
+        return (-pred * target).sum(axis=1).mean()
+
+    def backward(self):
+        return -self.target / self.target.shape[0]
+
+
+class NumpyModel:
+    def __init__(self):
+        self.W1 = np.random.normal(size=(28 * 28, 256))
+        self.W2 = np.random.normal(size=(256, 64))
+        self.W3 = np.random.normal(size=(64, 10))
+
+        # The following operators are used in both forward and backward.
+        self.matmul_1 = Matmul()
+        self.relu_1 = Relu()
+        self.matmul_2 = Matmul()
+        self.relu_2 = Relu()
+        self.matmul_3 = Matmul()
+        self.softmax = Softmax()
+        self.log = Log()
+
+        # The following variables are updated in backward. softmax_grad, log_grad, etc.
+        # are the gradients propagated through each operator (partial derivatives of
+        # the loss with respect to that operator's input).
+        self.x1_grad, self.W1_grad = None, None
+        self.relu_1_grad = None
+        self.x2_grad, self.W2_grad = None, None
+        self.relu_2_grad = None
+        self.x3_grad, self.W3_grad = None, None
+        self.softmax_grad = None
+        self.log_grad = None
+
+        # First- and second-order momentum terms for momentum and Adam.
+        self.W1_m = 0
+        self.W2_m = 0
+        self.W3_m = 0
+        self.W1_v = 0
+        self.W2_v = 0
+        self.W3_v = 0
+
+    def forward(self, x):
+        x = x.reshape(-1, 28 * 28)
+
+        ####################
+        #      code 6      #
+        ####################
+        x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
+        x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
+        x = self.matmul_3.forward(x, self.W3)
+        x = self.softmax.forward(x)
+        x = self.log.forward(x)
+        return x
+
+    def backward(self, y):
+
+        ####################
+        #      code 7      #
+        ####################
+        self.log_grad = self.log.backward(y)
+        self.softmax_grad = self.softmax.backward(self.log_grad)
+        self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+        self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+        self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+        self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+        self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+    def optimize(self, learning_rate):
+        # Plain SGD update.
+        self.W1 -= learning_rate * self.W1_grad
+        self.W2 -= learning_rate * self.W2_grad
+        self.W3 -= learning_rate * self.W3_grad
+
+    def optimizeM(self, learning_rate, beta_1=0.9):
+        # SGD with first-order momentum.
+        self.W1_m = beta_1 * self.W1_m + (1 - beta_1) * self.W1_grad
+        self.W2_m = beta_1 * self.W2_m + (1 - beta_1) * self.W2_grad
+        self.W3_m = beta_1 * self.W3_m + (1 - beta_1) * self.W3_grad
+
+        self.W1 -= learning_rate * self.W1_m
+        self.W2 -= learning_rate * self.W2_m
+        self.W3 -= learning_rate * self.W3_m
+
+    def optimizeAM(self, learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
+        # Adam-style update with first- and second-order momentum (no bias correction).
+        self.W1_m = beta_1 * self.W1_m + (1 - beta_1) * self.W1_grad
+        self.W2_m = beta_1 * self.W2_m + (1 - beta_1) * self.W2_grad
+        self.W3_m = beta_1 * self.W3_m + (1 - beta_1) * self.W3_grad
+
+        self.W1_v = beta_2 * self.W1_v + (1 - beta_2) * self.W1_grad * self.W1_grad
+        self.W2_v = beta_2 * self.W2_v + (1 - beta_2) * self.W2_grad * self.W2_grad
+        self.W3_v = beta_2 * self.W3_v + (1 - beta_2) * self.W3_grad * self.W3_grad
+
+        self.W1 -= learning_rate * self.W1_m / (self.W1_v ** 0.5 + epsilon)
+        self.W2 -= learning_rate * self.W2_m / (self.W2_v ** 0.5 + epsilon)
+        self.W3 -= learning_rate * self.W3_m / (self.W3_v ** 0.5 + epsilon)
diff --git a/assignment-2/submission/17307130133/numpy_mnist.py b/assignment-2/submission/17307130133/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..5c231927da7f8c235a73f77285a6404cf7473a3c
--- /dev/null
+++ b/assignment-2/submission/17307130133/numpy_mnist.py
@@ -0,0 +1,55 @@
+import numpy as np
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot
+
+
+def mini_batch(dataset, shuffle=True, batch_size=128):
+    data = np.array([each[0].numpy() for each in dataset])
+    label = np.array([each[1] for each in dataset])
+    data_size = data.shape[0]
+
+    # Shuffle an index array, then slice it into batches of size batch_size.
+    index = np.arange(data_size)
+    if shuffle:
+        np.random.shuffle(index)
+
+    mini_batches = []
+
+    for i in range(0, data_size, batch_size):
+        mini_batches.append([data[index[i:i + batch_size]], label[index[i:i + batch_size]]])
+    return mini_batches
+
+
+def numpy_run():
+    train_dataset, test_dataset = download_mnist()
+
+    model = NumpyModel()
+    numpy_loss = NumpyLoss()
+    model.W1, model.W2, model.W3 = get_torch_initialization()
+
+    train_loss = []
+
+    epoch_number = 15
+    learning_rate = 0.1
+
+    for epoch in range(epoch_number):
+        for x, y in mini_batch(train_dataset):
+            y = one_hot(y)
+
+            y_pred = model.forward(x)
+            loss = numpy_loss.get_loss(y_pred, y)
+
+            model.backward(numpy_loss.backward())
+            # model.optimize(learning_rate)
+            model.optimizeAM()
+
+            train_loss.append(loss.item())
+
+        x, y = batch(test_dataset)[0]
+        accuracy = np.mean(model.forward(x).argmax(axis=1) == y)
+        print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+    plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+    numpy_run()