diff --git a/assignment-2/submission/18307130116/README.md b/assignment-2/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60d6a7aaf412e4f028a1124ff7cc63b243e2c2d7
--- /dev/null
+++ b/assignment-2/submission/18307130116/README.md
@@ -0,0 +1,160 @@
# FNN Implementation

[toc]

## Model Implementation

The individual operators are implemented following the [Operator Derivatives](#operator-derivatives) section below. The network structure is shown in the following figure:

![model](img/model.png)

Following this structure, the operators are chained together in order, and during back-propagation the gradient is passed back layer by layer starting from the loss. Nothing here is particularly tricky; the resulting model computes

$\log(\mathrm{softmax}(W_3\,\sigma(W_2\,\sigma(W_1X))))$

## Model Training

Running the provided `numpy_mnist.py` for three epochs gives the accuracies below; the corresponding loss curve is shown underneath.

| epoch | Accuracy |
| ----- | -------- |
| 0     | 94.49%   |
| 1     | 96.47%   |
| 2     | 96.58%   |

![Figure_1](img/Figure_1.png)

### Effect of learning rate and number of epochs

After the loss has dropped to a certain level it starts to oscillate, presumably because the learning rate is too large once the parameters get close to an optimum. To reach better performance I lowered the learning rate and increased the number of epochs, and also ran a control experiment that only increases the number of epochs without changing the learning rate. In the table below, a row labelled n (for n < 20) reports the median accuracy over epochs [n, n+5), and the row labelled 20 is the final result.

| epoch | Accuracy (lr = 0.1) | Accuracy (lr = 0.05) | Accuracy (lr = 0.1 → 0.05) |
| ----- | ------------------- | -------------------- | -------------------------- |
| 0     | 97.27%              | 95.85%               | 96.59%                     |
| 5     | 97.93%              | 97.85%               | 97.91%                     |
| 10    | 98.03%              | 98.03%               | 98.18%                     |
| 15    | 98.12%              | 98.09%               | 98.18%                     |
| 20    | 98.12%              | 98.19%               | 98.18%                     |

<center class="half">
    <img src="img/Figure_2.png" width="30%"/><img src="img/Figure_3.png" width="30%"/><img src="img/Figure_4.png" width="30%"/>
    From left to right: lr = 0.1, lr = 0.05, lr = 0.1 → 0.05
</center>
As expected, lowering the learning rate slows down convergence: over epochs 0–5 the run with lr = 0.1 already reaches 97.27%, while lr = 0.05 is still at 95.85%. Looking at the final results, the smaller learning rate also ends up slightly worse overall: although it happens to reach the highest single value at epoch 20, its median over epochs 15–20 is still below that of lr = 0.1, presumably because with such a small step size the model has not fully converged by epoch 20.

Furthermore, the model has essentially converged by epoch 10. Weighing the faster convergence of lr = 0.1 against the smaller steps of lr = 0.05 (which are more likely to settle close to a good optimum), I made a simple trade-off: use a learning rate of 0.1 for the first 10 epochs and 0.05 for the last 10. This speeds up convergence while reducing the oscillation near the optimum. The outcome matches expectations: the median accuracy over epochs 15–20 improves by 0.06 percentage points, and the loss curve shows visibly smaller oscillations around step 6000.

In real training there is a whole family of schedulers that adjust the learning rate dynamically based on the gradients; this experiment is only a simplified version of that, but it confirms how important learning-rate scheduling is.

On the other hand, increasing the number of epochs also clearly improves the final performance and lets the model converge further, as expected.

## `mini_batch` implementation

The original `mini_batch` simply wrapped PyTorch's dataloader; essentially, given a `batch_size`, it returns batches of that size. To reproduce this behaviour, the numpy reimplementation first stores the entire dataset in a list, and the `shuffle` behaviour of the original function is reproduced by shuffling an index array with numpy.

Since the dataloader's `drop_last` argument defaults to `False`, the `mini_batch` implementation keeps the final partial batch when the dataset size is not a multiple of `batch_size` and `drop_last` is `False`; that last batch is simply smaller.

The function returns a list of `num` `(data, label)` pairs, where `num` is the number of batches, each `data` is a numpy array of shape `[batch_size, 1, 28, 28]`, and each `label` has shape `[batch_size]`.

## Operator Derivatives

Throughout this part I use matrix-calculus techniques: the core problem is differentiating a scalar with respect to a matrix through a composition of functions, which can be handled with properties of the differential and the trace trick. Starting from

$$
dl = \mathrm{tr}\!\left(\frac{\partial l}{\partial Y}^{T} dY\right), \qquad Y=\sigma(X),
$$

each derivation rewrites this into the form

$$
dl = \mathrm{tr}\!\left(\frac{\partial l}{\partial X}^{T} dX\right),
$$

from which $\frac{\partial l}{\partial X}$ can be read off directly. Below, $D_Y$ denotes $\frac{\partial l}{\partial Y}$ and $\odot$ the element-wise product. The `softmax` derivation is the most involved, so it is presented in detail; the other operators rely on properties that are mostly covered by the `softmax` case, and their derivations are abbreviated. (A numerical finite-difference check of these gradients is sketched after the `numpy_fnn.py` listing below.)

### Softmax

(Because of gitee's limited formula support, the derivation is included as screenshots; the final result is restated in LaTeX right after this report.)

![softmax1](img/softmax1.png)

![softmax2](img/softmax2.png)

### Log

$dl = \mathrm{tr}\!\left(\frac{\partial l}{\partial Y}^{T} dY\right),\quad Y=\log(X+\epsilon)$

$dY = d\log(X+\epsilon) = \log'(X+\epsilon)\odot dX = \frac{1}{X+\epsilon}\odot dX$

$dl = \mathrm{tr}\!\left(D_Y^{T}\left(\frac{1}{X+\epsilon}\odot dX\right)\right) = \mathrm{tr}\!\left(\left(D_Y\odot \frac{1}{X+\epsilon}\right)^{T} dX\right)$

$D_X = D_Y\odot \frac{1}{X+\epsilon}$

### Relu

$dl = \mathrm{tr}\!\left(\frac{\partial l}{\partial Y}^{T} dY\right),\quad Y=h(X)$, where $h$ is ReLU applied element-wise

$h'(x_{ij}) = 1$ for $x_{ij} \geq 0$

$h'(x_{ij}) = 0$ for $x_{ij} < 0$

The rest follows exactly as for `Log`: $D_X = D_Y\odot h'(X)$

### Matmul

(Because of the same gitee formula issue, the derivation is a screenshot; the result is likewise restated right after this report.)

![matmul](img/matmul.png)

## Optimizers

### How Adam works

Similar in spirit to the manual learning-rate adjustment in the experiments above, the Adam optimizer stands out for adapting the learning rate automatically and has essentially become the default optimizer for many models; the choice of the initial learning rate still influences the optimization, though.

Adam's update rule is $\theta_t = \theta_{t-1}-\alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$, where $\hat m_t$ estimates the first moment of the gradient with an exponential moving average and divides by $1-\beta_1^t$ to correct the bias caused by initializing the average at zero. With $g_t$ denoting the gradient:

$\hat m_t = m_t/(1-\beta_1^t)$, $m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t$

The second moment is estimated in the same way: $\hat v_t = v_t/(1-\beta_2^t)$, $v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2$

$\epsilon$ only serves to avoid division by zero.

### How Momentum works

The idea behind Momentum is similar to Adam's, but it does not rescale the learning rate by the second-moment estimate. It also keeps an exponentially weighted moving average, giving the current gradient a relatively small weight, which smooths out the oscillation of the gradient near an optimum and lets the iterates get closer to it.

Its update is

$v_t = \beta v_{t-1}+(1-\beta)\,dW$

$W = W - \alpha v_t$

### Implementation

With the formulas above, I implemented an `Adam` class and a `Momentum` class in `numpy_mnist.py`. Since `numpy_fnn.py` must not be modified, the overall approach is to create one optimizer instance per parameter; each instance keeps the state from the previous iteration in its own member variables and returns the updated parameter after each step. For example, `Momentum` is used as follows (a sketch of the corresponding `Adam` wiring is included after the code listings below):

`model.W1 = W1_opt.optimize(model.W1, model.W1_grad)`

i.e. the new weights are computed and then assigned back to the model.

### Experimental comparison

The two optimizers are compared against the best result obtained earlier, the lr = 0.1 → 0.05 schedule; the accuracy and loss evolve as follows.

| epoch | Accuracy (lr = 0.1 → 0.05) | Accuracy (Adam, $\alpha = 0.001$) | Accuracy (Momentum, $\alpha = 0.1$) |
| ----- | -------------------------- | --------------------------------- | ----------------------------------- |
| 0     | 96.59%                     | 97.46%                            | 97.01%                              |
| 5     | 97.91%                     | 97.69%                            | 97.95%                              |
| 10    | 98.18%                     | 97.80%                            | 98.07%                              |
| 15    | 98.18%                     | 97.98%                            | 98.22%                              |
| 20    | 98.18%                     | 98.04%                            | 98.36%                              |

![Adam](img/Adam.png)

![momentum](img/momentum.png)

### Analysis

Judging from the table and the loss curves, Momentum clearly outperforms the manual learning-rate schedule, while Adam does even worse than a constant learning rate. After checking my implementation against the algorithm in the paper I ruled out an implementation error, and found the following remark in [简单认识Adam](https://www.jianshu.com/p/aebcaf8af76e) ("Adam的缺陷与改进"):

> Although Adam has become a mainstream optimizer, in many areas (such as object recognition in computer vision and machine translation in NLP) the best results are still obtained with SGD with momentum. Wilson
et al. report that for object recognition, character-level language modeling, constituency parsing and similar tasks, adaptive learning-rate methods (including AdaGrad, AdaDelta, RMSProp and Adam) usually perform worse than Momentum.

According to this source, the handwritten-digit task in this assignment falls under object recognition, and the adaptive learning-rate method does indeed perform worse here. Adam's strength is that it works well on unstable objective functions; the takeaway is that the choice of optimizer should take the actual type of problem into account.
\ No newline at end of file
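Since the `Softmax` and `Matmul` derivations above are only available as screenshots, their final results are restated here in LaTeX — a compact restatement using the same notation as the other operators ($D_Y = \frac{\partial l}{\partial Y}$, $\odot$ element-wise product); it should agree with the screenshots and with the `backward` implementations in `numpy_fnn.py`.

For `Softmax`, with $Y = \mathrm{softmax}(X)$ applied row-wise,

$$
D_{X,ij} = Y_{ij}\left( D_{Y,ij} - \sum_{k} D_{Y,ik}\, Y_{ik} \right),
$$

or, in the matrix form used by the code, $D_X = Y \odot D_Y - Y \odot \big((D_Y \odot Y)\,\mathbf{1}\big)$, where $\mathbf{1}$ is the all-ones $c \times c$ matrix that forms row sums.

For `Matmul`, with $Y = XW$,

$$
D_X = D_Y W^{T}, \qquad D_W = X^{T} D_Y.
$$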
diff --git a/assignment-2/submission/18307130116/img/Adam.png b/assignment-2/submission/18307130116/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..76c571e3ea0c18e00faf75a5f078350cb86a1159
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Adam.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_1.png b/assignment-2/submission/18307130116/img/Figure_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..683414e2e126545f2a851da9a05be74eb5261b13
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_1.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_2.png b/assignment-2/submission/18307130116/img/Figure_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..bef71ab36ae8d83504f84243e3d64082b8fcab5d
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_2.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_3.png b/assignment-2/submission/18307130116/img/Figure_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..639051608449345a12b51083243e78dcfa6a4f70
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_3.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_4.png b/assignment-2/submission/18307130116/img/Figure_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe141456a1e96e256569cdcb37a87e2d4b6f0e6b
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_4.png differ
diff --git a/assignment-2/submission/18307130116/img/matmul.png b/assignment-2/submission/18307130116/img/matmul.png
new file mode 100644
index 0000000000000000000000000000000000000000..e3e6d769ef44203d80817a2928a5b1ea2a533e06
Binary files /dev/null and b/assignment-2/submission/18307130116/img/matmul.png differ
diff --git a/assignment-2/submission/18307130116/img/model.png b/assignment-2/submission/18307130116/img/model.png
new file mode 100644
index 0000000000000000000000000000000000000000..72c73828f7d70be8ea8d3f010b27bc7ada0a4139
Binary files /dev/null and b/assignment-2/submission/18307130116/img/model.png differ
diff --git a/assignment-2/submission/18307130116/img/momentum.png b/assignment-2/submission/18307130116/img/momentum.png
new file mode 100644
index 0000000000000000000000000000000000000000..b9b0b145e362898c6a6cf5f379fe0459abb9fa28
Binary files /dev/null and b/assignment-2/submission/18307130116/img/momentum.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax1.png b/assignment-2/submission/18307130116/img/softmax1.png
new file mode 100644
index 0000000000000000000000000000000000000000..56c1a6c77141e66a1970dc8d7d66d00c891a74d2
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax1.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax2.png b/assignment-2/submission/18307130116/img/softmax2.png
new file mode 100644
index 0000000000000000000000000000000000000000..277f06da303ed92389cc7620e89ee25bf5b1c7e1
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax2.png differ
diff --git a/assignment-2/submission/18307130116/numpy_fnn.py b/assignment-2/submission/18307130116/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..13397e1977d0b8bf530900861e08a2176816f780
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_fnn.py
@@ -0,0 +1,185 @@
import numpy as np


class NumpyOp:

    def __init__(self):
        self.memory = {}
        self.epsilon = 1e-12


class Matmul(NumpyOp):

    def forward(self, x, W):
        """
        x: shape(N, d)
        W: shape(d, d')
        """
        self.memory['x'] = x
        self.memory['W'] = W
        h = np.matmul(x, W)
        return h

    def backward(self, grad_y):
        """
        grad_y: shape(N, d')
        """

        ####################
        #      code 1      #
        ####################
        # dL/dx = dL/dy @ W^T, dL/dW = x^T @ dL/dy
        grad_x = np.matmul(grad_y, self.memory['W'].T)
        grad_W = np.matmul(self.memory['x'].T, grad_y)

        return grad_x, grad_W


class Relu(NumpyOp):

    def forward(self, x):
        self.memory['x'] = x
        return np.where(x > 0, x, np.zeros_like(x))

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """

        ####################
        #      code 2      #
        ####################
        # pass the gradient through only where the input was positive
        grad_x = np.where(self.memory['x'] > 0, grad_y, np.zeros_like(self.memory['x']))
        return grad_x


class Log(NumpyOp):

    def forward(self, x):
        """
        x: shape(N, c)
        """

        out = np.log(x + self.epsilon)
        self.memory['x'] = x

        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """

        ####################
        #      code 3      #
        ####################
        # d log(x + eps) / dx = 1 / (x + eps), applied element-wise
        grad_x = (1 / (self.memory['x'] + self.epsilon)) * grad_y

        return grad_x


class Softmax(NumpyOp):
    """
    softmax over last dimension
    """

    def forward(self, x):
        """
        x: shape(N, c)
        """
        self.memory['x'] = x
        ####################
        #      code 4      #
        ####################
        # multiplying exp(x) by the c x c all-ones matrix puts the row sum in
        # every column, so h holds 1 / sum_j exp(x_ij) broadcast over each row
        exp = np.exp(self.memory['x'])
        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
        h = 1. / np.matmul(exp, one)
        out = h * exp
        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """

        ####################
        #      code 5      #
        ####################
        # y = exp * h is the softmax output; the two terms correspond to
        # g ⊙ y and -y ⊙ (row sum of g ⊙ y) from the derivation in the README
        exp = np.exp(self.memory['x'])
        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
        h = 1. / np.matmul(exp, one)
        h_grad = -h * h
        grad_x = grad_y * exp * h + np.matmul(grad_y * exp * h_grad, one) * exp
        return grad_x


class NumpyLoss:

    def __init__(self):
        self.target = None

    def get_loss(self, pred, target):
        self.target = target
        return (-pred * target).sum(axis=1).mean()

    def backward(self):
        return -self.target / self.target.shape[0]


class NumpyModel:
    def __init__(self):
        self.W1 = np.random.normal(size=(28 * 28, 256))
        self.W2 = np.random.normal(size=(256, 64))
        self.W3 = np.random.normal(size=(64, 10))

        # the following operators are used in forward and backward
        self.matmul_1 = Matmul()
        self.relu_1 = Relu()
        self.matmul_2 = Matmul()
        self.relu_2 = Relu()
        self.matmul_3 = Matmul()
        self.softmax = Softmax()
        self.log = Log()

        # the following gradients are filled in during backward
        self.x1_grad, self.W1_grad = None, None
        self.relu_1_grad = None
        self.x2_grad, self.W2_grad = None, None
        self.relu_2_grad = None
        self.x3_grad, self.W3_grad = None, None
        self.softmax_grad = None
        self.log_grad = None

    def forward(self, x):
        x = x.reshape(-1, 28 * 28)

        ####################
        #      code 6      #
        ####################
        # log(softmax(W3 σ(W2 σ(W1 x))))
        x = self.matmul_1.forward(x, self.W1)
        x = self.relu_1.forward(x)
        x = self.matmul_2.forward(x, self.W2)
        x = self.relu_2.forward(x)
        x = self.matmul_3.forward(x, self.W3)
        x = self.softmax.forward(x)
        x = self.log.forward(x)
        return x

    def backward(self, y):
        ####################
        #      code 7      #
        ####################
        # propagate the gradient back through the operators in reverse order,
        # starting from the loss gradient y
        self.log_grad = self.log.backward(y)
        self.softmax_grad = self.softmax.backward(self.log_grad)
        self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
        self.relu_2_grad = self.relu_2.backward(self.x3_grad)
        self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
        self.relu_1_grad = self.relu_1.backward(self.x2_grad)
        self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)

    def optimize(self, learning_rate):
        self.W1 -= learning_rate * self.W1_grad
        self.W2 -= learning_rate * self.W2_grad
        self.W3 -= learning_rate * self.W3_grad
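A minimal finite-difference check of the hand-derived backward passes above — this is only a sketch and not part of the submission; it assumes it is run in the same directory as `numpy_fnn.py` so the operator classes can be imported.

```python
import numpy as np

from numpy_fnn import Log, Relu, Softmax


def numeric_grad(f, x, eps=1e-6):
    """Central-difference gradient of the scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = f(x)
        x[idx] = old - eps
        minus = f(x)
        x[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


def check(op, x):
    y = op.forward(x)
    g = np.random.randn(*y.shape)              # random upstream gradient dL/dy
    analytic = op.backward(g)                  # hand-derived dL/dx
    numeric = numeric_grad(lambda z: (op.forward(z) * g).sum(), x.copy())
    return np.max(np.abs(analytic - numeric))


if __name__ == "__main__":
    np.random.seed(0)
    x = np.random.rand(4, 10) + 0.5            # positive inputs keep Log well-behaved
    for op in (Relu(), Log(), Softmax()):
        print(type(op).__name__, "max abs error:", check(op, x.copy()))
    # Matmul.backward can be checked the same way, once for grad_x and once for grad_W
```

The reported errors should be on the order of 1e-8 or smaller if the analytic gradients match the finite-difference estimates.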
diff --git a/assignment-2/submission/18307130116/numpy_mnist.py b/assignment-2/submission/18307130116/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc5fdaa3b169f4a5ec77458993318b1b875ac400
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_mnist.py
@@ -0,0 +1,97 @@
import numpy as np
from numpy_fnn import NumpyModel, NumpyLoss
from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot


def mini_batch(dataset, batch_size=128, numpy=False, drop_last=False):
    # the numpy flag is kept for interface compatibility and is not used here;
    # collect the whole dataset into numpy arrays, then shuffle with a random permutation
    data = []
    label = []
    dataset_num = len(dataset)
    idx = np.arange(dataset_num)
    np.random.shuffle(idx)
    for each in dataset:
        data.append(each[0].numpy())
        label.append(each[1])
    label_numpy = np.array(label)[idx]
    data_numpy = np.array(data)[idx]

    # split into batches; unless drop_last is set, keep the (smaller) final batch
    result = []
    num_full = dataset_num // batch_size
    for i in range(num_full):
        result.append((data_numpy[i * batch_size:(i + 1) * batch_size],
                       label_numpy[i * batch_size:(i + 1) * batch_size]))
    if not drop_last and dataset_num % batch_size != 0:
        result.append((data_numpy[num_full * batch_size:dataset_num],
                       label_numpy[num_full * batch_size:dataset_num]))
    return result


class Adam:
    def __init__(self, weight, lr=0.0015, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.theta = weight
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = 0
        self.v = 0
        self.t = 0

    def optimize(self, grad):
        # bias-corrected first and second moment estimates, then
        # theta -= lr * m_hat / (sqrt(v_hat) + eps)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        self.theta -= self.lr * m_hat / (v_hat ** 0.5 + self.epsilon)
        return self.theta


class Momentum:
    def __init__(self, lr=0.1, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.v = 0

    def optimize(self, weight, grad):
        # exponentially weighted moving average of the gradient
        self.v = self.beta * self.v + (1 - self.beta) * grad
        weight -= self.lr * self.v
        return weight


def numpy_run():
    train_dataset, test_dataset = download_mnist()

    model = NumpyModel()
    numpy_loss = NumpyLoss()
    model.W1, model.W2, model.W3 = get_torch_initialization()

    # one optimizer instance per parameter, since numpy_fnn.py must stay unchanged
    W1_opt = Momentum()
    W2_opt = Momentum()
    W3_opt = Momentum()

    train_loss = []

    epoch_number = 20

    for epoch in range(epoch_number):
        for x, y in mini_batch(train_dataset):
            y = one_hot(y)

            y_pred = model.forward(x)
            loss = numpy_loss.get_loss(y_pred, y)

            model.backward(numpy_loss.backward())

            # two-stage learning-rate schedule used in the earlier experiments:
            # if epoch >= 10:
            #     learning_rate = 0.05
            # else:
            #     learning_rate = 0.1
            # model.optimize(learning_rate)

            model.W1 = W1_opt.optimize(model.W1, model.W1_grad)
            model.W2 = W2_opt.optimize(model.W2, model.W2_grad)
            model.W3 = W3_opt.optimize(model.W3, model.W3_grad)

            train_loss.append(loss.item())

        x, y = batch(test_dataset)[0]
        accuracy = np.mean(model.forward(x).argmax(axis=1) == y)
        print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))

    plot_curve(train_loss)


if __name__ == "__main__":
    numpy_run()
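As mentioned in the Implementation section of the README, here is a minimal sketch of how the `Adam` class above would be wired in place of `Momentum`. It is not part of the submission; it assumes the same directory layout so that `numpy_fnn`, `numpy_mnist`, and `utils` are importable, and uses lr = 0.001 to match the Adam column of the comparison table. Note the interface difference: `Adam` receives the parameter array at construction time and `optimize()` only takes the gradient, while `Momentum.optimize(weight, grad)` takes both.

```python
import numpy as np

from numpy_fnn import NumpyLoss, NumpyModel
from numpy_mnist import Adam
from utils import get_torch_initialization

# Build the model as in numpy_run(), but create one Adam instance per weight.
model = NumpyModel()
numpy_loss = NumpyLoss()
model.W1, model.W2, model.W3 = get_torch_initialization()

W1_opt = Adam(model.W1, lr=0.001)
W2_opt = Adam(model.W2, lr=0.001)
W3_opt = Adam(model.W3, lr=0.001)

# One dummy step on random data, just to show the call pattern; in numpy_run()
# these three lines replace the Momentum updates inside the training loop.
x = np.random.rand(8, 1, 28, 28)
labels = np.random.randint(0, 10, size=8)
y = np.zeros((8, 10))
y[np.arange(8), labels] = 1.0          # hand-built one-hot targets for the demo

loss = numpy_loss.get_loss(model.forward(x), y)
model.backward(numpy_loss.backward())
model.W1 = W1_opt.optimize(model.W1_grad)
model.W2 = W2_opt.optimize(model.W2_grad)
model.W3 = W3_opt.optimize(model.W3_grad)
print("one Adam step done, loss was {:.4f}".format(loss))
```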