diff --git a/assignment-2/submission/18307130116/README.md b/assignment-2/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60d6a7aaf412e4f028a1124ff7cc63b243e2c2d7
--- /dev/null
+++ b/assignment-2/submission/18307130116/README.md
@@ -0,0 +1,160 @@
+# FNN Implementation
+
+[toc]
+
+## Model Implementation
+
+The implementation of each operator follows the [operator derivative derivations](#operator-derivative-derivations); the network structure is shown in the figure below.
+
+
+
+Following the model in the figure above, the operators are chained together in order, and during backpropagation the gradients are passed back layer by layer starting from the loss. There is nothing particularly tricky here; the resulting model computes the function
+
+$\log(\mathrm{softmax}(W_3\sigma(W_2\sigma(W_1X))))$
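+
+This is exactly the chain assembled in `NumpyModel.forward` (see `numpy_fnn.py` in this submission), condensed here for reference:
+
+```python
+x = x.reshape(-1, 28 * 28)              # flatten the 28x28 images
+x = self.matmul_1.forward(x, self.W1)   # W1: (784, 256)
+x = self.relu_1.forward(x)
+x = self.matmul_2.forward(x, self.W2)   # W2: (256, 64)
+x = self.relu_2.forward(x)
+x = self.matmul_3.forward(x, self.W3)   # W3: (64, 10)
+x = self.softmax.forward(x)
+x = self.log.forward(x)
+```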
+
+## Model Training
+
+Running `numpy_mnist.py` with the implementation above for three epochs, the accuracy and loss evolve as follows:
+
+| epoch | Accuracy |
+| ----- | -------- |
+| 0 | 94.49% |
+| 1 | 96.47% |
+| 2 | 96.58% |
+
+
+
+### Effect of the Learning Rate and the Number of Epochs
+
+I observed that once the loss drops into a certain range it starts to oscillate, presumably because the learning rate is too large near the optimum. To obtain better performance, I lowered the learning rate and increased the number of epochs; as a comparison, I also ran an experiment that only increases the number of epochs without changing the learning rate. The results are shown below, where the row labeled $i$ reports the median accuracy over epochs $[(i-1)\cdot 5,\ i\cdot 5)$ and the row labeled 20 is the final result.
+
+| epoch | Accuracy(learning_rate = 0.1) | Accuracy(learning_rate = 0.05) | Accuracy(learning_rate = 0.1+0.05) |
+| ----- | ----------------------------- | ------------------------------ | ---------------------------------- |
+| 0 | 97.27% | 95.85% | 96.59% |
+| 5 | 97.93% | 97.85% | 97.91% |
+| 10 | 98.03% | 98.03% | 98.18% |
+| 15 | 98.12% | 98.09% | 98.18% |
+| 20 | 98.12% | 98.19% | 98.18% |
+
+
+
+
+
+ In order: lr = 0.1, lr = 0.05, lr = 0.1+0.05
+
+
+As expected, lowering the learning rate slows convergence: over epochs 0-5, lr = 0.1 already reaches 97.27% while lr = 0.05 is still at 95.85%. Looking at the final results, the smaller learning rate also converges more slowly overall; although it happens to hit a high accuracy at epoch 20, its median over epochs 15-20 is still below that of lr = 0.1, presumably because the learning rate is too small for the model to have fully converged by epoch 20.
+
+Furthermore, the model has essentially converged by epoch 10. Weighing the faster convergence of lr = 0.1 against the smaller step size of lr = 0.05 (which is more likely to settle into a good optimum), I made a simple trade-off: a learning rate of 0.1 for the first 10 epochs and 0.05 for the last 10, which speeds up convergence while reducing oscillation around the optimum; the schedule is sketched below. The outcome matches expectations: the accuracy over epochs 15-20 improves by 0.06 percentage points, and the loss curve shows the expected reduction in oscillation around step = 6000.
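+
+The schedule itself is a one-liner inside the training loop (the same logic appears, commented out, in `numpy_mnist.py`):
+
+```python
+# lr = 0.1 for the first 10 epochs, then 0.05 (the "0.1+0.05" setting)
+learning_rate = 0.1 if epoch < 10 else 0.05
+model.optimize(learning_rate)
+```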
+
+In real training there is a whole family of scheduling methods that adjust the learning rate dynamically based on the gradients; this experiment is only a simplified version of that, but it still confirms the importance of learning-rate scheduling.
+
+On the other hand, increasing the number of epochs also clearly improves the final performance of the model and lets it converge more fully, which matches expectations.
+
+## `mini_batch` Implementation
+
+The original `mini_batch` simply wrapped PyTorch's DataLoader: in essence, given a `batch_size`, it returns the data split into batches of that size. To reproduce this logic with numpy, the reimplemented version first stores the entire dataset in a list, and the `shuffle` option of the original function is reproduced by shuffling an index array with numpy.
+
+Since the DataLoader's `drop_last` parameter defaults to False, the `mini_batch` implementation does the same: if the dataset size is not an integer multiple of `batch_size` and `drop_last` is False, the leftover samples are appended as a final, smaller batch.
+
+The function returns a list of `num` (data, label) pairs, where each data batch is a numpy array of shape [batch_size, 1, 28, 28], each label batch has shape [batch_size], and `num` is the number of batches.
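+
+A minimal usage sketch, mirroring how `numpy_run` in `numpy_mnist.py` consumes the batches:
+
+```python
+# one epoch over the training set, as in numpy_run()
+for x, y in mini_batch(train_dataset, batch_size=128):
+    y = one_hot(y)              # x: (batch_size, 1, 28, 28), y: (batch_size, 10)
+    y_pred = model.forward(x)   # y_pred: (batch_size, 10) log-probabilities
+```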
+
+## Operator Derivative Derivations
+
+Throughout this section the derivatives are derived with matrix-calculus techniques: the problem is treated as differentiating a scalar with respect to a matrix through a composition of functions, and the results follow from the properties of matrix differentials together with the trace trick. Concretely, given
+$$
+dl = tr\left(\frac{\partial l}{\partial Y}^{T} dY\right),\qquad Y=\sigma(X),
+$$
+the goal is to rewrite it in the form
+$$
+dl = tr\left(\frac{\partial l}{\partial X}^{T} dX\right),
+$$
+so that $\frac{\partial l}{\partial X}$ can be read off directly.
+The derivation for `softmax` is the most involved, so its method and reasoning are presented in detail; the other operators mostly rely on properties already covered by the `softmax` derivation, so those steps are omitted.
+
+### Softmax
+
+(Because of gitee's limited formula support, the derivation below is provided as screenshots.)
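+
+For reference, the end result of that derivation (it matches `Softmax.backward` in `numpy_fnn.py`): for a row-wise softmax $Y_{ij} = e^{X_{ij}}/\sum_k e^{X_{ik}}$,
+
+$$
+\frac{\partial l}{\partial X_{ij}} = Y_{ij}\left(\frac{\partial l}{\partial Y_{ij}} - \sum_k \frac{\partial l}{\partial Y_{ik}}\, Y_{ik}\right)
+$$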
+
+
+
+
+
+### Log
+
+$dl = tr(\frac{\partial l}{\partial Y}^T dY),\quad Y=\log(X+\epsilon)$ (writing $D_Y = \frac{\partial l}{\partial Y}$ below)
+
+$dY = d\log(X+\epsilon) = \log'(X+\epsilon)\odot dX = \frac{1}{X+\epsilon}\odot dX$
+
+$dl = tr(D_Y^T(\frac{1}{X+\epsilon}\odot dX)) = tr((D_Y\odot \frac{1}{X+\epsilon})^T dX)$
+
+$D_X = D_Y\odot \frac{1}{X+\epsilon}$
+
+### Relu
+
+$dl = tr(\frac{\partial l}{\partial Y}^T dY),Y=h(X)$
+
+$h'(x_{ij}) = 1,\quad x_{ij} \geq 0$
+
+$h'(x_{ij}) = 0,\quad x_{ij} < 0$
+
+The rest of the derivation is the same as for log: $D_X = D_Y\odot h'(X)$
+
+### Matmul
+
+(Because of gitee's limited formula support, the derivation here is likewise provided as screenshots.)
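+
+For reference, the end result for $Y = XW$ (it matches `Matmul.backward` in `numpy_fnn.py`):
+
+$$
+D_X = D_Y W^{T},\qquad D_W = X^{T} D_Y
+$$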
+
+
+
+
+
+## Optimizers
+
+### How Adam Works
+
+Similar in spirit to the manual learning-rate adjustment in the experiments above, the Adam optimizer, which is widely used in practice, shines at adapting the learning rate automatically and has essentially become the default optimizer for many model optimization problems; that said, the choice of the initial learning rate still affects the optimization process.
+
+The core Adam update is $\theta_t = \theta_{t-1}-\alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$, where $\hat m_t$ estimates the first moment of the gradients with an exponential moving average, and division by $(1-\beta_1^t)$, with $\beta_1$ a hyperparameter, removes the bias caused by initializing the estimate to 0. With $g_t$ denoting the gradient:
+
+$m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t$, $\hat m_t = m_t/(1-\beta_1^t)$
+
+The second moment is estimated analogously: $v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2$, $\hat v_t = v_t/(1-\beta_2^t)$.
+
+The $\epsilon$ term prevents the divisor from becoming 0.
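+
+Putting the formulas together, a minimal functional sketch of a single Adam step (the stateful class actually used in the experiments is in `numpy_mnist.py`; `adam_step` is just an illustrative name):
+
+```python
+import numpy as np
+
+def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    # Exponential moving averages of the gradient and the squared gradient
+    m = beta1 * m + (1 - beta1) * grad
+    v = beta2 * v + (1 - beta2) * grad * grad
+    # Bias correction for the zero initialization of m and v
+    m_hat = m / (1 - beta1 ** t)
+    v_hat = v / (1 - beta2 ** t)
+    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
+    return theta, m, v
+```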
+
+### How Momentum Works
+
+The idea behind the Momentum optimizer is similar to Adam's, except that it does not scale the learning rate by the second-moment (standard-deviation) term. It also uses an exponentially weighted moving average, giving the current gradient a relatively small weight, which smooths out the oscillation of the gradients near an optimum and lets the parameters get closer to it.
+
+The update rule is:
+
+$v_t = \beta v_{t-1}+(1-\beta)dW$
+
+$W = W - \alpha v_t$
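+
+In code, one Momentum step is just as short (again a sketch; the class version used for training is in `numpy_mnist.py`, and `momentum_step` is an illustrative name):
+
+```python
+def momentum_step(weight, grad, v, lr=0.1, beta=0.9):
+    # Exponentially weighted moving average of the gradient, then a plain step
+    v = beta * v + (1 - beta) * grad
+    weight = weight - lr * v
+    return weight, v
+```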
+
+### Implementation
+
+Based on the formulas above, I implemented an `Adam` class and a `Momentum` class in `numpy_mnist.py`. Since `numpy_fnn.py` must not be modified, the overall approach is to create one optimizer instance per parameter, keep the state from the previous iteration in internal variables, and return the newly computed parameter. For example, `Momentum` is used as follows:
+
+`model.W1 = W1_opt.optimize(model.W1, model.W1_grad)`
+
+that is, the new weight is computed and then assigned back to the model.
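+
+The full pattern in `numpy_run` (one optimizer instance per weight matrix, applied after every backward pass), condensed from `numpy_mnist.py`:
+
+```python
+W1_opt, W2_opt, W3_opt = Momentum(), Momentum(), Momentum()
+
+# inside the training loop, after model.backward(...)
+model.W1 = W1_opt.optimize(model.W1, model.W1_grad)
+model.W2 = W2_opt.optimize(model.W2, model.W2_grad)
+model.W3 = W3_opt.optimize(model.W3, model.W3_grad)
+```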
+
+### Experimental Comparison
+
+The two optimizers are compared against the best result obtained earlier, the `lr` = 0.1+0.05 schedule; the loss and accuracy evolve as follows:
+
+| epoch | Accuracy(learning_rate = 0.1+0.05) | Accuracy(Adam, $\alpha = 0.001$) | Accuracy(Momentum,$\alpha = 0.1$) |
+| ----- | ---------------------------------- | ---------------------------------- | --------------------------------- |
+| 0 | 96.59% | 97.46% | 97.01% |
+| 5 | 97.91% | 97.69% | 97.95% |
+| 10 | 98.18% | 97.80% | 98.07% |
+| 15 | 98.18% | 97.98% | 98.22% |
+| 20 | 98.18% | 98.04% | 98.36% |
+
+
+
+### Analysis
+
+Judging from the table and the loss curves, Momentum clearly outperforms the manual learning-rate schedule, while Adam performs even worse than a constant learning rate. After checking the algorithm against the original paper I ruled out an implementation error, and while searching for related material I came across the following passage:
+
+[简单认识Adam]: https://www.jianshu.com/p/aebcaf8af76e "Adam的缺陷与改进"
+
+Although Adam has become a mainstream optimization algorithm, in many areas (such as object recognition in computer vision and machine translation in NLP) the best results are still obtained with SGD with momentum. The results of Wilson et al. show that for object recognition, character-level language modeling, constituency parsing, and similar tasks, adaptive learning-rate methods (including AdaGrad, AdaDelta, RMSProp, Adam, etc.) generally perform worse than Momentum.
+
+According to this source, the handwritten-digit recognition in this experiment should be classified as object recognition, where adaptive learning-rate methods do indeed perform worse. Adam's strength is that it works well on unstable objective functions. The takeaway is that the choice of optimizer should be weighed against the actual type of problem.
\ No newline at end of file
diff --git a/assignment-2/submission/18307130116/img/Adam.png b/assignment-2/submission/18307130116/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..76c571e3ea0c18e00faf75a5f078350cb86a1159
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Adam.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_1.png b/assignment-2/submission/18307130116/img/Figure_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..683414e2e126545f2a851da9a05be74eb5261b13
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_1.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_2.png b/assignment-2/submission/18307130116/img/Figure_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..bef71ab36ae8d83504f84243e3d64082b8fcab5d
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_2.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_3.png b/assignment-2/submission/18307130116/img/Figure_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..639051608449345a12b51083243e78dcfa6a4f70
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_3.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_4.png b/assignment-2/submission/18307130116/img/Figure_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe141456a1e96e256569cdcb37a87e2d4b6f0e6b
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_4.png differ
diff --git a/assignment-2/submission/18307130116/img/matmul.png b/assignment-2/submission/18307130116/img/matmul.png
new file mode 100644
index 0000000000000000000000000000000000000000..e3e6d769ef44203d80817a2928a5b1ea2a533e06
Binary files /dev/null and b/assignment-2/submission/18307130116/img/matmul.png differ
diff --git a/assignment-2/submission/18307130116/img/model.png b/assignment-2/submission/18307130116/img/model.png
new file mode 100644
index 0000000000000000000000000000000000000000..72c73828f7d70be8ea8d3f010b27bc7ada0a4139
Binary files /dev/null and b/assignment-2/submission/18307130116/img/model.png differ
diff --git a/assignment-2/submission/18307130116/img/momentum.png b/assignment-2/submission/18307130116/img/momentum.png
new file mode 100644
index 0000000000000000000000000000000000000000..b9b0b145e362898c6a6cf5f379fe0459abb9fa28
Binary files /dev/null and b/assignment-2/submission/18307130116/img/momentum.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax1.png b/assignment-2/submission/18307130116/img/softmax1.png
new file mode 100644
index 0000000000000000000000000000000000000000..56c1a6c77141e66a1970dc8d7d66d00c891a74d2
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax1.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax2.png b/assignment-2/submission/18307130116/img/softmax2.png
new file mode 100644
index 0000000000000000000000000000000000000000..277f06da303ed92389cc7620e89ee25bf5b1c7e1
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax2.png differ
diff --git a/assignment-2/submission/18307130116/numpy_fnn.py b/assignment-2/submission/18307130116/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..13397e1977d0b8bf530900861e08a2176816f780
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_fnn.py
@@ -0,0 +1,185 @@
+import numpy as np
+
+
+class NumpyOp:
+
+ def __init__(self):
+ self.memory = {}
+ self.epsilon = 1e-12
+
+
+class Matmul(NumpyOp):
+
+ def forward(self, x, W):
+ """
+ x: shape(N, d)
+ w: shape(d, d')
+ """
+ self.memory['x'] = x
+ self.memory['W'] = W
+ h = np.matmul(x, W)
+ return h
+
+ def backward(self, grad_y):
+ """
+ grad_y: shape(N, d')
+ """
+
+ ####################
+ # code 1 #
+ ####################
+        # dL/dx = dL/dy @ W^T; dL/dW = x^T @ dL/dy
+        grad_x = np.matmul(grad_y, self.memory['W'].T)
+ grad_W = np.matmul(self.memory['x'].T, grad_y)
+
+ return grad_x, grad_W
+
+
+class Relu(NumpyOp):
+
+ def forward(self, x):
+ self.memory['x'] = x
+ return np.where(x > 0, x, np.zeros_like(x))
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 2 #
+ ####################
+ grad_x = np.where(self.memory['x'] > 0, grad_y, np.zeros_like(self.memory['x']))
+ return grad_x
+
+
+class Log(NumpyOp):
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+
+ out = np.log(x + self.epsilon)
+ self.memory['x'] = x
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 3 #
+ ####################
+        grad_x = (1 / (self.memory['x'] + self.epsilon)) * grad_y
+
+ return grad_x
+
+class Softmax(NumpyOp):
+ """
+ softmax over last dimension
+ """
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+ self.memory['x'] = x
+ ####################
+ # code 4 #
+ ####################
+        # Row-wise softmax: exp(x) divided by its row sum; multiplying exp by a
+        # ones matrix broadcasts each row sum across the columns
+        exp = np.exp(self.memory['x'])
+        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
+        h = 1. / np.matmul(exp, one)
+        out = h * exp
+        return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 5 #
+ ####################
+        # Recompute the softmax pieces: y = exp * h with h = 1 / row_sum(exp)
+        exp = np.exp(self.memory['x'])
+        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
+        h = 1. / np.matmul(exp, one)
+        h_grad = -h * h
+        # dL/dx_j = y_j * (g_j - sum_k g_k * y_k), split into the two terms below
+        grad_x = grad_y * exp * h + np.matmul(grad_y * exp * h_grad, one) * exp
+        return grad_x
+
+
+class NumpyLoss:
+
+ def __init__(self):
+ self.target = None
+
+ def get_loss(self, pred, target):
+ self.target = target
+ return (-pred * target).sum(axis=1).mean()
+
+ def backward(self):
+ return -self.target / self.target.shape[0]
+
+
+class NumpyModel:
+ def __init__(self):
+ self.W1 = np.random.normal(size=(28 * 28, 256))
+ self.W2 = np.random.normal(size=(256, 64))
+ self.W3 = np.random.normal(size=(64, 10))
+
+
+        # The following operators are used in forward and backward
+ self.matmul_1 = Matmul()
+ self.relu_1 = Relu()
+ self.matmul_2 = Matmul()
+ self.relu_2 = Relu()
+ self.matmul_3 = Matmul()
+ self.softmax = Softmax()
+ self.log = Log()
+
+        # The following variables are updated during backward
+ self.x1_grad, self.W1_grad = None, None
+ self.relu_1_grad = None
+ self.x2_grad, self.W2_grad = None, None
+ self.relu_2_grad = None
+ self.x3_grad, self.W3_grad = None, None
+ self.softmax_grad = None
+ self.log_grad = None
+
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+
+ ####################
+ # code 6 #
+ ####################
+ x = self.matmul_1.forward(x, self.W1)
+ x = self.relu_1.forward(x)
+ x = self.matmul_2.forward(x, self.W2)
+ x = self.relu_2.forward(x)
+ x = self.matmul_3.forward(x, self.W3)
+ x = self.softmax.forward(x)
+ x = self.log.forward(x)
+ return x
+
+ def backward(self, y):
+ ####################
+ # code 7 #
+ ####################
+ self.log_grad = self.log.backward(y)
+ self.softmax_grad = self.softmax.backward(self.log_grad)
+ self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+ self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+ self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+ self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+ self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+
+
+ def optimize(self, learning_rate):
+ self.W1 -= learning_rate * self.W1_grad
+ self.W2 -= learning_rate * self.W2_grad
+ self.W3 -= learning_rate * self.W3_grad
diff --git a/assignment-2/submission/18307130116/numpy_mnist.py b/assignment-2/submission/18307130116/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc5fdaa3b169f4a5ec77458993318b1b875ac400
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_mnist.py
@@ -0,0 +1,97 @@
+import numpy as np
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot
+
+def mini_batch(dataset, batch_size=128, numpy=False, drop_last=False):
+    # Load the whole dataset into numpy arrays and shuffle via a random permutation
+    data = []
+    label = []
+    dataset_num = len(dataset)
+    idx = np.arange(dataset_num)
+    np.random.shuffle(idx)
+    for each in dataset:
+        data.append(each[0].numpy())
+        label.append(each[1])
+    label_numpy = np.array(label)[idx]
+    data_numpy = np.array(data)[idx]
+
+    # Split into batches; unless drop_last is True, keep the smaller trailing batch
+    result = []
+    num_batches = dataset_num // batch_size
+    for i in range(num_batches):
+        result.append((data_numpy[i * batch_size:(i + 1) * batch_size],
+                       label_numpy[i * batch_size:(i + 1) * batch_size]))
+    if not drop_last and dataset_num % batch_size != 0:
+        result.append((data_numpy[num_batches * batch_size:],
+                       label_numpy[num_batches * batch_size:]))
+    return result
+
+class Adam:
+    def __init__(self, weight, lr=0.0015, beta1=0.9, beta2=0.999, epsilon=1e-8):
+        self.theta = weight
+        self.lr = lr
+        self.beta1 = beta1
+        self.beta2 = beta2
+        self.epsilon = epsilon
+        self.m = 0  # first-moment estimate
+        self.v = 0  # second-moment estimate
+        self.t = 0  # time step
+
+    def optimize(self, grad):
+        self.t += 1
+        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
+        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
+        # Bias-corrected moment estimates
+        m_hat = self.m / (1 - self.beta1 ** self.t)
+        v_hat = self.v / (1 - self.beta2 ** self.t)
+        self.theta -= self.lr * m_hat / (v_hat ** 0.5 + self.epsilon)
+        return self.theta
+
+class Momentum:
+ def __init__(self, lr=0.1, beta=0.9):
+ self.lr = lr
+ self.beta = beta
+        self.v = 0  # exponentially weighted moving average of the gradient
+
+    def optimize(self, weight, grad):
+        self.v = self.beta * self.v + (1 - self.beta) * grad
+        weight -= self.lr * self.v
+ return weight
+
+def numpy_run():
+ train_dataset, test_dataset = download_mnist()
+
+ model = NumpyModel()
+ numpy_loss = NumpyLoss()
+ model.W1, model.W2, model.W3 = get_torch_initialization()
+ W1_opt = Momentum()
+ W2_opt = Momentum()
+ W3_opt = Momentum()
+
+
+ train_loss = []
+
+ epoch_number = 20
+
+ for epoch in range(epoch_number):
+ for x, y in mini_batch(train_dataset):
+ y = one_hot(y)
+
+ y_pred = model.forward(x)
+ loss = numpy_loss.get_loss(y_pred, y)
+
+ model.backward(numpy_loss.backward())
+ # if epoch >= 10:
+ # learning_rate = 0.05
+ # else:
+ # learning_rate = 0.1
+ # model.optimize(learning_rate)
+ model.W1 = W1_opt.optimize(model.W1, model.W1_grad)
+ model.W2 = W2_opt.optimize(model.W2, model.W2_grad)
+ model.W3 = W3_opt.optimize(model.W3, model.W3_grad)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset)[0]
+        accuracy = np.mean(model.forward(x).argmax(axis=1) == y)
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ numpy_run()