diff --git a/assignment-2/submission/17307130331/README.md b/assignment-2/submission/17307130331/README.md
deleted file mode 100644
index abd8de5834bacc838e1b813905da469a8d9168c3..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/README.md
+++ /dev/null
@@ -1,343 +0,0 @@
-# Lab Report
-
-陈疏桐   17307130331
-
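-In this lab I implemented the forward and backward passes of four operators (Matmul, Log, Softmax and Relu) in numpy, built a classification model from them, and passed the automated tests. I also implemented the `mini_batch` function, trained and tested on the MNIST dataset with different learning rates and batch sizes, and discuss how the learning rate and batch size affect training. Finally, I implemented three optimization methods (Momentum, RMSProp and Adam) and compared them with plain gradient descent.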
-
-## Backpropagation and Implementation of the Operators
-### Matmul
-
-Matmul is matrix multiplication. In the model it plays the role of a PyTorch linear layer (without a bias term). The forward pass is:
-
-$$ \mathrm{Y} = \mathrm{X}\mathrm{W} $$
-
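-Here $\mathrm{X}$ is the $N \times d$ input matrix, $\mathrm{W}$ is a $d \times d'$ weight matrix, and $\mathrm{Y}$ is the $N\times d'$ output matrix. The Matmul operator is equivalent to a fully connected linear layer with input dimension $d$ and output dimension $d'$.
-
-Taking the partial derivative of Matmul with respect to each of its inputs gives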
-
-$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{X}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{X}} = \mathrm{W}^T$$
-
-$$ \frac{\partial \mathrm{Y}}{\partial \mathrm{W}} = \frac{\partial \mathrm{X}\mathrm{W}}{\partial \mathrm{W}} = \mathrm{X}^T $$
-
-By the chain rule, the backward pass is then:
-
-$$ \triangledown{\mathrm{X}} = \triangledown{\mathrm{Y}} \times \mathrm{W}^T $$
-$$ \triangledown{\mathrm{W}} = \mathrm{X}^T \times \triangledown{\mathrm{Y}} $$
-
-### Relu 
-
-Relu is applied to each element of the input:
-
-$$ \mathrm{Y}_{ij}=
-\begin{cases}
-\mathrm{X}_{ij} & \mathrm{X}_{ij} > 0 \\\\
-0 & \text{otherwise}
-\end{cases} 
-$$
-
-
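-Each output $\mathrm{Y}_{ij}$ depends only on the input $\mathrm{X}_{ij}$, so the derivative with respect to each element of $\mathrm{X}$ involves only the corresponding output. Matching the implementation, the derivative at $\mathrm{X}_{ij} = 0$ is taken to be 0:
-
-$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} = 
-\begin{cases}
-1 & \mathrm{X}_{ij} > 0 \\\\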
-0 & \text{otherwise}
-\end{cases}$$ 
-
-Therefore, by the chain rule, the gradient of the input is:
-
-$$ \triangledown{\mathrm{X}_{ij}} = \triangledown{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
-
-### Log
-
-The Log function and its derivative are:
-
-$$ \mathrm{Y}_{ij} = \log(\mathrm{X}_{ij} + \epsilon)$$
-
-$$ \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}} = \frac{1}{(\mathrm{X}_{ij} + \epsilon)} $$
-
-Similarly, the backward pass is:
-
-$$ \triangledown{\mathrm{X}_{ij}} = \triangledown{\mathrm{Y}_{ij}} \times \frac{\partial \mathrm{Y}_{ij}}{\partial \mathrm{X}_{ij}}$$
-
-### Softmax
-
-Softmax is computed over the last dimension of the input $\mathrm{X}$. The forward pass is:
-
-$$ \mathrm{Y}_{ij} = \frac{\exp(\mathrm{X}_{ij})}{\sum_{k} \exp(\mathrm{X}_{ik})}$$
-
-The formula shows that each output row of Softmax is computed independently of the other input rows, while within a row every output depends on every input element. Taking row $k$ as an example, the derivative of an output element with respect to an input element is:
-
-$$\frac{\partial Y_{ki}}{\partial X_{kj}} = \begin{cases}
-\frac{\exp(X_{kj}) \times (\sum_{t \ne j}{\exp(X_{kt})}) }{(\sum_{t}{\exp(X_{kt})})^2} = Y_{kj}(1-Y_{kj}) & i = j \\\\
--\frac{\exp(X_{ki})\exp(X_{kj})}{(\sum_t\exp(X_{kt}))^2}=-Y_{ki} \times Y_{kj} & i\ne j
-\end{cases}$$
-
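-This gives the Jacobian matrix $\mathrm{J}_{k}$ of each output row $\mathrm{Y}_{k}$ with respect to the corresponding input row $\mathrm{X}_{k}$, where $\mathrm{J_{k}}_{ij} = \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}}$.
-
-The input $\mathrm{X}_{kj}$ affects every element of output row $k$, so its gradient accumulates the contributions of all of them: each $\frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}}$ is weighted by the upstream gradient $\triangledown \mathrm{Y}_{ki}$ and summed over $i$.
-
-Therefore, by the chain rule, the backward pass is:
-$$ \triangledown \mathrm{X}_{kj} = \sum_{i} {\triangledown \mathrm{Y}_{ki} \times \frac{\partial \mathrm{Y}_{ki}}{\partial \mathrm{X}_{kj}}}$$
-
-Since $\mathrm{J}_{k}$ is symmetric, this is equivalent to: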
-
-$$ \triangledown \mathrm{X}_{k} = \mathrm{J}_{k} \times \triangledown \mathrm{Y}_{k} $$
-
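-In the implementation, numpy's `matmul` multiplies over the last two dimensions, so the per-row Jacobians can be stacked into one array and applied to all rows in a single call to obtain the final result.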
-
-
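-## Model Construction and Training
-### Model Construction
-
-Following `TorchModel` in `torch_mnist.py`, building the `numpy` model only requires replacing its operators with the ones implemented here: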
-```
-def forward(self, x):
-    x = x.reshape(-1, 28 * 28)
-
-    x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
-    x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
-
-    x = self.matmul_3.forward(x, self.W3)
-
-    x = self.softmax.forward(x)
-    x = self.log.forward(x)
-
-    return x
-```
-
-The model's computation graph is:
-![compu_graph](img/compu_graph.png)
-
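-Applying the chain rule along the computation graph gives the gradients of the leaf variables ($\mathrm{W}_{1}, \mathrm{W}_{2}, \mathrm{W}_{3}, \mathrm{X}$) and of the intermediate variables.
-
-The computation graph for backpropagation is:
-![backpropagation](img/backgraph.png)
-
-The gradients can then be computed by following the graph: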
-```
-def backward(self, y):
-    self.log_grad = self.log.backward(y)
-    self.softmax_grad = self.softmax.backward(self.log_grad)
-    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
-    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
-    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
-    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
-    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
-```
-
-### MiniBatch
-
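-The `mini_batch` method in `utils` simply calls PyTorch's `DataLoader`, a class that reads samples from a dataset and groups them into batches. Using `DataLoader` makes it easy to prefetch data in parallel with multiple worker processes, which speeds up training and saves code. `DataLoader` also accepts a custom `Sampler` to draw samples from the dataset in different ways, and a `BatchSampler` to control how the drawn samples are grouped into batches, which makes it possible to zero-pad the data within a batch, set a custom ratio of positive to negative samples per batch, and so on.
-
-Here we implement `mini_batch` by imitating the behavior of `DataLoader` with `shuffle=True`.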
-```
-def mini_batch(dataset, batch_size=128):
-    data = np.array([each[0].numpy() for each in dataset]) # convert the samples to arrays first
-    label = np.array([each[1] for each in dataset])
-    
-    data_size = data.shape[0]
-    idx = np.array([i for i in range(data_size)])
-    np.random.shuffle(idx)   # shuffle the order
-    
-    return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]])  for i in range(0, data_size, batch_size)]  # plays the role of DataLoader's BatchSampler, but builds all batches at once
-```
-
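-### Model Training
-
-After building the model and setting `epoch=10`, `learning_rate=0.1` and `batch_size=128`, training begins. Each step fits one batch: the forward pass computes the output, the loss is computed from the output, `loss.backward` computes the derivative of the loss with respect to the output (the gradient of the model output), and the model's `backward` then runs backpropagation. Finally, the model's `optimize` updates the parameters.
-
-Training process: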
-![train10](img/train10.png)
- 
-Test accuracy after each epoch:
-```
-[0] Test Accuracy: 0.9437
-[1] Test Accuracy: 0.9651
-[2] Test Accuracy: 0.9684
-[3] Test Accuracy: 0.9730
-[4] Test Accuracy: 0.9755
-[5] Test Accuracy: 0.9775
-[6] Test Accuracy: 0.9778
-[7] Test Accuracy: 0.9766
-[8] Test Accuracy: 0.9768
-[9] Test Accuracy: 0.9781
-```
-
-Raising `learning_rate` to 0.2 and retraining:
-![train02](img/train02.png)
-
-Test accuracy after each epoch:
-```
-[0] Test Accuracy: 0.9621
-[1] Test Accuracy: 0.9703
-[2] Test Accuracy: 0.9753
-[3] Test Accuracy: 0.9740
-[4] Test Accuracy: 0.9787
-[5] Test Accuracy: 0.9756
-[6] Test Accuracy: 0.9807
-[7] Test Accuracy: 0.9795
-[8] Test Accuracy: 0.9814
-[9] Test Accuracy: 0.9825
-```
-
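-With a slightly higher learning rate, parameter updates in early training are larger, the loss drops faster, and the model converges earlier. For the same number of iterations the test accuracy is now higher (0.9825 versus 0.9781 after 10 epochs).
-
-Raising `learning_rate` to 0.3 and retraining: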
-![train03](img/train03.png)
-
-```
-[0] Test Accuracy: 0.9554
-[1] Test Accuracy: 0.9715
-[2] Test Accuracy: 0.9744
-[3] Test Accuracy: 0.9756
-[4] Test Accuracy: 0.9782
-[5] Test Accuracy: 0.9795
-[6] Test Accuracy: 0.9801
-[7] Test Accuracy: 0.9816
-[8] Test Accuracy: 0.9828
-[9] Test Accuracy: 0.9778
-```
-
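-With the learning rate raised to 0.3, the loss in early training drops about as fast as in the previous run, but late in training the overly large learning rate makes the weights move around a local minimum in steps too large to settle into it, so the loss oscillates and converges poorly. In this run the test accuracy first rose to 0.9828 and then dropped to 0.9778.
-
-Therefore, 0.2 appears to be a suitable learning rate for a batch size of 128.
-
-Next, keeping the learning rate at 0.2 and changing batch_size to 256, retrain: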
-![train256](img/train256.png)
-```
-[0] Test Accuracy: 0.9453
-[1] Test Accuracy: 0.9621
-[2] Test Accuracy: 0.9657
-[3] Test Accuracy: 0.9629
-[4] Test Accuracy: 0.9733
-[5] Test Accuracy: 0.9766
-[6] Test Accuracy: 0.9721
-[7] Test Accuracy: 0.9768
-[8] Test Accuracy: 0.9724
-[9] Test Accuracy: 0.9775
-```
-
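-With a larger batch_size the parameters are still updated once per batch, so there are fewer updates per epoch and convergence is somewhat slower. However, comparing the loss curve of this run with those of the earlier runs shows smaller oscillations.
-
-Reducing batch_size to 64 and repeating the experiment: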
-![train64](img/train64.png)
-```
-[0] Test Accuracy: 0.9526
-[1] Test Accuracy: 0.9674
-[2] Test Accuracy: 0.9719
-[3] Test Accuracy: 0.9759
-[4] Test Accuracy: 0.9750
-[5] Test Accuracy: 0.9748
-[6] Test Accuracy: 0.9772
-[7] Test Accuracy: 0.9791
-[8] Test Accuracy: 0.9820
-[9] Test Accuracy: 0.9823
-```
-
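-The loss drops faster, but the oscillations are larger.
-
-Summary: within a certain range, a larger learning rate speeds up convergence; a smaller batch_size also speeds up convergence somewhat but increases oscillation. Too large a learning rate makes the loss oscillate late in training and hinders convergence; too small a learning rate makes the loss drop too slowly, and the model may even get stuck in a poor local minimum and miss a better one.
-
-## Other Optimization Methods
-
-### Momentum
-
-Plain gradient descent updates the parameters using only the gradient of the current batch, so the update direction can be swayed by unusual inputs. Momentum makes each update depend on previous gradients as well as the current one, which preserves the recent trend for some time. With momentum coefficient $\alpha$ and learning rate $\gamma$, the update is: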
-
-$$
-\begin{align}
-& v = \alpha v - \gamma \frac{\partial L}{\partial W} \\\\
-& W = W + v
-\end{align}
-$$
-
-We implemented Momentum in the model in `numpy_fnn.py`. Setting the learning rate to 0.02 and batch_size to 128, the experiment continues:
-![momentum](img/momentum.png)
-```
-[0] Test Accuracy: 0.9586
-[1] Test Accuracy: 0.9717
-[2] Test Accuracy: 0.9743
-[3] Test Accuracy: 0.9769
-[4] Test Accuracy: 0.9778
-[5] Test Accuracy: 0.9786
-[6] Test Accuracy: 0.9782
-[7] Test Accuracy: 0.9809
-[8] Test Accuracy: 0.9790
-[9] Test Accuracy: 0.9818
-```
-
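-Compared with plain gradient descent, momentum does not necessarily give a better final result. When the current gradient points in the same direction as the accumulated momentum, the parameters are adjusted by a larger amount, so the loss drops faster. Early in training the momentum mostly accumulates, and with too large a learning rate the values can easily overflow, which is why the learning rate suited to momentum is about an order of magnitude smaller than for plain gradient descent. Conversely, when the gradient direction is wrong, momentum delays the correction, so the parameters may overshoot the minimum.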
-
-### RMSProp
-
-
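-RMSProp introduces adaptive learning-rate adjustment. Early in training the learning rate should be high so the loss drops quickly; as training progresses it should shrink so that the model converges better. The basic idea is to scale the learning rate by the gradient: the larger the gradients, the faster the learning rate decays; later, as the gradients shrink, the decay slows.
-
-To keep the learning rate from decaying too fast early on, RMSProp uses an exponential moving average that gradually discards old gradient history. With decay rate $\rho$, learning rate $\gamma$ and a small constant $\delta$, the update is: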
-
-$$
-\begin{align}
-& h = \rho h + (1-\rho) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W} \\\\
-& W = W - \gamma \frac{1}{\sqrt{\delta + h}} \frac{\partial L}{\partial W}
-\end{align}$$
-
-Setting the learning rate to 0.001 and `weight_decay` to 0.01, train and test:
-![rmsprop](img/rmsprop.png)
-
-```
-[0] Test Accuracy: 0.9663
-[1] Test Accuracy: 0.9701
-[2] Test Accuracy: 0.9758
-[3] Test Accuracy: 0.9701
-[4] Test Accuracy: 0.9748
-[5] Test Accuracy: 0.9813
-[6] Test Accuracy: 0.9813
-[7] Test Accuracy: 0.9819
-[8] Test Accuracy: 0.9822
-[9] Test Accuracy: 0.9808
-```
-
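-In the middle of training, the loss oscillates less than with plain gradient descent. Early in training the model converges faster, but late in training it has no clear advantage over plain gradient descent.
-
-### Adam
-
-Adam combines momentum with adaptive learning-rate adjustment. It first computes first- and second-moment estimates of the gradient, which correspond to the momentum part and the adaptive part respectively: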
-
-$$
-\begin{align}
-& \mathrm{m} = \beta_1 \mathrm{m} + (1-\beta_1) \frac{\partial L}{\partial W} \\\\
-& \mathrm{v} = \beta_2 \mathrm{v} + (1-\beta_2) \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W}
-\end{align}
-$$
-
-These estimates are then bias-corrected, where $t$ is the number of update steps taken so far:
-
-$$
-\begin{align}
-& \mathrm{\hat{m}} = \frac{\mathrm{m}}{1-\beta_1 ^ t }\\\\
-& \mathrm{\hat{v}} = \frac{\mathrm{v}}{1-\beta_2 ^ t}
-\end{align}
-$$
-
-Finally, the parameters are updated as:
-$$ W = W - \gamma \frac{\mathrm{\hat m}}{\sqrt{\mathrm{\hat v}}+ \delta}$$
-
-
-Setting the learning rate to 0.001 and batch_size to 128, training begins:
-![adam](img/train_adam.png)
-```
-[0] Test Accuracy: 0.9611
-[1] Test Accuracy: 0.9701
-[2] Test Accuracy: 0.9735
-[3] Test Accuracy: 0.9752
-[4] Test Accuracy: 0.9787
-[5] Test Accuracy: 0.9788
-[6] Test Accuracy: 0.9763
-[7] Test Accuracy: 0.9790
-[8] Test Accuracy: 0.9752
-[9] Test Accuracy: 0.9806
-
-```
-
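-Compared with plain gradient descent, the loss oscillates slightly less and drops slightly faster early on, but the final convergence speed is about the same.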
\ No newline at end of file
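The Softmax backward rule derived in the README above can be sanity-checked numerically. The following is a minimal sketch in plain Python rather than the submission's numpy code: the helper names (`softmax`, `softmax_backward`, `numeric_backward`) and the sample values are assumptions made for illustration. It uses the closed form that follows from the Jacobian entries, $\triangledown X_{kj} = Y_{kj}(\triangledown Y_{kj} - \sum_i \triangledown Y_{ki} Y_{ki})$, and compares it with central finite differences on a single row.

```python
import math


def softmax(row):
    """Softmax of one row; the row max is subtracted to avoid overflow."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_backward(row, grad_y):
    """Closed-form input gradient: y_j * (g_j - sum_i g_i * y_i)."""
    y = softmax(row)
    dot = sum(g * yi for g, yi in zip(grad_y, y))
    return [yi * (g - dot) for yi, g in zip(y, grad_y)]


def numeric_backward(row, grad_y, h=1e-6):
    """Central finite differences of sum_i g_i * softmax(row)_i."""
    grads = []
    for j in range(len(row)):
        plus = list(row)
        minus = list(row)
        plus[j] += h
        minus[j] -= h
        yp, ym = softmax(plus), softmax(minus)
        grads.append(sum(g * (a - b) / (2 * h) for g, a, b in zip(grad_y, yp, ym)))
    return grads


# Hypothetical sample row and upstream gradient.
x = [0.5, -1.0, 2.0, 0.0]
g = [1.0, -2.0, 0.5, 3.0]
analytic = softmax_backward(x, g)
numeric = numeric_backward(x, g)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The closed form never builds the per-row Jacobian, so its memory use is linear in the row length; mathematically it gives the same gradient as multiplying by the Jacobian.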
diff --git a/assignment-2/submission/17307130331/img/backgraph.png b/assignment-2/submission/17307130331/img/backgraph.png
deleted file mode 100644
index c4a70b28e869708641bd01dba83730ed62ab9c4d..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/backgraph.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/compu_graph.png b/assignment-2/submission/17307130331/img/compu_graph.png
deleted file mode 100644
index 74f02ff1b4c4795c99600fb2e358d23a170f11c1..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/compu_graph.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/momentum.png b/assignment-2/submission/17307130331/img/momentum.png
deleted file mode 100644
index 152bfe4eda8bf98cb271e9e3af3801f223273ec2..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/momentum.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/rmsprop.png b/assignment-2/submission/17307130331/img/rmsprop.png
deleted file mode 100644
index d4c9f6d651ea0dcac312c3a7dcb38266a477679c..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/rmsprop.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train.png b/assignment-2/submission/17307130331/img/train.png
deleted file mode 100644
index 618816332b78c4f0498444a42dd2a5028df91ef1..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train02.png b/assignment-2/submission/17307130331/img/train02.png
deleted file mode 100644
index a2cbc7b9ccbf2f28955902b86881d7a640f50fa7..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train02.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train03.png b/assignment-2/submission/17307130331/img/train03.png
deleted file mode 100644
index 41dd8fd9060e6774b983375f3b025ee6335b9f66..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train03.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train10.png b/assignment-2/submission/17307130331/img/train10.png
deleted file mode 100644
index a2056ba0d21f8f40fc0279e532fd6b9f1ff79cef..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train10.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train256.png b/assignment-2/submission/17307130331/img/train256.png
deleted file mode 100644
index 81aa1b2bcc7f708607f8c402f9f41d579793f9e1..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train256.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train64.png b/assignment-2/submission/17307130331/img/train64.png
deleted file mode 100644
index 8f34749c6fda428437ff3fe11292b0213eca0d7a..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train64.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/train_adam.png b/assignment-2/submission/17307130331/img/train_adam.png
deleted file mode 100644
index eefa8b27deb6485f895033add750f018fd14e293..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/train_adam.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/img/trainloss.png b/assignment-2/submission/17307130331/img/trainloss.png
deleted file mode 100644
index b845297f03d5d6e6ae2b026b25554519a77f471b..0000000000000000000000000000000000000000
Binary files a/assignment-2/submission/17307130331/img/trainloss.png and /dev/null differ
diff --git a/assignment-2/submission/17307130331/numpy_fnn.py b/assignment-2/submission/17307130331/numpy_fnn.py
deleted file mode 100644
index 7b32d95b7825b4787f5d226ac058c0039aee4bba..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/numpy_fnn.py
+++ /dev/null
@@ -1,208 +0,0 @@
-import numpy as np
-
-
-class NumpyOp:
-    
-    def __init__(self):
-        self.memory = {}
-        self.epsilon = 1e-12
-
-
-class Matmul(NumpyOp):
-    
-    def forward(self, x, W):
-        """
-        x: shape(N, d)
-        w: shape(d, d')
-        """
-        self.memory['x'] = x
-        self.memory['W'] = W
-        h = np.matmul(x, W)
-        return h
-    
-    def backward(self, grad_y):
-        """
-        grad_y: shape(N, d')
-        """
-        
-        ####################
-        #      code 1      #
-        grad_W = np.matmul(self.memory['x'].T, grad_y)
-        grad_x = np.matmul(grad_y, self.memory['W'].T)
-        ####################
-        
-        return grad_x, grad_W
-
-
-class Relu(NumpyOp):
-    
-    def forward(self, x):
-        self.memory['x'] = x
-        return np.where(x > 0, x, np.zeros_like(x))
-    
-    def backward(self, grad_y):
-        """
-        grad_y: same shape as x
-        """
-        
-        ####################
-        #      code 2      #
-        ####################
-        grad_x = grad_y * (self.memory['x'] > 0)  # elementwise mask: the gradient passes only where x > 0
-        
-        return grad_x
-
-
-class Log(NumpyOp):
-    
-    def forward(self, x):
-        """
-        x: shape(N, c)
-        """
-        
-        out = np.log(x + self.epsilon)
-        self.memory['x'] = x
-        
-        return out
-    
-    def backward(self, grad_y):
-        """
-        grad_y: same shape as x
-        """
-        
-        ####################
-        #      code 3      #
-        ####################
-        grad_x = (1/(self.memory['x'] + self.epsilon)) * grad_y
-        return grad_x
-
-
-class Softmax(NumpyOp):
-    """
-    softmax over last dimension
-    """
-    
-    def forward(self, x):
-        """
-        x: shape(N, c)
-        """
-        
-        ####################
-        #      code 4      #
-        ####################
-        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))  # subtract the row max to avoid overflow
-        out = exp_x/np.sum(exp_x, axis=1, keepdims=True)
-        self.memory['x'] = x
-        self.memory['out'] = out
-        return out
-    
-    def backward(self, grad_y):
-        """
-        grad_y: same shape as x
-        """
-        o = self.memory['out']
-        Jacob = np.array([np.diag(r) - np.outer(r, r) for r in o]) 
-        # i!=j  - oi* oj
-        # i==j  oi*(1-oi)
-        grad_y = grad_y[:, np.newaxis, :]
-        grad_x = np.matmul(grad_y, Jacob).squeeze(1)
-        #print(grad_x.shape)
-        #print(grad_x)
-        return grad_x
-
-
-class NumpyLoss:
-    
-    def __init__(self):
-        self.target = None
-    
-    def get_loss(self, pred, target):
-        self.target = target
-        return (-pred * target).sum(axis=1).mean()
-    
-    def backward(self):
-        return -self.target / self.target.shape[0]
-
-
-class NumpyModel:
-    def __init__(self):
-        self.W1 = np.random.normal(size=(28 * 28, 256))
-        self.W2 = np.random.normal(size=(256, 64))
-        self.W3 = np.random.normal(size=(64, 10))
-        
-        # The operators below are used in forward and backward
-        self.matmul_1 = Matmul()
-        self.relu_1 = Relu()
-        self.matmul_2 = Matmul()
-        self.relu_2 = Relu()
-        self.matmul_3 = Matmul()
-        self.softmax = Softmax()
-        self.log = Log()
-        
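-        # The variables below are updated in backward. softmax_grad, log_grad, etc. are the gradients propagated back by each operator (partial derivatives of the loss with respect to the operator's input)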
-        self.x1_grad, self.W1_grad = None, None
-        self.relu_1_grad = None
-        self.x2_grad, self.W2_grad = None, None
-        self.relu_2_grad = None
-        self.x3_grad, self.W3_grad = None, None
-        self.softmax_grad = None
-        self.log_grad = None
-        
-        # The variables below are used by momentum / RMSProp
-        self.v1 = np.zeros_like(self.W1)
-        self.v2 = np.zeros_like(self.W2)
-        self.v3 = np.zeros_like(self.W3)
-        
-    
-    def forward(self, x):
-        x = x.reshape(-1, 28 * 28)
-        
-        x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
-        x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
-        
-        x = self.matmul_3.forward(x, self.W3)
-        
-        x = self.softmax.forward(x)
-        x = self.log.forward(x)
-        
-        return x
-    
-    def backward(self, y):
-        self.log_grad = self.log.backward(y)
-        self.softmax_grad = self.softmax.backward(self.log_grad)
-        self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
-        self.relu_2_grad = self.relu_2.backward(self.x3_grad)
-        self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
-        self.relu_1_grad = self.relu_1.backward(self.x2_grad)
-        self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
-        
-    
-    def optimize(self, learning_rate):
-        self.W1 -= learning_rate * self.W1_grad
-        self.W2 -= learning_rate * self.W2_grad
-        self.W3 -= learning_rate * self.W3_grad
-        
-    def momentum(self, learning_rate, alpha=0.9):
-        self.v1 = self.v1 * alpha - learning_rate * self.W1_grad
-        self.v2 = self.v2 * alpha - learning_rate * self.W2_grad
-        self.v3 = self.v3 * alpha - learning_rate * self.W3_grad
-        
-        self.W1 += self.v1
-        self.W2 += self.v2
-        self.W3 += self.v3
-    
-    def RMSProp(self, learning_rate, weight_decay = 0.99):
-        self.v1 = self.v1 * weight_decay + (1-weight_decay) * self.W1_grad * self.W1_grad
-        self.v2 = self.v2 * weight_decay + (1-weight_decay) * self.W2_grad * self.W2_grad
-        self.v3 = self.v3 * weight_decay + (1-weight_decay) * self.W3_grad * self.W3_grad
-        
-        self.W1 = self.W1 - learning_rate * self.W1_grad / np.sqrt( self.v1 + 1e-7)
-        self.W2 = self.W2 - learning_rate * self.W2_grad / np.sqrt( self.v2 + 1e-7)
-        self.W3 = self.W3 - learning_rate * self.W3_grad / np.sqrt( self.v3 + 1e-7)
-    
-    
-
-        
-        
-        
-        
\ No newline at end of file
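`NumpyModel.forward` in the file above applies `Softmax` and then `Log` with a small epsilon. A common alternative is to fuse the two steps into a log-softmax computed with the log-sum-exp trick, which stays finite for large logits and needs no epsilon. The sketch below is an illustration in plain Python, not part of the submission; the function names and the sample logits are assumptions.

```python
import math


def log_softmax(row):
    """log(softmax(row)) via log-sum-exp; never exponentiates a positive number."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def log_softmax_backward(row, grad_out):
    """Input gradient of the fused op: g_j - softmax_j * sum_i g_i."""
    lsm = log_softmax(row)
    total = sum(grad_out)
    return [g - math.exp(l) * total for g, l in zip(grad_out, lsm)]


# Logits this large would overflow a direct exp(x) / sum(exp(x)).
logits = [1000.0, 0.0, -1000.0]
out = log_softmax(logits)

# Gradient of the negative log-likelihood of class 0 for a small hypothetical row.
grad = log_softmax_backward([0.1, 0.2, 0.3], [-1.0, 0.0, 0.0])
```

With a one-hot upstream gradient such as the one produced by `NumpyLoss.backward`, the fused backward reduces to softmax minus the one-hot target (up to the batch-size factor), so the entries of `grad` sum to zero.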
diff --git a/assignment-2/submission/17307130331/numpy_mnist.py b/assignment-2/submission/17307130331/numpy_mnist.py
deleted file mode 100644
index 4187f01eeebbbcd6ab48bfacf8dedc37085e46e2..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/numpy_mnist.py
+++ /dev/null
@@ -1,70 +0,0 @@
-import numpy as np
-from numpy_fnn import NumpyModel, NumpyLoss
-from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot
-
-def mini_batch(dataset, batch_size=128):
-    data = np.array([each[0].numpy() for each in dataset])
-    label = np.array([each[1] for each in dataset])
-
-    data_size = data.shape[0]
-    idx = np.array([i for i in range(data_size)])
-    np.random.shuffle(idx)
-    
-    return [(data[idx[i: i+batch_size]], label[idx[i:i+batch_size]])  for i in range(0, data_size, batch_size)]
-
-class Adam():
-    def __init__(self, param, learning_rate=0.001, beta_1=0.9, beta_2=0.999):
-        self.param = param
-        self.iter = 0
-        self.m = 0
-        self.v = 0
-        self.beta1 = beta_1
-        self.beta2 = beta_2
-        self.lr = learning_rate
-    def optimize(self, grad):
-        self.iter+=1
-        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
-        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
-        m_hat = self.m / (1 - self.beta1 ** self.iter)
-        v_hat = self.v / (1 - self.beta2 ** self.iter)
-        self.param -= self.lr * m_hat / (v_hat ** 0.5 + 1e-8)
-        return self.param
-        
-def numpy_run():
-    train_dataset, test_dataset = download_mnist()
-    
-    model = NumpyModel()
-    numpy_loss = NumpyLoss()
-    model.W1, model.W2, model.W3 = get_torch_initialization()
-    
-    W1_opt, W2_opt, W3_opt = Adam(model.W1), Adam(model.W2), Adam(model.W3)
-    
-    train_loss = []
-    
-    epoch_number = 10
-    learning_rate = 0.0015
-    
-    for epoch in range(epoch_number):
-        for x, y in mini_batch(train_dataset, batch_size=128):
-            y = one_hot(y)
-            
-            y_pred = model.forward(x)
-            loss = numpy_loss.get_loss(y_pred, y)
-
-            model.backward(numpy_loss.backward())
-            #model.Adam(learning_rate)
-            W1_opt.optimize(model.W1_grad)
-            W2_opt.optimize(model.W2_grad)
-            W3_opt.optimize(model.W3_grad)
-            
-            train_loss.append(loss.item())
-        
-        x, y = batch(test_dataset)[0]
-        accuracy = np.mean((model.forward(x).argmax(axis=1) == y))
-        print('[{}] Test Accuracy: {:.4f}'.format(epoch, accuracy))
-    
-    plot_curve(train_loss)
-            
-
-if __name__ == "__main__":
-    numpy_run()
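The `Adam` class in the file above applies bias correction to both moment estimates. One effect of the correction is that the very first update has a magnitude close to the learning rate, regardless of the gradient scale. The sketch below illustrates this on a toy problem; it is a scalar re-statement of the same update rule, and the class name `ScalarAdam`, the objective $f(w) = (w-3)^2$ and the hyperparameters are assumptions chosen for the example.

```python
class ScalarAdam:
    """Adam for a single float parameter (illustrative only)."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment (momentum) estimate
        self.v = 0.0  # second-moment (squared-gradient) estimate
        self.t = 0    # number of steps taken

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (v_hat ** 0.5 + self.eps)


opt = ScalarAdam(lr=0.1)
w = 0.0
first_step = None
for _ in range(500):
    grad = 2.0 * (w - 3.0)          # derivative of (w - 3)^2
    new_w = opt.step(w, grad)
    if first_step is None:
        first_step = abs(new_w - w)  # size of the very first update
    w = new_w
```

On the first step `m_hat` equals the gradient and `v_hat` equals its square, so the update is roughly `lr * grad / |grad|`; without the correction both estimates would start near zero and be scaled by `1 - beta`.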
diff --git a/assignment-2/submission/17307130331/tester_demo.py b/assignment-2/submission/17307130331/tester_demo.py
deleted file mode 100644
index 515b86c1240eebad83287461548530c944f23bc8..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/tester_demo.py
+++ /dev/null
@@ -1,182 +0,0 @@
-import numpy as np
-import torch
-from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log
-
-from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss
-from torch_mnist import TorchModel
-from utils import get_torch_initialization, one_hot
-
-err_epsilon = 1e-6
-err_p = 0.4
-
-
-def check_result(numpy_result, torch_result=None):
-    if isinstance(numpy_result, list) and torch_result is None:
-        flag = True
-        for (n, t) in numpy_result:
-            flag = flag and check_result(n, t)
-        return flag
-    # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item())
-    T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item()
-    direction = T / torch_result.numel() < err_p
-    return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item()
-
-
-def case_1():
-    x = np.random.normal(size=[5, 6])
-    W = np.random.normal(size=[6, 4])
-    
-    numpy_matmul = Matmul()
-    numpy_out = numpy_matmul.forward(x, W)
-    numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out))
-    
-    torch_x = torch.from_numpy(x).clone().requires_grad_()
-    torch_W = torch.from_numpy(W).clone().requires_grad_()
-    
-    torch_out = torch_matmul(torch_x, torch_W)
-    torch_out.sum().backward()
-    
-    return check_result([
-        (numpy_out, torch_out),
-        (numpy_x_grad, torch_x.grad),
-        (numpy_W_grad, torch_W.grad)
-    ])
-
-
-def case_2():
-    x = np.random.normal(size=[5, 6])
-    
-    numpy_relu = Relu()
-    numpy_out = numpy_relu.forward(x)
-    numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out))
-    
-    torch_x = torch.from_numpy(x).clone().requires_grad_()
-    
-    torch_out = torch_relu(torch_x)
-    torch_out.sum().backward()
-    
-    return check_result([
-        (numpy_out, torch_out),
-        (numpy_x_grad, torch_x.grad),
-    ])
-
-
-def case_3():
-    x = np.random.uniform(low=0.0, high=1.0, size=[3, 4])
-    
-    numpy_log = Log()
-    numpy_out = numpy_log.forward(x)
-    numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out))
-    
-    torch_x = torch.from_numpy(x).clone().requires_grad_()
-    
-    torch_out = torch_log(torch_x)
-    torch_out.sum().backward()
-    
-    return check_result([
-        (numpy_out, torch_out),
-        
-        (numpy_x_grad, torch_x.grad),
-    ])
-
-
-def case_4():
-    x = np.random.normal(size=[4, 5])
-    
-    numpy_softmax = Softmax()
-    numpy_out = numpy_softmax.forward(x)
-    
-    torch_x = torch.from_numpy(x).clone().requires_grad_()
-    
-    torch_out = torch_softmax(torch_x, 1)
-    
-    return check_result(numpy_out, torch_out)
-
-
-def case_5():
-    x = np.random.normal(size=[20, 25])
-    
-    numpy_softmax = Softmax()
-    numpy_out = numpy_softmax.forward(x)
-    numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out))
-    
-    torch_x = torch.from_numpy(x).clone().requires_grad_()
-
-    torch_out = torch_softmax(torch_x, 1)
-    torch_out.sum().backward()
-
-    return check_result([
-        (numpy_out, torch_out),
-        (numpy_x_grad, torch_x.grad),
-    ])
-
-
-def test_model():
-    try:
-        numpy_loss = NumpyLoss()
-        numpy_model = NumpyModel()
-        torch_model = TorchModel()
-        torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False)
-        numpy_model.W1 = torch_model.W1.detach().clone().numpy()
-        numpy_model.W2 = torch_model.W2.detach().clone().numpy()
-        numpy_model.W3 = torch_model.W3.detach().clone().numpy()
-        
-        x = torch.randn((10000, 28, 28))
-        y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000)
-        
-        y = one_hot(y, numpy=False)
-        x2 = x.numpy()
-        y_pred = torch_model.forward(x)
-        loss = (-y_pred * y).sum(dim=1).mean()
-        loss.backward()
-        
-        y_pred_numpy = numpy_model.forward(x2)
-        numpy_loss.get_loss(y_pred_numpy, y.numpy())
-        
-        check_flag_1 = check_result(y_pred_numpy, y_pred)
-        print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10))
-    except:
-        print("[Runtime Error in forward]")
-        print("+ {:12} {}/{}".format("forward", 0, 10))
-        return 0
-    
-    try:
-        
-        numpy_model.backward(numpy_loss.backward())
-        
-        check_flag_2 = [
-            check_result(numpy_model.log_grad, torch_model.log_input.grad),
-            check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad),
-            check_result(numpy_model.W3_grad, torch_model.W3.grad),
-            check_result(numpy_model.W2_grad, torch_model.W2.grad),
-            check_result(numpy_model.W1_grad, torch_model.W1.grad)
-        ]
-        check_flag_2 = sum(check_flag_2) >= 4
-        print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20))
-    except:
-        print("[Runtime Error in backward]")
-        print("+ {:12} {}/{}".format("backward", 0, 20))
-        check_flag_2 = False
-    
-    return 10 * check_flag_1 + 20 * check_flag_2
-
-
-if __name__ == "__main__":
-    testcases = [
-        ["matmul", case_1, 5],
-        ["relu", case_2, 5],
-        ["log", case_3, 5],
-        ["softmax_1", case_4, 5],
-        ["softmax_2", case_5, 10],
-    ]
-    score = 0
-    for case in testcases:
-        try:
-            res = case[2] if case[1]() else 0
-        except:
-            print("[Runtime Error in {}]".format(case[0]))
-            res = 0
-        score += res
-        print("+ {:12} {}/{}".format(case[0], res, case[2]))
-    score += test_model()
-    print("{:14} {}/60".format("FINAL SCORE", score))
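`check_result` in the file above accepts a result only if two conditions hold: fewer than `err_p` of the elements have the opposite sign to the reference, and the mean absolute error is below `err_epsilon`. The sketch below restates that criterion for flat Python lists so the two conditions can be tried in isolation; the function name `check_flat` and the flat-list inputs are assumptions, and the original operates on numpy arrays and torch tensors.

```python
def check_flat(values, reference, err_epsilon=1e-6, err_p=0.4):
    """Return True if values match reference in sign pattern and mean error."""
    if len(values) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Fraction of elements whose sign is opposite to the reference.
    opposite = sum(1 for v, r in zip(values, reference) if v * r < 0)
    direction_ok = opposite / n < err_p
    # Mean absolute difference over all elements.
    mean_abs_err = sum(abs(v - r) for v, r in zip(values, reference)) / n
    return direction_ok and mean_abs_err < err_epsilon


same = check_flat([1.0, -2.0, 3.0], [1.0, -2.0, 3.0])
too_far = check_flat([1.0, -2.0, 3.0], [1.0, -2.0, 3.1])
# Tiny values with flipped signs pass the magnitude test but fail the sign test.
flipped = check_flat([1e-9, -1e-9], [-1e-9, 1e-9])
```

The sign condition matters for gradients that are very small in magnitude: a gradient with the wrong sign everywhere could otherwise slip under the mean-error threshold.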
diff --git a/assignment-2/submission/17307130331/torch_mnist.py b/assignment-2/submission/17307130331/torch_mnist.py
deleted file mode 100644
index 6d3e214c7606e3d43dac4b94554f942508afffb3..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/torch_mnist.py
+++ /dev/null
@@ -1,73 +0,0 @@
-import torch
-from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve
-
-
-class TorchModel:
-    
-    def __init__(self):
-        self.W1 = torch.randn((28 * 28, 256), requires_grad=True)
-        self.W2 = torch.randn((256, 64), requires_grad=True)
-        self.W3 = torch.randn((64, 10), requires_grad=True)
-        self.softmax_input = None
-        self.log_input = None
-    
-    def forward(self, x):
-        x = x.reshape(-1, 28 * 28)
-        x = torch.relu(torch.matmul(x, self.W1))
-        x = torch.relu(torch.matmul(x, self.W2))
-        x = torch.matmul(x, self.W3)
-        
-        self.softmax_input = x
-        self.softmax_input.retain_grad()
-        
-        x = torch.softmax(x, 1)
-        
-        self.log_input = x
-        self.log_input.retain_grad()
-        
-        x = torch.log(x)
-        
-        return x
-    
-    def optimize(self, learning_rate):
-        with torch.no_grad():
-            self.W1 -= learning_rate * self.W1.grad
-            self.W2 -= learning_rate * self.W2.grad
-            self.W3 -= learning_rate * self.W3.grad
-            
-            self.W1.grad = None
-            self.W2.grad = None
-            self.W3.grad = None
-
-
-def torch_run():
-    train_dataset, test_dataset = download_mnist()
-    
-    model = TorchModel()
-    model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False)
-    
-    train_loss = []
-    
-    epoch_number = 3
-    learning_rate = 0.1
-    
-    for epoch in range(epoch_number):
-        for x, y in mini_batch(train_dataset, numpy=False):
-            y = one_hot(y, numpy=False)
-            
-            y_pred = model.forward(x)
-            loss = (-y_pred * y).sum(dim=1).mean()
-            loss.backward()
-            model.optimize(learning_rate)
-            
-            train_loss.append(loss.item())
-        
-        x, y = batch(test_dataset, numpy=False)[0]
-        accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item()
-        print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
-    
-    plot_curve(train_loss)
-
-
-if __name__ == "__main__":
-    torch_run()
diff --git a/assignment-2/submission/17307130331/utils.py b/assignment-2/submission/17307130331/utils.py
deleted file mode 100644
index 709220cfa7a924d914ec1c098c505f864bcd4cfc..0000000000000000000000000000000000000000
--- a/assignment-2/submission/17307130331/utils.py
+++ /dev/null
@@ -1,71 +0,0 @@
-import torch
-import numpy as np
-from matplotlib import pyplot as plt
-
-
-def plot_curve(data):
-    plt.plot(range(len(data)), data, color='blue')
-    plt.legend(['loss_value'], loc='upper right')
-    plt.xlabel('step')
-    plt.ylabel('value')
-    plt.show()
-
-
-def download_mnist():
-    from torchvision import datasets, transforms
-    
-    transform = transforms.Compose([
-        transforms.ToTensor(),
-        transforms.Normalize(mean=(0.1307,), std=(0.3081,))
-    ])
-    
-    train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True)
-    test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True)
-    
-    return train_dataset, test_dataset
-
-
-def one_hot(y, numpy=True):
-    if numpy:
-        y_ = np.zeros((y.shape[0], 10))
-        y_[np.arange(y.shape[0], dtype=np.int32), y] = 1
-        return y_
-    else:
-        y_ = torch.zeros((y.shape[0], 10))
-        y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1
-    return y_
-
-
-def batch(dataset, numpy=True):
-    data = []
-    label = []
-    for each in dataset:
-        data.append(each[0])
-        label.append(each[1])
-    data = torch.stack(data)
-    label = torch.LongTensor(label)
-    if numpy:
-        return [(data.numpy(), label.numpy())]
-    else:
-        return [(data, label)]
-
-
-def mini_batch(dataset, batch_size=128, numpy=False):
-    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
-
-
-def get_torch_initialization(numpy=True):
-    fc1 = torch.nn.Linear(28 * 28, 256)
-    fc2 = torch.nn.Linear(256, 64)
-    fc3 = torch.nn.Linear(64, 10)
-    
-    if numpy:
-        W1 = fc1.weight.T.detach().clone().numpy()
-        W2 = fc2.weight.T.detach().clone().numpy()
-        W3 = fc3.weight.T.detach().clone().numpy()
-    else:
-        W1 = fc1.weight.T.detach().clone().data
-        W2 = fc2.weight.T.detach().clone().data
-        W3 = fc3.weight.T.detach().clone().data
-    
-    return W1, W2, W3
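`mini_batch` in `utils.py` delegates shuffling and batching to `DataLoader`, while the numpy version in `numpy_mnist.py` materialises every batch in a list. A lazy variant can yield one batch of indices at a time. The sketch below is an illustration in plain Python; the name `iter_minibatches`, the `seed` parameter and the sample sizes are assumptions, and the caller would index its own data and label arrays with each yielded batch.

```python
import random


def iter_minibatches(num_samples, batch_size=128, seed=None):
    """Yield lists of shuffled sample indices, one batch at a time."""
    rng = random.Random(seed)          # private generator, does not touch global state
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        # The final batch may be smaller than batch_size.
        yield order[start:start + batch_size]


batches = list(iter_minibatches(10, batch_size=4, seed=0))
sizes = [len(b) for b in batches]
```

Every index appears exactly once per pass, and the last batch holds the remainder when `num_samples` is not a multiple of `batch_size`, which matches how the slicing in the numpy `mini_batch` behaves.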