diff --git a/assignment-2/submission/18307130090/README.md b/assignment-2/submission/18307130090/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..647eb99d08956f5fea84c6aa563ab3e1576cfcc6
--- /dev/null
+++ b/assignment-2/submission/18307130090/README.md
@@ -0,0 +1,276 @@
# PRML-2021 Assignment 2

Name: 夏海淞

Student ID: 18307130090

## Overview

In this assignment I implemented a simple feed-forward neural network with `NumPy`, including the backward passes of the operators in `numpy_fnn.py` and the construction of the network model. To validate the model, I trained and tested it on the MNIST dataset. I also implemented the `Momentum` and `Adam` optimization algorithms and compared their performance.

## Backward passes of the operators

### `Matmul`

`Matmul` computes
$$
Y=X\times W
$$
where $Y,X,W$ are $n\times d'$, $n\times d$ and $d\times d'$ matrices, respectively.

By equations (B.20) and (B.21) of [Neural Networks and Deep Learning (Qiu Xipeng)](https://nndl.github.io/nndl-book.pdf),
$$
\frac{\partial Y}{\partial W}=\frac{\partial(X\times W)}{\partial W}=X^T\\\\
\frac{\partial Y}{\partial X}=\frac{\partial(X\times W)}{\partial X}=W^T
$$
Combining this with the chain rule and the rules of matrix calculus gives
$$
\nabla_X=\nabla_Y\times W^T\\\\
\nabla_W=X^T\times \nabla_Y
$$

### `Relu`

`Relu` computes
$$
Y_{ij}=\begin{cases}
X_{ij}&X_{ij}\ge0\\\\
0&\text{otherwise}
\end{cases}
$$
so
$$
\frac{\partial Y_{ij}}{\partial X_{ij}}=\begin{cases}
1&X_{ij}>0\\\\
0&\text{otherwise}
\end{cases}
$$
With the chain rule, the backward pass is $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.

### `Log`

`Log` computes
$$
Y_{ij}=\ln(X_{ij}+\epsilon),\quad\epsilon=10^{-12}
$$
so
$$
\frac{\partial Y_{ij}}{\partial X_{ij}}=\frac1{X_{ij}+\epsilon}
$$
With the chain rule, the backward pass is $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.

### `Softmax`

`Softmax` computes
$$
Y_{ij}=\frac{\exp\{X_{ij} \}}{\sum_{k=1}^c\exp\{X_{ik} \}}
$$
where $Y,X$ are both $N\times c$ matrices. `Softmax` operates on each row of $X$ independently, so for the row vectors $X_k,Y_k$ we have
$$
\frac{\partial Y_{ki}}{\partial X_{kj}}=\begin{cases}
\frac{\exp\{X_{kj} \}(\sum_t\exp\{X_{kt}\})-\exp\{2X_{ki}\}}{(\sum_t\exp\{X_{kt}\})^2}=Y_{ki}(1-Y_{ki})&i=j\\\\
-\frac{\exp\{X_{ki} \}\exp\{X_{kj} \}}{(\sum_t\exp\{X_{kt}\})^2}=-Y_{ki}Y_{kj}&i\not=j
\end{cases}
$$
This gives the Jacobian matrix of $Y_k$ with respect to $X_k$, with $J_{ij}=\frac{\partial Y_{ki}}{\partial X_{kj}}$. By the chain rule,
$$
\nabla_{X_k}=\nabla_{Y_k}\times J
$$
Stacking the row results gives the final backward pass.

## Model construction and training

### Model construction

#### `forward`

Following the model defined by the `TorchModel` class in `torch_mnist.py`, the forward pass is built as:

```python
def forward(self, x):
    x = x.reshape(-1, 28 * 28)

    x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
    x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
    x = self.matmul_3.forward(x, self.W3)

    x = self.log.forward(self.softmax.forward(x))

    return x
```

The computation graph of the model:

![](./img/fnn_model.png)

#### `backward`

Following the computation graph, the backward passes of the operators are invoked in reverse order:

```python
def backward(self, y):
    self.log_grad = self.log.backward(y)
    self.softmax_grad = self.softmax.backward(self.log_grad)
    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)

    return self.x1_grad
```

#### `mini_batch`

`mini_batch` speeds up training while still optimizing well. Full-batch gradient descent averages the loss over the entire dataset before each backward pass, which becomes very slow on large training sets. Purely stochastic gradient descent computes the loss and gradient for one sample at a time, so the dataset size no longer affects the cost of a step, but the randomness of individual samples can keep the parameters oscillating around the optimum instead of converging to it. The compromise is to split the dataset into small batches, which speeds up training while retaining good convergence behavior.

For this assignment, I re-implemented `mini_batch` in `numpy_mnist.py`, following `mini_batch` in `utils.py`:

```python
def mini_batch(dataset, batch_size=128):
    data = np.array([np.array(each[0]) for each in dataset])
    label = np.array([each[1] for each in dataset])

    size = data.shape[0]
    index = np.arange(size)
    np.random.shuffle(index)

    return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)]
```
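#### Gradient check

Before training, the hand-derived backward passes above can also be sanity-checked against central finite differences, independently of `tester_demo.py` (which compares against PyTorch). The following is a minimal sketch of such a check, assuming only the `forward`/`backward` interfaces of `Matmul` and `Softmax` defined in `numpy_fnn.py`; the helper `numerical_grad` and the tolerances are my own illustrative choices.

```python
import numpy as np

from numpy_fnn import Matmul, Softmax


def numerical_grad(f, x, eps=1e-6):
    # Central finite-difference gradient of a scalar function f at x.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


np.random.seed(0)
x = np.random.normal(size=(4, 5))
W = np.random.normal(size=(5, 3))
upstream = np.random.normal(size=(4, 5))  # fixed upstream gradient, so the softmax test is non-trivial

# Matmul: analytic gradients of sum(X @ W) vs. finite differences.
matmul = Matmul()
out = matmul.forward(x, W)
grad_x, grad_W = matmul.backward(np.ones_like(out))
assert np.allclose(grad_x, numerical_grad(lambda v: Matmul().forward(v, W).sum(), x.copy()), atol=1e-5)
assert np.allclose(grad_W, numerical_grad(lambda v: Matmul().forward(x, v).sum(), W.copy()), atol=1e-5)

# Softmax: analytic Jacobian-vector product vs. finite differences of sum(softmax(X) * upstream).
softmax = Softmax()
softmax.forward(x)
grad = softmax.backward(upstream)
numeric = numerical_grad(lambda v: (Softmax().forward(v) * upstream).sum(), x.copy())
assert np.allclose(grad, numeric, atol=1e-5)

print("gradient checks passed")
```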
### Model training

With `learning_rate=0.1`, `batch_size=128` and `epoch_number=10`, the training results are:

```
[0] Accuracy: 0.9486
[1] Accuracy: 0.9643
[2] Accuracy: 0.9724
[3] Accuracy: 0.9738
[4] Accuracy: 0.9781
[5] Accuracy: 0.9768
[6] Accuracy: 0.9796
[7] Accuracy: 0.9802
[8] Accuracy: 0.9800
[9] Accuracy: 0.9796
```

![](./img/SGD_normal.png)

Shrinking the batch size to `batch_size=64` gives:

```
[0] Accuracy: 0.9597
[1] Accuracy: 0.9715
[2] Accuracy: 0.9739
[3] Accuracy: 0.9771
[4] Accuracy: 0.9775
[5] Accuracy: 0.9803
[6] Accuracy: 0.9808
[7] Accuracy: 0.9805
[8] Accuracy: 0.9805
[9] Accuracy: 0.9716
```

![](./img/SGD_batch_size.png)

Lowering the learning rate to `learning_rate=0.01` gives:

```
[0] Accuracy: 0.8758
[1] Accuracy: 0.9028
[2] Accuracy: 0.9143
[3] Accuracy: 0.9234
[4] Accuracy: 0.9298
[5] Accuracy: 0.9350
[6] Accuracy: 0.9397
[7] Accuracy: 0.9434
[8] Accuracy: 0.9459
[9] Accuracy: 0.9501
```

![](./img/SGD_learning_rate.png)

From these results: with a suitable learning rate and batch size, the parameters converge more slowly as the learning rate decreases, and they oscillate more as the batch size decreases.

## Improving gradient descent

Plain gradient descent can be written as
$$
w_{t+1}=w_t-\eta\cdot\nabla f(w_t)
$$
Although gradient descent is widely used as an optimizer, it has some drawbacks:

- The update direction is determined entirely by the current gradient, so with a large learning rate the parameters may oscillate around the optimum.
- The learning rate cannot adapt to the training progress, so convergence is slow early in training and may fail late in training.

Many improved variants of gradient descent address these issues; `Momentum` and `Adam` are two typical ones.

### `Momentum`

To address "the update direction is determined entirely by the current gradient", `Momentum` introduces the notion of momentum.

By analogy with the physical world, when a ball rolls downhill its direction of motion depends not only on how steep the current position is, but also on its current velocity, i.e. on how steep the previous positions were. In `Momentum`, the parameter update therefore depends not on the current gradient alone but on an exponential moving average of the gradients over time:
$$
m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
w_{t+1}=w_t-\eta\cdot m_t
$$
The exponential moving average acts as inertia in the parameter updates. When the update direction is correct, `Momentum` speeds up training and damps oscillation; when the update direction is wrong, it loses some performance because it cannot change direction immediately.

Training with `Momentum` gives:

```
[0] Accuracy: 0.9444
[1] Accuracy: 0.9627
[2] Accuracy: 0.9681
[3] Accuracy: 0.9731
[4] Accuracy: 0.9765
[5] Accuracy: 0.9755
[6] Accuracy: 0.9768
[7] Accuracy: 0.9790
[8] Accuracy: 0.9794
[9] Accuracy: 0.9819
```

![](./img/SGDM.png)

Compared with plain gradient descent, there is no clear advantage.

### `Adam`

To address "the learning rate cannot adapt to the training progress", `Adam` builds on `Momentum` by adding a second moment ("second-order momentum").

The idea is as follows. A neural network has a large number of parameters, and they are not all updated equally often. For frequently updated parameters we want a somewhat smaller learning rate, to make convergence more likely; for the others we want a somewhat larger learning rate, to speed up convergence. Moreover, how often a parameter is updated may change over time, so the learning rate should adapt dynamically as well.

Since the size of an update is directly related to the current gradient, the accumulated sum of squared historical gradients can serve as a measure of how frequently a parameter is updated: if it is large, the parameter has been updated frequently and its learning rate should be reduced. Gradient descent then becomes
$$
m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
V_t=V_{t-1}+\nabla f(w_t)\odot\nabla f(w_t)\\\\
w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
$$
where $\odot$ denotes the element-wise product. However, $V_t$ is monotonically increasing in $t$, which may make the learning rate too small late in training so that the parameters cannot reach the optimum. Replacing $V_t$ with an exponential moving average as well avoids this:
$$
m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot\nabla f(w_t)\\\\
V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot\nabla f(w_t)\odot\nabla f(w_t)\\\\
w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
$$
(In the implementation, a small $\epsilon$ is added to the denominator for numerical stability.)

Training with `Adam` gives:

```
[0] Accuracy: 0.9657
[1] Accuracy: 0.9724
[2] Accuracy: 0.9759
[3] Accuracy: 0.9769
[4] Accuracy: 0.9788
[5] Accuracy: 0.9778
[6] Accuracy: 0.9775
[7] Accuracy: 0.9759
[8] Accuracy: 0.9786
[9] Accuracy: 0.9779
```

![](./img/Adam.png)

Compared with plain gradient descent, the loss oscillates less, while the convergence speed is about the same.
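To make the two update rules above concrete, here is a minimal, standalone sketch of them on a toy quadratic objective. It is an illustration only, not the assignment's `optimize` code; the function names (`momentum_step`, `adam_step`, `grad_f`) and the toy objective are my own, but the updates mirror the formulas above (exponential moving averages for both moments, a small `eps` in the denominator, and no bias correction, as in `numpy_fnn.py`).

```python
import numpy as np


def momentum_step(w, grad, m, lr=0.1, beta=0.9):
    # m_t = beta * m_{t-1} + (1 - beta) * grad;  w_{t+1} = w_t - lr * m_t
    m = beta * m + (1 - beta) * grad
    return w - lr * m, m


def adam_step(w, grad, m, v, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # First and second moments are exponential moving averages of the gradient
    # and of its element-wise square; no bias correction, matching the formulas above.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * np.square(grad)
    return w - lr * m / (np.sqrt(v) + eps), m, v


# Toy objective: f(w) = 0.5 * w^T diag(A) w with an ill-conditioned diagonal,
# so the two coordinates call for very different step sizes.
A = np.array([10.0, 0.1])


def grad_f(w):
    return A * w


w_m = np.array([1.0, 1.0])
w_a = np.array([1.0, 1.0])
m_m = np.zeros(2)
m_a, v_a = np.zeros(2), np.zeros(2)

for step in range(200):
    w_m, m_m = momentum_step(w_m, grad_f(w_m), m_m)
    w_a, m_a, v_a = adam_step(w_a, grad_f(w_a), m_a, v_a, lr=0.05)

print("Momentum:", w_m, "Adam:", w_a)
```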
diff --git a/assignment-2/submission/18307130090/img/Adam.png b/assignment-2/submission/18307130090/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe0326ebad52ad9356bdd7410834d9d61e9e5152
Binary files /dev/null and b/assignment-2/submission/18307130090/img/Adam.png differ
diff --git a/assignment-2/submission/18307130090/img/SGDM.png b/assignment-2/submission/18307130090/img/SGDM.png
new file mode 100644
index 0000000000000000000000000000000000000000..ba7ad91c5569f2605e7944afe3803863b8072b46
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGDM.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_batch_size.png b/assignment-2/submission/18307130090/img/SGD_batch_size.png
new file mode 100644
index 0000000000000000000000000000000000000000..328c4cc7bf90ef75a09f8c97ee8e9134d44a33dd
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_batch_size.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_learning_rate.png b/assignment-2/submission/18307130090/img/SGD_learning_rate.png
new file mode 100644
index 0000000000000000000000000000000000000000..7bca928d1aa569b08dad43d761da1b6e27e02942
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_learning_rate.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_normal.png b/assignment-2/submission/18307130090/img/SGD_normal.png
new file mode 100644
index 0000000000000000000000000000000000000000..e6f3933e1bf979fa7b3b643d8f7fe823610109e9
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_normal.png differ
diff --git a/assignment-2/submission/18307130090/img/fnn_model.png b/assignment-2/submission/18307130090/img/fnn_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..29ed50732a88ed1ca38a1cb3c6e82099a3d3e087
Binary files /dev/null and b/assignment-2/submission/18307130090/img/fnn_model.png differ
diff --git a/assignment-2/submission/18307130090/numpy_fnn.py b/assignment-2/submission/18307130090/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..7010cad4609f7ae31b8bdc0b19cedc005c5b950c
--- /dev/null
+++ b/assignment-2/submission/18307130090/numpy_fnn.py
@@ -0,0 +1,239 @@
import numpy as np


class NumpyOp:

    def __init__(self):
        self.memory = {}
        self.epsilon = 1e-12


class Matmul(NumpyOp):

    def forward(self, x, W):
        """
        x: shape(N, d)
        W: shape(d, d')
        """
        self.memory['x'] = x
        self.memory['W'] = W
        h = np.matmul(x, W)
        return h

    def backward(self, grad_y):
        """
        grad_y: shape(N, d')
        """
        x, W = self.memory['x'], self.memory['W']
        grad_x = np.matmul(grad_y, W.T)
        grad_W = np.matmul(x.T, grad_y)

        return grad_x, grad_W


class Relu(NumpyOp):

    def forward(self, x):
        self.memory['x'] = x
        return np.where(x > 0, x, np.zeros_like(x))

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        x = self.memory['x']
        grad_x = grad_y * np.where(x > 0, np.ones_like(x), np.zeros_like(x))

        return grad_x


class Log(NumpyOp):

    def forward(self, x):
        """
        x: shape(N, c)
        """

        out = np.log(x + self.epsilon)
        self.memory['x'] = x

        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        x = self.memory['x']
        grad_x = grad_y * np.reciprocal(x + self.epsilon)

        return grad_x


class Softmax(NumpyOp):
    """
    softmax over last dimension
    """

    def forward(self, x):
        """
        x: shape(N, c)
        """
        # subtract the row-wise maximum for numerical stability (softmax is shift-invariant)
        exp_x = np.exp(x - x.max(axis=1, keepdims=True))
        exp_sum = np.sum(exp_x, axis=1, keepdims=True)
        out = exp_x / exp_sum
        self.memory['x'] = x
        self.memory['out'] = out

        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        sm = self.memory['out']
        Jacobs = np.array([np.diag(r) - np.outer(r, r) for r in sm])

        grad_y = grad_y[:, np.newaxis, :]
        grad_x
= np.matmul(grad_y, Jacobs).squeeze(axis=1) + + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + self.beta_1 = 0.9 + self.beta_2 = 0.999 + self.epsilon = 1e-8 + self.is_first = True + + self.W1_grad_mean = None + self.W2_grad_mean = None + self.W3_grad_mean = None + + self.W1_grad_square_mean = None + self.W2_grad_square_mean = None + self.W3_grad_square_mean = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + x = self.relu_1.forward(self.matmul_1.forward(x, self.W1)) + x = self.relu_2.forward(self.matmul_2.forward(x, self.W2)) + x = self.matmul_3.forward(x, self.W3) + + x = self.log.forward(self.softmax.forward(x)) + + return x + + def backward(self, y): + self.log_grad = self.log.backward(y) + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad) + + return self.x1_grad + + def optimize(self, learning_rate): + def SGD(): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def SGDM(): + if self.is_first: + self.is_first = False + + self.W1_grad_mean = self.W1_grad + self.W2_grad_mean = self.W2_grad + self.W3_grad_mean = self.W3_grad + else: + self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad + self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad + self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad + + delta_1 = learning_rate * self.W1_grad_mean + delta_2 = learning_rate * self.W2_grad_mean + delta_3 = learning_rate * self.W3_grad_mean + + self.W1 -= delta_1 + self.W2 -= delta_2 + self.W3 -= delta_3 + + def Adam(learning_rate=0.001): + if self.is_first: + self.is_first = False + self.W1_grad_mean = self.W1_grad + self.W2_grad_mean = self.W2_grad + self.W3_grad_mean = self.W3_grad + + self.W1_grad_square_mean = np.square(self.W1_grad) + self.W2_grad_square_mean = np.square(self.W2_grad) + self.W3_grad_square_mean = np.square(self.W3_grad) + + self.W1 -= learning_rate * self.W1_grad_mean + self.W2 -= learning_rate * self.W2_grad_mean + self.W3 -= learning_rate * self.W3_grad_mean + else: + self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad + self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad + 
self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad + + self.W1_grad_square_mean = self.beta_2 * self.W1_grad_square_mean + (1 - self.beta_2) * np.square( + self.W1_grad) + self.W2_grad_square_mean = self.beta_2 * self.W2_grad_square_mean + (1 - self.beta_2) * np.square( + self.W2_grad) + self.W3_grad_square_mean = self.beta_2 * self.W3_grad_square_mean + (1 - self.beta_2) * np.square( + self.W3_grad) + + delta_1 = learning_rate * self.W1_grad_mean * np.reciprocal( + np.sqrt(self.W1_grad_square_mean) + np.full_like(self.W1_grad_square_mean, self.epsilon)) + delta_2 = learning_rate * self.W2_grad_mean * np.reciprocal( + np.sqrt(self.W2_grad_square_mean) + np.full_like(self.W2_grad_square_mean, self.epsilon)) + delta_3 = learning_rate * self.W3_grad_mean * np.reciprocal( + np.sqrt(self.W3_grad_square_mean) + np.full_like(self.W3_grad_square_mean, self.epsilon)) + + self.W1 -= delta_1 + self.W2 -= delta_2 + self.W3 -= delta_3 + + # SGD() + # SGDM() + Adam() diff --git a/assignment-2/submission/18307130090/numpy_mnist.py b/assignment-2/submission/18307130090/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d67f25824dabdc5791ae5cc96655affe8315e72 --- /dev/null +++ b/assignment-2/submission/18307130090/numpy_mnist.py @@ -0,0 +1,50 @@ +import numpy as np + +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot + + +def mini_batch(dataset, batch_size=128): + data = np.array([np.array(each[0]) for each in dataset]) + label = np.array([each[1] for each in dataset]) + + size = data.shape[0] + index = np.arange(size) + np.random.shuffle(index) + + return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)] + + +def numpy_run(): + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + train_loss = [] + + epoch_number = 10 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset): + y = one_hot(y) + + y_pred = model.forward(x) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/18307130090/tester_demo.py b/assignment-2/submission/18307130090/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420 --- /dev/null +++ b/assignment-2/submission/18307130090/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * 
torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, 
torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/18307130090/torch_mnist.py b/assignment-2/submission/18307130090/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d3e214c7606e3d43dac4b94554f942508afffb3 --- /dev/null +++ b/assignment-2/submission/18307130090/torch_mnist.py @@ -0,0 +1,73 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x = torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = TorchModel() + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/18307130090/utils.py b/assignment-2/submission/18307130090/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..709220cfa7a924d914ec1c098c505f864bcd4cfc --- /dev/null +++ b/assignment-2/submission/18307130090/utils.py @@ -0,0 +1,71 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + 
transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3