diff --git a/assignment-2/submission/18307130090/README.md b/assignment-2/submission/18307130090/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..647eb99d08956f5fea84c6aa563ab3e1576cfcc6
--- /dev/null
+++ b/assignment-2/submission/18307130090/README.md
@@ -0,0 +1,276 @@
+# PRML-2021 Assignment2
+
+Name: 夏海淞
+
+Student ID: 18307130090
+
+## Overview
+
+In this assignment I implemented a simple feed-forward neural network with `NumPy`, including the backward passes of the operators in `numpy_fnn.py` and the construction of the network model. To validate the model, I trained and tested it on the MNIST dataset. In addition, I implemented the `Momentum` and `Adam` optimization algorithms and compared their performance.
+
+## Backward Propagation of the Operators
+
+### `Matmul`
+
+The forward computation of `Matmul` is:
+$$
+Y=X\times W
+$$
+where $Y$, $X$ and $W$ are $n\times d'$, $n\times d$ and $d\times d'$ matrices respectively.
+
+By formulas (B.20) and (B.21) of [Neural Networks and Deep Learning (Xipeng Qiu)](https://nndl.github.io/nndl-book.pdf), we have
+$$
+\frac{\partial Y}{\partial W}=\frac{\partial(X\times W)}{\partial W}=X^T\\\\
+\frac{\partial Y}{\partial X}=\frac{\partial(X\times W)}{\partial X}=W^T
+$$
+Combining the chain rule with the rules of matrix calculus, we obtain
+$$
+\nabla_X=\nabla_Y\times W^T\\\\
+\nabla_W=X^T\times \nabla_Y
+$$
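+
+These formulas can be verified numerically. The sketch below is illustrative only and not part of the submission; it assumes a scalar surrogate loss $L=\sum_{ij}G_{ij}Y_{ij}$, whose gradient with respect to $Y$ is exactly $G$, and compares the analytic $\nabla_X$ with a central finite difference:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(5, 4))
+W = rng.normal(size=(4, 3))
+G = rng.normal(size=(5, 3))               # stands in for the upstream gradient grad_Y
+
+grad_X = G @ W.T                          # analytic: grad_X = grad_Y @ W^T
+grad_W = X.T @ G                          # analytic: grad_W = X^T @ grad_Y
+
+def surrogate_loss(X, W):                 # scalar L = sum(G * (X @ W)), so dL/dY = G
+    return np.sum(G * (X @ W))
+
+eps = 1e-6
+num_grad_X = np.zeros_like(X)
+for i in range(X.shape[0]):
+    for j in range(X.shape[1]):
+        Xp, Xm = X.copy(), X.copy()
+        Xp[i, j] += eps
+        Xm[i, j] -= eps
+        num_grad_X[i, j] = (surrogate_loss(Xp, W) - surrogate_loss(Xm, W)) / (2 * eps)
+
+print(np.abs(num_grad_X - grad_X).max())  # ~1e-9, matching the analytic formula
+```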
+
+### `Relu`
+
+The forward computation of `Relu` is:
+$$
+Y_{ij}=\begin{cases}
+X_{ij}&X_{ij}\ge0\\\\
+0&\text{otherwise}
+\end{cases}
+$$
+Hence
+$$
+\frac{\partial Y_{ij}}{\partial X_{ij}}=\begin{cases}
+1&X_{ij}>0\\\\
+0&\text{otherwise}
+\end{cases}
+$$
+Combining this with the chain rule gives the backward formula: $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.
+
+### `Log`
+
+The forward computation of `Log` is:
+$$
+Y_{ij}=\ln(X_{ij}+\epsilon),\epsilon=10^{-12}
+$$
+Hence
+$$
+\frac{\partial Y_{ij}}{\partial X_{ij}}=\frac1{X_{ij}+\epsilon}
+$$
+Combining this with the chain rule gives the backward formula: $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.
+
+### `Softmax`
+
+The forward computation of `Softmax` is:
+$$
+Y_{ij}=\frac{\exp\{X_{ij} \}}{\sum_{k=1}^c\exp\{X_{ik} \}}
+$$
+where $Y$ and $X$ are both $N\times c$ matrices. Observe that `Softmax` operates on each row of $X$ independently. Hence, for the row components $X_k$ and $Y_k$ of $X$ and $Y$, we have
+$$
+\frac{\partial Y_{ki}}{\partial X_{kj}}=\begin{cases}
+\frac{\exp\{X_{kj} \}(\sum_t\exp\{X_{kt}\})-\exp\{2X_{ki}\}}{(\sum_t\exp\{X_{kt}\})^2}=Y_{ki}(1-Y_{ki})&i=j\\\\
+-\frac{\exp\{X_{ki} \}\exp\{X_{kj} \}}{(\sum_t\exp\{X_{kt}\})^2}=-Y_{ki}Y_{kj}&i\not=j
+\end{cases}
+$$
+This yields the Jacobian matrix of $Y_k$ with respect to $X_k$, whose entries satisfy $J_{ij}=\frac{\partial Y_{ki}}{\partial X_{kj}}$. Combining it with the chain rule gives
+$$
+\nabla_{X_k}=\nabla_{Y_k}\times J
+$$
+Stacking the row components back together gives the final result of the backward pass.
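+
+As a quick sanity check, the sketch below (standalone and illustrative only, separate from the submitted `Softmax` operator) builds the per-row Jacobian exactly as derived and contracts it with the upstream gradient; because each row of $Y$ sums to 1, a constant upstream gradient must produce a zero gradient on $X$:
+
+```python
+import numpy as np
+
+def softmax_rows(x):
+    # shift by the row max for numerical stability; softmax is invariant to per-row shifts
+    e = np.exp(x - x.max(axis=1, keepdims=True))
+    return e / e.sum(axis=1, keepdims=True)
+
+def softmax_backward(y, grad_y):
+    # y: softmax output (N, c); per-row Jacobian J = diag(y_k) - y_k y_k^T
+    jacobians = np.array([np.diag(r) - np.outer(r, r) for r in y])
+    # chain rule per row: grad_x[k, j] = sum_i grad_y[k, i] * J[k, i, j]
+    return np.einsum('ni,nij->nj', grad_y, jacobians)
+
+x = np.random.normal(size=(4, 3))
+y = softmax_rows(x)
+# the columns of each Jacobian sum to zero, so a constant upstream gradient vanishes
+print(np.abs(softmax_backward(y, np.ones_like(y))).max())   # ~0
+```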
+
+## Model Construction and Training
+
+### Model Construction
+
+#### `forward`
+
+Following the model defined by the `TorchModel` class in `torch_mnist.py`, the forward pass is built as follows:
+
+```python
+def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+
+ x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
+ x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
+ x = self.matmul_3.forward(x, self.W3)
+
+ x = self.log.forward(self.softmax.forward(x))
+
+ return x
+```
+
+The computation graph of the model is shown below:
+
+![computation graph](img/fnn_model.png)
+
+#### `backward`
+
+Following the computation graph in reverse order, `backward` simply calls each operator's backward method in turn:
+
+```python
+def backward(self, y):
+ self.log_grad = self.log.backward(y)
+ self.softmax_grad = self.softmax.backward(self.log_grad)
+ self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+ self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+ self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+ self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+ self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+ return self.x1_grad
+```
+
+#### `mini_batch`
+
+The purpose of `mini_batch` is to speed up training while still achieving good optimization. Full-batch gradient descent computes the average loss over the entire dataset and then back-propagates the corresponding gradient; when the training set is large, this severely slows down training. Stochastic gradient descent instead computes the loss and gradient for one sample at a time, so the dataset size no longer affects the speed of a single update, but the randomness of individual samples may keep the parameters oscillating around the optimum instead of converging to it. A compromise is to split the dataset into batches, which speeds up training while retaining good convergence behavior.
+
+In this assignment, following the `mini_batch` in `utils.py`, I re-implemented the `mini_batch` function in `numpy_mnist.py`:
+
+```python
+def mini_batch(dataset, batch_size=128):
+ data = np.array([np.array(each[0]) for each in dataset])
+ label = np.array([each[1] for each in dataset])
+
+ size = data.shape[0]
+ index = np.arange(size)
+ np.random.shuffle(index)
+
+ return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)]
+```
+
+### Model Training
+
+With `learning_rate=0.1`, `batch_size=128` and `epoch_number=10`, the training results are as follows:
+
+```
+[0] Accuracy: 0.9486
+[1] Accuracy: 0.9643
+[2] Accuracy: 0.9724
+[3] Accuracy: 0.9738
+[4] Accuracy: 0.9781
+[5] Accuracy: 0.9768
+[6] Accuracy: 0.9796
+[7] Accuracy: 0.9802
+[8] Accuracy: 0.9800
+[9] Accuracy: 0.9796
+```
+
+![training loss, learning_rate=0.1, batch_size=128](img/SGD_normal.png)
+
+Next, I reduced the batch size to `batch_size=64`. The training results are as follows:
+
+```
+[0] Accuracy: 0.9597
+[1] Accuracy: 0.9715
+[2] Accuracy: 0.9739
+[3] Accuracy: 0.9771
+[4] Accuracy: 0.9775
+[5] Accuracy: 0.9803
+[6] Accuracy: 0.9808
+[7] Accuracy: 0.9805
+[8] Accuracy: 0.9805
+[9] Accuracy: 0.9716
+```
+
+![training loss, batch_size=64](img/SGD_batch_size.png)
+
+Then I lowered the learning rate to `learning_rate=0.01`. The training results are as follows:
+
+```
+[0] Accuracy: 0.8758
+[1] Accuracy: 0.9028
+[2] Accuracy: 0.9143
+[3] Accuracy: 0.9234
+[4] Accuracy: 0.9298
+[5] Accuracy: 0.9350
+[6] Accuracy: 0.9397
+[7] Accuracy: 0.9434
+[8] Accuracy: 0.9459
+[9] Accuracy: 0.9501
+```
+
+![training loss, learning_rate=0.01](img/SGD_learning_rate.png)
+
+From these experiments, the following conclusion can be drawn:
+
+With otherwise suitable settings, the parameters converge more slowly as the learning rate decreases, and they oscillate more as the batch size decreases.
+
+## Improving Gradient Descent
+
+Vanilla gradient descent can be written as:
+$$
+w_{t+1}=w_t-\eta\cdot\nabla f(w_t)
+$$
+Although gradient descent is widely used as an optimizer, it still has some drawbacks:
+
+- The update direction is determined entirely by the current gradient, so when the learning rate is too large the parameters may oscillate around the optimum;
+- The learning rate cannot change as training progresses, so convergence is slow early in training and may fail late in training.
+
+Many improved variants of gradient descent address these issues; two representative ones are `Momentum` and `Adam`.
+
+### `Momentum`
+
+To address the problem that the update direction is determined entirely by the current gradient, `Momentum` introduces the notion of momentum.
+
+By analogy with the physical world: when a ball rolls downhill, its direction of motion depends not only on how steep its current position is but also on its current velocity, i.e. on the steepness of previous positions. In `Momentum`, the parameter update therefore depends not on the current gradient alone but on an exponential moving average of the gradients over time:
+$$
+m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
+w_{t+1}=w_t-\eta\cdot m_t
+$$
+The exponential moving average acts as "inertia" in the parameter updates. When the update direction is correct, `Momentum` helps speed up training and reduces oscillation; when the update direction is wrong, however, `Momentum` loses some performance because it cannot change direction promptly.
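+
+A minimal sketch of this update rule on a single weight matrix is given below (illustrative only; the $\beta$ and $\eta$ values are assumptions, and the submitted `SGDM` in `numpy_fnn.py` seeds the moving average with the first gradient rather than zero):
+
+```python
+import numpy as np
+
+def momentum_step(W, m, grad, beta=0.9, eta=0.1):
+    """One Momentum update: m_t = beta*m_{t-1} + (1-beta)*grad, w_{t+1} = w_t - eta*m_t."""
+    m = beta * m + (1 - beta) * grad
+    return W - eta * m, m
+
+W = np.random.normal(size=(4, 3))
+m = np.zeros_like(W)                       # moving average of gradients
+for _ in range(3):
+    grad = np.random.normal(size=W.shape)  # stand-in for a real back-propagated gradient
+    W, m = momentum_step(W, m, grad)
+```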
+
+The training results with `Momentum` are as follows:
+
+```
+[0] Accuracy: 0.9444
+[1] Accuracy: 0.9627
+[2] Accuracy: 0.9681
+[3] Accuracy: 0.9731
+[4] Accuracy: 0.9765
+[5] Accuracy: 0.9755
+[6] Accuracy: 0.9768
+[7] Accuracy: 0.9790
+[8] Accuracy: 0.9794
+[9] Accuracy: 0.9819
+```
+
+![training loss with Momentum](img/SGDM.png)
+
+Compared with vanilla gradient descent, there is no obvious advantage here.
+
+### `Adam`
+
+To address the problem that the learning rate cannot change as training progresses, `Adam` builds on `Momentum` by introducing a second-order moment.
+
+The idea behind `Adam` is as follows. A neural network has a large number of parameters, and different parameters are updated at different frequencies. For frequently updated parameters, we want to lower the learning rate somewhat to improve the chance of convergence; for the others, we want to raise it somewhat to speed up convergence. Moreover, since a parameter's update frequency may change over time, we also want the learning rate to adapt dynamically.
+
+Since a parameter's update is directly related to its gradient, the accumulated sum of squared gradients is used to measure how frequently the parameter is updated. A large sum of squared gradients indicates frequent updates, so the learning rate should be lowered. The update rule then becomes:
+$$
+m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
+V_t=V_{t-1}+(\nabla f(w_t))^2\\\\
+w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
+$$
+where the square of the gradient is taken elementwise. However, since $V_t$ is monotonically increasing in $t$, the learning rate may become too small late in training and the parameters may fail to converge to the optimum. Hence $V_t$ is also replaced by an exponential moving average, which avoids this problem:
+$$
+m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot\nabla f(w_t)\\\\
+V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot(\nabla f(w_t))^2\\\\
+w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
+$$
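+
+A minimal sketch of this update on a single weight matrix is shown below (illustrative only; no bias correction is applied, matching the simplified formulas above, and the values $\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$, $\epsilon=10^{-8}$ follow the submitted code, which additionally seeds the moving averages with the first gradient rather than zero):
+
+```python
+import numpy as np
+
+def adam_step(W, m, V, grad, beta_1=0.9, beta_2=0.999, eta=0.001, eps=1e-8):
+    m = beta_1 * m + (1 - beta_1) * grad          # first moment: moving average of gradients
+    V = beta_2 * V + (1 - beta_2) * grad ** 2     # second moment: moving average of squared gradients
+    W = W - eta * m / (np.sqrt(V) + eps)          # per-parameter adaptive step
+    return W, m, V
+
+W = np.random.normal(size=(4, 3))
+m, V = np.zeros_like(W), np.zeros_like(W)
+for _ in range(3):
+    grad = np.random.normal(size=W.shape)         # stand-in for a real back-propagated gradient
+    W, m, V = adam_step(W, m, V, grad)
+```
+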
+The training results with `Adam` are as follows:
+
+```
+[0] Accuracy: 0.9657
+[1] Accuracy: 0.9724
+[2] Accuracy: 0.9759
+[3] Accuracy: 0.9769
+[4] Accuracy: 0.9788
+[5] Accuracy: 0.9778
+[6] Accuracy: 0.9775
+[7] Accuracy: 0.9759
+[8] Accuracy: 0.9786
+[9] Accuracy: 0.9779
+```
+
+![training loss with Adam](img/Adam.png)
+
+Compared with vanilla gradient descent, the oscillation of the loss is somewhat reduced, while the convergence speed is about the same.
\ No newline at end of file
diff --git a/assignment-2/submission/18307130090/img/Adam.png b/assignment-2/submission/18307130090/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe0326ebad52ad9356bdd7410834d9d61e9e5152
Binary files /dev/null and b/assignment-2/submission/18307130090/img/Adam.png differ
diff --git a/assignment-2/submission/18307130090/img/SGDM.png b/assignment-2/submission/18307130090/img/SGDM.png
new file mode 100644
index 0000000000000000000000000000000000000000..ba7ad91c5569f2605e7944afe3803863b8072b46
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGDM.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_batch_size.png b/assignment-2/submission/18307130090/img/SGD_batch_size.png
new file mode 100644
index 0000000000000000000000000000000000000000..328c4cc7bf90ef75a09f8c97ee8e9134d44a33dd
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_batch_size.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_learning_rate.png b/assignment-2/submission/18307130090/img/SGD_learning_rate.png
new file mode 100644
index 0000000000000000000000000000000000000000..7bca928d1aa569b08dad43d761da1b6e27e02942
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_learning_rate.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_normal.png b/assignment-2/submission/18307130090/img/SGD_normal.png
new file mode 100644
index 0000000000000000000000000000000000000000..e6f3933e1bf979fa7b3b643d8f7fe823610109e9
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_normal.png differ
diff --git a/assignment-2/submission/18307130090/img/fnn_model.png b/assignment-2/submission/18307130090/img/fnn_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..29ed50732a88ed1ca38a1cb3c6e82099a3d3e087
Binary files /dev/null and b/assignment-2/submission/18307130090/img/fnn_model.png differ
diff --git a/assignment-2/submission/18307130090/numpy_fnn.py b/assignment-2/submission/18307130090/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..7010cad4609f7ae31b8bdc0b19cedc005c5b950c
--- /dev/null
+++ b/assignment-2/submission/18307130090/numpy_fnn.py
@@ -0,0 +1,239 @@
+import numpy as np
+
+
+class NumpyOp:
+
+ def __init__(self):
+ self.memory = {}
+ self.epsilon = 1e-12
+
+
+class Matmul(NumpyOp):
+
+ def forward(self, x, W):
+ """
+ x: shape(N, d)
+ w: shape(d, d')
+ """
+ self.memory['x'] = x
+ self.memory['W'] = W
+ h = np.matmul(x, W)
+ return h
+
+ def backward(self, grad_y):
+ """
+ grad_y: shape(N, d')
+ """
+ x, W = self.memory['x'], self.memory['W']
+ grad_x = np.matmul(grad_y, W.T)
+ grad_W = np.matmul(x.T, grad_y)
+
+ return grad_x, grad_W
+
+
+class Relu(NumpyOp):
+
+ def forward(self, x):
+ self.memory['x'] = x
+ return np.where(x > 0, x, np.zeros_like(x))
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+ x = self.memory['x']
+ grad_x = grad_y * np.where(x > 0, np.ones_like(x), np.zeros_like(x))
+
+ return grad_x
+
+
+class Log(NumpyOp):
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+
+ out = np.log(x + self.epsilon)
+ self.memory['x'] = x
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+ x = self.memory['x']
+ grad_x = grad_y * np.reciprocal(x + self.epsilon)
+
+ return grad_x
+
+
+class Softmax(NumpyOp):
+ """
+ softmax over last dimension
+ """
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+        # subtract the per-row max for numerical stability; softmax is invariant to per-row shifts
+        exp_x = np.exp(x - x.max(axis=1, keepdims=True))
+ exp_sum = np.sum(exp_x, axis=1, keepdims=True)
+ out = exp_x / exp_sum
+ self.memory['x'] = x
+ self.memory['out'] = out
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+ sm = self.memory['out']
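+        # per-row Jacobian of softmax: J = diag(y) - y y^T (see the derivation in README.md)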
+ Jacobs = np.array([np.diag(r) - np.outer(r, r) for r in sm])
+
+ grad_y = grad_y[:, np.newaxis, :]
+ grad_x = np.matmul(grad_y, Jacobs).squeeze(axis=1)
+
+ return grad_x
+
+
+class NumpyLoss:
+
+ def __init__(self):
+ self.target = None
+
+ def get_loss(self, pred, target):
+ self.target = target
+ return (-pred * target).sum(axis=1).mean()
+
+ def backward(self):
+ return -self.target / self.target.shape[0]
+
+
+class NumpyModel:
+ def __init__(self):
+ self.W1 = np.random.normal(size=(28 * 28, 256))
+ self.W2 = np.random.normal(size=(256, 64))
+ self.W3 = np.random.normal(size=(64, 10))
+
+        # the operators below are used in forward and backward
+ self.matmul_1 = Matmul()
+ self.relu_1 = Relu()
+ self.matmul_2 = Matmul()
+ self.relu_2 = Relu()
+ self.matmul_3 = Matmul()
+ self.softmax = Softmax()
+ self.log = Log()
+
+        # the variables below are updated in backward; softmax_grad, log_grad, etc. are the gradients
+        # returned by each operator's backward pass (partial derivatives of the loss w.r.t. the operator's input)
+ self.x1_grad, self.W1_grad = None, None
+ self.relu_1_grad = None
+ self.x2_grad, self.W2_grad = None, None
+ self.relu_2_grad = None
+ self.x3_grad, self.W3_grad = None, None
+ self.softmax_grad = None
+ self.log_grad = None
+
+ self.beta_1 = 0.9
+ self.beta_2 = 0.999
+ self.epsilon = 1e-8
+ self.is_first = True
+
+ self.W1_grad_mean = None
+ self.W2_grad_mean = None
+ self.W3_grad_mean = None
+
+ self.W1_grad_square_mean = None
+ self.W2_grad_square_mean = None
+ self.W3_grad_square_mean = None
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+
+ x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
+ x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
+ x = self.matmul_3.forward(x, self.W3)
+
+ x = self.log.forward(self.softmax.forward(x))
+
+ return x
+
+ def backward(self, y):
+ self.log_grad = self.log.backward(y)
+ self.softmax_grad = self.softmax.backward(self.log_grad)
+ self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+ self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+ self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+ self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+ self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+ return self.x1_grad
+
+ def optimize(self, learning_rate):
+ def SGD():
+ self.W1 -= learning_rate * self.W1_grad
+ self.W2 -= learning_rate * self.W2_grad
+ self.W3 -= learning_rate * self.W3_grad
+
+ def SGDM():
+ if self.is_first:
+ self.is_first = False
+
+ self.W1_grad_mean = self.W1_grad
+ self.W2_grad_mean = self.W2_grad
+ self.W3_grad_mean = self.W3_grad
+ else:
+ self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad
+ self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad
+ self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad
+
+ delta_1 = learning_rate * self.W1_grad_mean
+ delta_2 = learning_rate * self.W2_grad_mean
+ delta_3 = learning_rate * self.W3_grad_mean
+
+ self.W1 -= delta_1
+ self.W2 -= delta_2
+ self.W3 -= delta_3
+
+ def Adam(learning_rate=0.001):
+ if self.is_first:
+ self.is_first = False
+ self.W1_grad_mean = self.W1_grad
+ self.W2_grad_mean = self.W2_grad
+ self.W3_grad_mean = self.W3_grad
+
+ self.W1_grad_square_mean = np.square(self.W1_grad)
+ self.W2_grad_square_mean = np.square(self.W2_grad)
+ self.W3_grad_square_mean = np.square(self.W3_grad)
+
+ self.W1 -= learning_rate * self.W1_grad_mean
+ self.W2 -= learning_rate * self.W2_grad_mean
+ self.W3 -= learning_rate * self.W3_grad_mean
+ else:
+ self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad
+ self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad
+ self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad
+
+ self.W1_grad_square_mean = self.beta_2 * self.W1_grad_square_mean + (1 - self.beta_2) * np.square(
+ self.W1_grad)
+ self.W2_grad_square_mean = self.beta_2 * self.W2_grad_square_mean + (1 - self.beta_2) * np.square(
+ self.W2_grad)
+ self.W3_grad_square_mean = self.beta_2 * self.W3_grad_square_mean + (1 - self.beta_2) * np.square(
+ self.W3_grad)
+
+ delta_1 = learning_rate * self.W1_grad_mean * np.reciprocal(
+ np.sqrt(self.W1_grad_square_mean) + np.full_like(self.W1_grad_square_mean, self.epsilon))
+ delta_2 = learning_rate * self.W2_grad_mean * np.reciprocal(
+ np.sqrt(self.W2_grad_square_mean) + np.full_like(self.W2_grad_square_mean, self.epsilon))
+ delta_3 = learning_rate * self.W3_grad_mean * np.reciprocal(
+ np.sqrt(self.W3_grad_square_mean) + np.full_like(self.W3_grad_square_mean, self.epsilon))
+
+ self.W1 -= delta_1
+ self.W2 -= delta_2
+ self.W3 -= delta_3
+
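+        # select the optimizer by uncommenting exactly one of the calls below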
+ # SGD()
+ # SGDM()
+ Adam()
diff --git a/assignment-2/submission/18307130090/numpy_mnist.py b/assignment-2/submission/18307130090/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d67f25824dabdc5791ae5cc96655affe8315e72
--- /dev/null
+++ b/assignment-2/submission/18307130090/numpy_mnist.py
@@ -0,0 +1,50 @@
+import numpy as np
+
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot
+
+
+def mini_batch(dataset, batch_size=128):
+ data = np.array([np.array(each[0]) for each in dataset])
+ label = np.array([each[1] for each in dataset])
+
+ size = data.shape[0]
+ index = np.arange(size)
+ np.random.shuffle(index)
+
+ return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)]
+
+
+def numpy_run():
+ train_dataset, test_dataset = download_mnist()
+
+ model = NumpyModel()
+ numpy_loss = NumpyLoss()
+ model.W1, model.W2, model.W3 = get_torch_initialization()
+
+ train_loss = []
+
+ epoch_number = 10
+ learning_rate = 0.1
+
+ for epoch in range(epoch_number):
+ for x, y in mini_batch(train_dataset):
+ y = one_hot(y)
+
+ y_pred = model.forward(x)
+ loss = numpy_loss.get_loss(y_pred, y)
+
+ model.backward(numpy_loss.backward())
+ model.optimize(learning_rate)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset)[0]
+ accuracy = np.mean((model.forward(x).argmax(axis=1) == y))
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ numpy_run()
diff --git a/assignment-2/submission/18307130090/tester_demo.py b/assignment-2/submission/18307130090/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420
--- /dev/null
+++ b/assignment-2/submission/18307130090/tester_demo.py
@@ -0,0 +1,182 @@
+import numpy as np
+import torch
+from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log
+
+from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss
+from torch_mnist import TorchModel
+from utils import get_torch_initialization, one_hot
+
+err_epsilon = 1e-6
+err_p = 0.4
+
+
+def check_result(numpy_result, torch_result=None):
+ if isinstance(numpy_result, list) and torch_result is None:
+ flag = True
+ for (n, t) in numpy_result:
+ flag = flag and check_result(n, t)
+ return flag
+ # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item())
+ T = (torch_result * torch.from_numpy(numpy_result) < 0).sum().item()
+ direction = T / torch_result.numel() < err_p
+ return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item()
+
+
+def case_1():
+ x = np.random.normal(size=[5, 6])
+ W = np.random.normal(size=[6, 4])
+
+ numpy_matmul = Matmul()
+ numpy_out = numpy_matmul.forward(x, W)
+ numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+ torch_W = torch.from_numpy(W).clone().requires_grad_()
+
+ torch_out = torch_matmul(torch_x, torch_W)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ (numpy_W_grad, torch_W.grad)
+ ])
+
+
+def case_2():
+ x = np.random.normal(size=[5, 6])
+
+ numpy_relu = Relu()
+ numpy_out = numpy_relu.forward(x)
+ numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_relu(torch_x)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def case_3():
+ x = np.random.uniform(low=0.0, high=1.0, size=[3, 4])
+
+ numpy_log = Log()
+ numpy_out = numpy_log.forward(x)
+ numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_log(torch_x)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def case_4():
+ x = np.random.normal(size=[4, 5])
+
+ numpy_softmax = Softmax()
+ numpy_out = numpy_softmax.forward(x)
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_softmax(torch_x, 1)
+
+ return check_result(numpy_out, torch_out)
+
+
+def case_5():
+ x = np.random.normal(size=[20, 25])
+
+ numpy_softmax = Softmax()
+ numpy_out = numpy_softmax.forward(x)
+ numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out))
+
+ torch_x = torch.from_numpy(x).clone().requires_grad_()
+
+ torch_out = torch_softmax(torch_x, 1)
+ torch_out.sum().backward()
+
+ return check_result([
+ (numpy_out, torch_out),
+ (numpy_x_grad, torch_x.grad),
+ ])
+
+
+def test_model():
+ try:
+ numpy_loss = NumpyLoss()
+ numpy_model = NumpyModel()
+ torch_model = TorchModel()
+ torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False)
+ numpy_model.W1 = torch_model.W1.detach().clone().numpy()
+ numpy_model.W2 = torch_model.W2.detach().clone().numpy()
+ numpy_model.W3 = torch_model.W3.detach().clone().numpy()
+
+ x = torch.randn((10000, 28, 28))
+ y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000)
+
+ y = one_hot(y, numpy=False)
+ x2 = x.numpy()
+ y_pred = torch_model.forward(x)
+ loss = (-y_pred * y).sum(dim=1).mean()
+ loss.backward()
+
+ y_pred_numpy = numpy_model.forward(x2)
+ numpy_loss.get_loss(y_pred_numpy, y.numpy())
+
+ check_flag_1 = check_result(y_pred_numpy, y_pred)
+ print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10))
+ except:
+ print("[Runtime Error in forward]")
+ print("+ {:12} {}/{}".format("forward", 0, 10))
+ return 0
+
+ try:
+
+ numpy_model.backward(numpy_loss.backward())
+
+ check_flag_2 = [
+ check_result(numpy_model.log_grad, torch_model.log_input.grad),
+ check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad),
+ check_result(numpy_model.W3_grad, torch_model.W3.grad),
+ check_result(numpy_model.W2_grad, torch_model.W2.grad),
+ check_result(numpy_model.W1_grad, torch_model.W1.grad)
+ ]
+ check_flag_2 = sum(check_flag_2) >= 4
+ print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20))
+ except:
+ print("[Runtime Error in backward]")
+ print("+ {:12} {}/{}".format("backward", 0, 20))
+ check_flag_2 = False
+
+ return 10 * check_flag_1 + 20 * check_flag_2
+
+
+if __name__ == "__main__":
+ testcases = [
+ ["matmul", case_1, 5],
+ ["relu", case_2, 5],
+ ["log", case_3, 5],
+ ["softmax_1", case_4, 5],
+ ["softmax_2", case_5, 10],
+ ]
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except:
+ print("[Runtime Error in {}]".format(case[0]))
+ res = 0
+ score += res
+ print("+ {:12} {}/{}".format(case[0], res, case[2]))
+ score += test_model()
+ print("{:14} {}/60".format("FINAL SCORE", score))
diff --git a/assignment-2/submission/18307130090/torch_mnist.py b/assignment-2/submission/18307130090/torch_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..6d3e214c7606e3d43dac4b94554f942508afffb3
--- /dev/null
+++ b/assignment-2/submission/18307130090/torch_mnist.py
@@ -0,0 +1,73 @@
+import torch
+from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve
+
+
+class TorchModel:
+
+ def __init__(self):
+ self.W1 = torch.randn((28 * 28, 256), requires_grad=True)
+ self.W2 = torch.randn((256, 64), requires_grad=True)
+ self.W3 = torch.randn((64, 10), requires_grad=True)
+ self.softmax_input = None
+ self.log_input = None
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+ x = torch.relu(torch.matmul(x, self.W1))
+ x = torch.relu(torch.matmul(x, self.W2))
+ x = torch.matmul(x, self.W3)
+
+ self.softmax_input = x
+ self.softmax_input.retain_grad()
+
+ x = torch.softmax(x, 1)
+
+ self.log_input = x
+ self.log_input.retain_grad()
+
+ x = torch.log(x)
+
+ return x
+
+ def optimize(self, learning_rate):
+ with torch.no_grad():
+ self.W1 -= learning_rate * self.W1.grad
+ self.W2 -= learning_rate * self.W2.grad
+ self.W3 -= learning_rate * self.W3.grad
+
+ self.W1.grad = None
+ self.W2.grad = None
+ self.W3.grad = None
+
+
+def torch_run():
+ train_dataset, test_dataset = download_mnist()
+
+ model = TorchModel()
+ model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False)
+
+ train_loss = []
+
+ epoch_number = 3
+ learning_rate = 0.1
+
+ for epoch in range(epoch_number):
+ for x, y in mini_batch(train_dataset, numpy=False):
+ y = one_hot(y, numpy=False)
+
+ y_pred = model.forward(x)
+ loss = (-y_pred * y).sum(dim=1).mean()
+ loss.backward()
+ model.optimize(learning_rate)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset, numpy=False)[0]
+ accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item()
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ torch_run()
diff --git a/assignment-2/submission/18307130090/utils.py b/assignment-2/submission/18307130090/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..709220cfa7a924d914ec1c098c505f864bcd4cfc
--- /dev/null
+++ b/assignment-2/submission/18307130090/utils.py
@@ -0,0 +1,71 @@
+import torch
+import numpy as np
+from matplotlib import pyplot as plt
+
+
+def plot_curve(data):
+ plt.plot(range(len(data)), data, color='blue')
+ plt.legend(['loss_value'], loc='upper right')
+ plt.xlabel('step')
+ plt.ylabel('value')
+ plt.show()
+
+
+def download_mnist():
+ from torchvision import datasets, transforms
+
+ transform = transforms.Compose([
+ transforms.ToTensor(),
+ transforms.Normalize(mean=(0.1307,), std=(0.3081,))
+ ])
+
+ train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True)
+ test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True)
+
+ return train_dataset, test_dataset
+
+
+def one_hot(y, numpy=True):
+ if numpy:
+ y_ = np.zeros((y.shape[0], 10))
+ y_[np.arange(y.shape[0], dtype=np.int32), y] = 1
+ return y_
+ else:
+ y_ = torch.zeros((y.shape[0], 10))
+ y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1
+ return y_
+
+
+def batch(dataset, numpy=True):
+ data = []
+ label = []
+ for each in dataset:
+ data.append(each[0])
+ label.append(each[1])
+ data = torch.stack(data)
+ label = torch.LongTensor(label)
+ if numpy:
+ return [(data.numpy(), label.numpy())]
+ else:
+ return [(data, label)]
+
+
+def mini_batch(dataset, batch_size=128, numpy=False):
+ return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
+
+
+def get_torch_initialization(numpy=True):
+ fc1 = torch.nn.Linear(28 * 28, 256)
+ fc2 = torch.nn.Linear(256, 64)
+ fc3 = torch.nn.Linear(64, 10)
+
+ if numpy:
+ W1 = fc1.weight.T.detach().clone().numpy()
+ W2 = fc2.weight.T.detach().clone().numpy()
+ W3 = fc3.weight.T.detach().clone().numpy()
+ else:
+ W1 = fc1.weight.T.detach().clone().data
+ W2 = fc2.weight.T.detach().clone().data
+ W3 = fc3.weight.T.detach().clone().data
+
+ return W1, W2, W3