diff --git a/assignment-2/submission/18307130090/README.md b/assignment-2/submission/18307130090/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..647eb99d08956f5fea84c6aa563ab3e1576cfcc6
--- /dev/null
+++ b/assignment-2/submission/18307130090/README.md
@@ -0,0 +1,276 @@
# PRML-2021 Assignment 2

Name: 夏海淞

Student ID: 18307130090

## Overview

In this assignment I implemented a simple feed-forward neural network with `NumPy`, including the backward passes of the operators in `numpy_fnn.py` and the construction of the network model. To validate the model, I trained and tested it on the MNIST dataset. I also implemented the `Momentum` and `Adam` optimization algorithms and compared their performance.

## Backward passes of the operators

### `Matmul`

`Matmul` computes
$$
Y=X\times W
$$
where $Y,X,W$ are $n\times d'$, $n\times d$ and $d\times d'$ matrices, respectively.

By equations (B.20) and (B.21) of [Neural Networks and Deep Learning (Qiu Xipeng)](https://nndl.github.io/nndl-book.pdf),
$$
\frac{\partial Y}{\partial W}=\frac{\partial(X\times W)}{\partial W}=X^T\\\\
\frac{\partial Y}{\partial X}=\frac{\partial(X\times W)}{\partial X}=W^T
$$
Combining this with the chain rule and the rules of matrix calculus gives
$$
\nabla_X=\nabla_Y\times W^T\\\\
\nabla_W=X^T\times \nabla_Y
$$

### `Relu`

`Relu` computes
$$
Y_{ij}=\begin{cases}
X_{ij}&X_{ij}\ge0\\\\
0&\text{otherwise}
\end{cases}
$$
so
$$
\frac{\partial Y_{ij}}{\partial X_{ij}}=\begin{cases}
1&X_{ij}>0\\\\
0&\text{otherwise}
\end{cases}
$$
With the chain rule, the backward pass is $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.

### `Log`

`Log` computes
$$
Y_{ij}=\ln(X_{ij}+\epsilon),\quad\epsilon=10^{-12}
$$
so
$$
\frac{\partial Y_{ij}}{\partial X_{ij}}=\frac1{X_{ij}+\epsilon}
$$
With the chain rule, the backward pass is $\nabla_{X_{ij}}=\nabla_{Y_{ij}}\cdot\frac{\partial Y_{ij}}{\partial X_{ij}}$.

### `Softmax`

`Softmax` computes
$$
Y_{ij}=\frac{\exp\{X_{ij} \}}{\sum_{k=1}^c\exp\{X_{ik} \}}
$$
where $Y,X$ are both $N\times c$ matrices. `Softmax` operates on each row of $X$ independently, so for the row vectors $X_k,Y_k$ we have
$$
\frac{\partial Y_{ki}}{\partial X_{kj}}=\begin{cases}
\frac{\exp\{X_{kj} \}(\sum_t\exp\{X_{kt}\})-\exp\{2X_{ki}\}}{(\sum_t\exp\{X_{kt}\})^2}=Y_{ki}(1-Y_{ki})&i=j\\\\
-\frac{\exp\{X_{ki} \}\exp\{X_{kj} \}}{(\sum_t\exp\{X_{kt}\})^2}=-Y_{ki}Y_{kj}&i\not=j
\end{cases}
$$
This gives the Jacobian matrix of $Y_k$ with respect to $X_k$, with $J_{ij}=\frac{\partial Y_{ki}}{\partial X_{kj}}$. By the chain rule,
$$
\nabla_{X_k}=\nabla_{Y_k}\times J
$$
Stacking the row results gives the final backward pass.

## Model construction and training

### Model construction

#### `forward`

Following the model defined by the `TorchModel` class in `torch_mnist.py`, the forward pass is built as:

```python
def forward(self, x):
    x = x.reshape(-1, 28 * 28)

    x = self.relu_1.forward(self.matmul_1.forward(x, self.W1))
    x = self.relu_2.forward(self.matmul_2.forward(x, self.W2))
    x = self.matmul_3.forward(x, self.W3)

    x = self.log.forward(self.softmax.forward(x))

    return x
```

The computation graph of the model:

![](./img/fnn_model.png)

#### `backward`

Following the computation graph, the backward passes of the operators are invoked in reverse order:

```python
def backward(self, y):
    self.log_grad = self.log.backward(y)
    self.softmax_grad = self.softmax.backward(self.log_grad)
    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)

    return self.x1_grad
```

#### `mini_batch`

`mini_batch` speeds up training while still optimizing well. Full-batch gradient descent averages the loss over the entire dataset before each backward pass, which becomes very slow on large training sets. Purely stochastic gradient descent computes the loss and gradient for one sample at a time, so the dataset size no longer affects the cost of a step, but the randomness of individual samples can keep the parameters oscillating around the optimum instead of converging to it. The compromise is to split the dataset into small batches, which speeds up training while retaining good convergence behavior.

For this assignment, I re-implemented `mini_batch` in `numpy_mnist.py`, following `mini_batch` in `utils.py`:

```python
def mini_batch(dataset, batch_size=128):
    data = np.array([np.array(each[0]) for each in dataset])
    label = np.array([each[1] for each in dataset])

    size = data.shape[0]
    index = np.arange(size)
    np.random.shuffle(index)

    return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)]
```
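#### Gradient check

Before training, the hand-derived backward passes above can also be sanity-checked against central finite differences, independently of `tester_demo.py` (which compares against PyTorch). The following is a minimal sketch of such a check, assuming only the `forward`/`backward` interfaces of `Matmul` and `Softmax` defined in `numpy_fnn.py`; the helper `numerical_grad` and the tolerances are my own illustrative choices.

```python
import numpy as np

from numpy_fnn import Matmul, Softmax


def numerical_grad(f, x, eps=1e-6):
    # Central finite-difference gradient of a scalar function f at x.
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


np.random.seed(0)
x = np.random.normal(size=(4, 5))
W = np.random.normal(size=(5, 3))
upstream = np.random.normal(size=(4, 5))  # fixed upstream gradient, so the softmax test is non-trivial

# Matmul: analytic gradients of sum(X @ W) vs. finite differences.
matmul = Matmul()
out = matmul.forward(x, W)
grad_x, grad_W = matmul.backward(np.ones_like(out))
assert np.allclose(grad_x, numerical_grad(lambda v: Matmul().forward(v, W).sum(), x.copy()), atol=1e-5)
assert np.allclose(grad_W, numerical_grad(lambda v: Matmul().forward(x, v).sum(), W.copy()), atol=1e-5)

# Softmax: analytic Jacobian-vector product vs. finite differences of sum(softmax(X) * upstream).
softmax = Softmax()
softmax.forward(x)
grad = softmax.backward(upstream)
numeric = numerical_grad(lambda v: (Softmax().forward(v) * upstream).sum(), x.copy())
assert np.allclose(grad, numeric, atol=1e-5)

print("gradient checks passed")
```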
### Model training

With `learning_rate=0.1`, `batch_size=128` and `epoch_number=10`, the training results are:

```
[0] Accuracy: 0.9486
[1] Accuracy: 0.9643
[2] Accuracy: 0.9724
[3] Accuracy: 0.9738
[4] Accuracy: 0.9781
[5] Accuracy: 0.9768
[6] Accuracy: 0.9796
[7] Accuracy: 0.9802
[8] Accuracy: 0.9800
[9] Accuracy: 0.9796
```

![](./img/SGD_normal.png)

Shrinking the batch size to `batch_size=64` gives:

```
[0] Accuracy: 0.9597
[1] Accuracy: 0.9715
[2] Accuracy: 0.9739
[3] Accuracy: 0.9771
[4] Accuracy: 0.9775
[5] Accuracy: 0.9803
[6] Accuracy: 0.9808
[7] Accuracy: 0.9805
[8] Accuracy: 0.9805
[9] Accuracy: 0.9716
```

![](./img/SGD_batch_size.png)

Lowering the learning rate to `learning_rate=0.01` gives:

```
[0] Accuracy: 0.8758
[1] Accuracy: 0.9028
[2] Accuracy: 0.9143
[3] Accuracy: 0.9234
[4] Accuracy: 0.9298
[5] Accuracy: 0.9350
[6] Accuracy: 0.9397
[7] Accuracy: 0.9434
[8] Accuracy: 0.9459
[9] Accuracy: 0.9501
```

![](./img/SGD_learning_rate.png)

From these results: with a suitable learning rate and batch size, the parameters converge more slowly as the learning rate decreases, and they oscillate more as the batch size decreases.

## Improving gradient descent

Plain gradient descent can be written as
$$
w_{t+1}=w_t-\eta\cdot\nabla f(w_t)
$$
Although gradient descent is widely used as an optimizer, it has some drawbacks:

- The update direction is determined entirely by the current gradient, so with a large learning rate the parameters may oscillate around the optimum.
- The learning rate cannot adapt to the training progress, so convergence is slow early in training and may fail late in training.

Many improved variants of gradient descent address these issues; `Momentum` and `Adam` are two typical ones.

### `Momentum`

To address "the update direction is determined entirely by the current gradient", `Momentum` introduces the notion of momentum.

By analogy with the physical world, when a ball rolls downhill its direction of motion depends not only on how steep the current position is, but also on its current velocity, i.e. on how steep the previous positions were. In `Momentum`, the parameter update therefore depends not on the current gradient alone but on an exponential moving average of the gradients over time:
$$
m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
w_{t+1}=w_t-\eta\cdot m_t
$$
The exponential moving average acts as inertia in the parameter updates. When the update direction is correct, `Momentum` speeds up training and damps oscillation; when the update direction is wrong, it loses some performance because it cannot change direction immediately.

Training with `Momentum` gives:

```
[0] Accuracy: 0.9444
[1] Accuracy: 0.9627
[2] Accuracy: 0.9681
[3] Accuracy: 0.9731
[4] Accuracy: 0.9765
[5] Accuracy: 0.9755
[6] Accuracy: 0.9768
[7] Accuracy: 0.9790
[8] Accuracy: 0.9794
[9] Accuracy: 0.9819
```

![](./img/SGDM.png)

Compared with plain gradient descent, there is no clear advantage.

### `Adam`

To address "the learning rate cannot adapt to the training progress", `Adam` builds on `Momentum` by adding a second moment ("second-order momentum").

The idea is as follows. A neural network has a large number of parameters, and they are not all updated equally often. For frequently updated parameters we want a somewhat smaller learning rate, to make convergence more likely; for the others we want a somewhat larger learning rate, to speed up convergence. Moreover, how often a parameter is updated may change over time, so the learning rate should adapt dynamically as well.

Since the size of an update is directly related to the current gradient, the accumulated sum of squared historical gradients can serve as a measure of how frequently a parameter is updated: if it is large, the parameter has been updated frequently and its learning rate should be reduced. Gradient descent then becomes
$$
m_t=\beta\cdot m_{t-1}+(1-\beta)\cdot\nabla f(w_t)\\\\
V_t=V_{t-1}+\nabla f(w_t)\odot\nabla f(w_t)\\\\
w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
$$
where $\odot$ denotes the element-wise product. However, $V_t$ is monotonically increasing in $t$, which may make the learning rate too small late in training so that the parameters cannot reach the optimum. Replacing $V_t$ with an exponential moving average as well avoids this:
$$
m_t=\beta_1\cdot m_{t-1}+(1-\beta_1)\cdot\nabla f(w_t)\\\\
V_t=\beta_2\cdot V_{t-1}+(1-\beta_2)\cdot\nabla f(w_t)\odot\nabla f(w_t)\\\\
w_{t+1}=w_t-\frac\eta{\sqrt{V_t}}\cdot m_t
$$
(In the implementation, a small $\epsilon$ is added to the denominator for numerical stability.)

Training with `Adam` gives:

```
[0] Accuracy: 0.9657
[1] Accuracy: 0.9724
[2] Accuracy: 0.9759
[3] Accuracy: 0.9769
[4] Accuracy: 0.9788
[5] Accuracy: 0.9778
[6] Accuracy: 0.9775
[7] Accuracy: 0.9759
[8] Accuracy: 0.9786
[9] Accuracy: 0.9779
```

![](./img/Adam.png)

Compared with plain gradient descent, the loss oscillates less, while the convergence speed is about the same.
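To make the two update rules above concrete, here is a minimal, standalone sketch of them on a toy quadratic objective. It is an illustration only, not the assignment's `optimize` code; the function names (`momentum_step`, `adam_step`, `grad_f`) and the toy objective are my own, but the updates mirror the formulas above (exponential moving averages for both moments, a small `eps` in the denominator, and no bias correction, as in `numpy_fnn.py`).

```python
import numpy as np


def momentum_step(w, grad, m, lr=0.1, beta=0.9):
    # m_t = beta * m_{t-1} + (1 - beta) * grad;  w_{t+1} = w_t - lr * m_t
    m = beta * m + (1 - beta) * grad
    return w - lr * m, m


def adam_step(w, grad, m, v, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    # First and second moments are exponential moving averages of the gradient
    # and of its element-wise square; no bias correction, matching the formulas above.
    m = beta_1 * m + (1 - beta_1) * grad
    v = beta_2 * v + (1 - beta_2) * np.square(grad)
    return w - lr * m / (np.sqrt(v) + eps), m, v


# Toy objective: f(w) = 0.5 * w^T diag(A) w with an ill-conditioned diagonal,
# so the two coordinates call for very different step sizes.
A = np.array([10.0, 0.1])


def grad_f(w):
    return A * w


w_m = np.array([1.0, 1.0])
w_a = np.array([1.0, 1.0])
m_m = np.zeros(2)
m_a, v_a = np.zeros(2), np.zeros(2)

for step in range(200):
    w_m, m_m = momentum_step(w_m, grad_f(w_m), m_m)
    w_a, m_a, v_a = adam_step(w_a, grad_f(w_a), m_a, v_a, lr=0.05)

print("Momentum:", w_m, "Adam:", w_a)
```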
diff --git a/assignment-2/submission/18307130090/img/Adam.png b/assignment-2/submission/18307130090/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe0326ebad52ad9356bdd7410834d9d61e9e5152
Binary files /dev/null and b/assignment-2/submission/18307130090/img/Adam.png differ
diff --git a/assignment-2/submission/18307130090/img/SGDM.png b/assignment-2/submission/18307130090/img/SGDM.png
new file mode 100644
index 0000000000000000000000000000000000000000..ba7ad91c5569f2605e7944afe3803863b8072b46
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGDM.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_batch_size.png b/assignment-2/submission/18307130090/img/SGD_batch_size.png
new file mode 100644
index 0000000000000000000000000000000000000000..328c4cc7bf90ef75a09f8c97ee8e9134d44a33dd
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_batch_size.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_learning_rate.png b/assignment-2/submission/18307130090/img/SGD_learning_rate.png
new file mode 100644
index 0000000000000000000000000000000000000000..7bca928d1aa569b08dad43d761da1b6e27e02942
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_learning_rate.png differ
diff --git a/assignment-2/submission/18307130090/img/SGD_normal.png b/assignment-2/submission/18307130090/img/SGD_normal.png
new file mode 100644
index 0000000000000000000000000000000000000000..e6f3933e1bf979fa7b3b643d8f7fe823610109e9
Binary files /dev/null and b/assignment-2/submission/18307130090/img/SGD_normal.png differ
diff --git a/assignment-2/submission/18307130090/img/fnn_model.png b/assignment-2/submission/18307130090/img/fnn_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..29ed50732a88ed1ca38a1cb3c6e82099a3d3e087
Binary files /dev/null and b/assignment-2/submission/18307130090/img/fnn_model.png differ
diff --git a/assignment-2/submission/18307130090/numpy_fnn.py b/assignment-2/submission/18307130090/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..7010cad4609f7ae31b8bdc0b19cedc005c5b950c
--- /dev/null
+++ b/assignment-2/submission/18307130090/numpy_fnn.py
@@ -0,0 +1,239 @@
import numpy as np


class NumpyOp:

    def __init__(self):
        self.memory = {}
        self.epsilon = 1e-12


class Matmul(NumpyOp):

    def forward(self, x, W):
        """
        x: shape(N, d)
        W: shape(d, d')
        """
        self.memory['x'] = x
        self.memory['W'] = W
        h = np.matmul(x, W)
        return h

    def backward(self, grad_y):
        """
        grad_y: shape(N, d')
        """
        x, W = self.memory['x'], self.memory['W']
        grad_x = np.matmul(grad_y, W.T)
        grad_W = np.matmul(x.T, grad_y)

        return grad_x, grad_W


class Relu(NumpyOp):

    def forward(self, x):
        self.memory['x'] = x
        return np.where(x > 0, x, np.zeros_like(x))

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        x = self.memory['x']
        grad_x = grad_y * np.where(x > 0, np.ones_like(x), np.zeros_like(x))

        return grad_x


class Log(NumpyOp):

    def forward(self, x):
        """
        x: shape(N, c)
        """

        out = np.log(x + self.epsilon)
        self.memory['x'] = x

        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        x = self.memory['x']
        grad_x = grad_y * np.reciprocal(x + self.epsilon)

        return grad_x


class Softmax(NumpyOp):
    """
    softmax over last dimension
    """

    def forward(self, x):
        """
        x: shape(N, c)
        """
        # subtract the row-wise maximum for numerical stability (softmax is shift-invariant)
        exp_x = np.exp(x - x.max(axis=1, keepdims=True))
        exp_sum = np.sum(exp_x, axis=1, keepdims=True)
        out = exp_x / exp_sum
        self.memory['x'] = x
        self.memory['out'] = out

        return out

    def backward(self, grad_y):
        """
        grad_y: same shape as x
        """
        sm = self.memory['out']
        Jacobs = np.array([np.diag(r) - np.outer(r, r) for r in sm])

        grad_y = grad_y[:, np.newaxis, :]
        grad_x
= np.matmul(grad_y, Jacobs).squeeze(axis=1) + + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + self.beta_1 = 0.9 + self.beta_2 = 0.999 + self.epsilon = 1e-8 + self.is_first = True + + self.W1_grad_mean = None + self.W2_grad_mean = None + self.W3_grad_mean = None + + self.W1_grad_square_mean = None + self.W2_grad_square_mean = None + self.W3_grad_square_mean = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + x = self.relu_1.forward(self.matmul_1.forward(x, self.W1)) + x = self.relu_2.forward(self.matmul_2.forward(x, self.W2)) + x = self.matmul_3.forward(x, self.W3) + + x = self.log.forward(self.softmax.forward(x)) + + return x + + def backward(self, y): + self.log_grad = self.log.backward(y) + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad) + + return self.x1_grad + + def optimize(self, learning_rate): + def SGD(): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def SGDM(): + if self.is_first: + self.is_first = False + + self.W1_grad_mean = self.W1_grad + self.W2_grad_mean = self.W2_grad + self.W3_grad_mean = self.W3_grad + else: + self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad + self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad + self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad + + delta_1 = learning_rate * self.W1_grad_mean + delta_2 = learning_rate * self.W2_grad_mean + delta_3 = learning_rate * self.W3_grad_mean + + self.W1 -= delta_1 + self.W2 -= delta_2 + self.W3 -= delta_3 + + def Adam(learning_rate=0.001): + if self.is_first: + self.is_first = False + self.W1_grad_mean = self.W1_grad + self.W2_grad_mean = self.W2_grad + self.W3_grad_mean = self.W3_grad + + self.W1_grad_square_mean = np.square(self.W1_grad) + self.W2_grad_square_mean = np.square(self.W2_grad) + self.W3_grad_square_mean = np.square(self.W3_grad) + + self.W1 -= learning_rate * self.W1_grad_mean + self.W2 -= learning_rate * self.W2_grad_mean + self.W3 -= learning_rate * self.W3_grad_mean + else: + self.W1_grad_mean = self.beta_1 * self.W1_grad_mean + (1 - self.beta_1) * self.W1_grad + self.W2_grad_mean = self.beta_1 * self.W2_grad_mean + (1 - self.beta_1) * self.W2_grad + 
self.W3_grad_mean = self.beta_1 * self.W3_grad_mean + (1 - self.beta_1) * self.W3_grad + + self.W1_grad_square_mean = self.beta_2 * self.W1_grad_square_mean + (1 - self.beta_2) * np.square( + self.W1_grad) + self.W2_grad_square_mean = self.beta_2 * self.W2_grad_square_mean + (1 - self.beta_2) * np.square( + self.W2_grad) + self.W3_grad_square_mean = self.beta_2 * self.W3_grad_square_mean + (1 - self.beta_2) * np.square( + self.W3_grad) + + delta_1 = learning_rate * self.W1_grad_mean * np.reciprocal( + np.sqrt(self.W1_grad_square_mean) + np.full_like(self.W1_grad_square_mean, self.epsilon)) + delta_2 = learning_rate * self.W2_grad_mean * np.reciprocal( + np.sqrt(self.W2_grad_square_mean) + np.full_like(self.W2_grad_square_mean, self.epsilon)) + delta_3 = learning_rate * self.W3_grad_mean * np.reciprocal( + np.sqrt(self.W3_grad_square_mean) + np.full_like(self.W3_grad_square_mean, self.epsilon)) + + self.W1 -= delta_1 + self.W2 -= delta_2 + self.W3 -= delta_3 + + # SGD() + # SGDM() + Adam() diff --git a/assignment-2/submission/18307130090/numpy_mnist.py b/assignment-2/submission/18307130090/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d67f25824dabdc5791ae5cc96655affe8315e72 --- /dev/null +++ b/assignment-2/submission/18307130090/numpy_mnist.py @@ -0,0 +1,50 @@ +import numpy as np + +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot + + +def mini_batch(dataset, batch_size=128): + data = np.array([np.array(each[0]) for each in dataset]) + label = np.array([each[1] for each in dataset]) + + size = data.shape[0] + index = np.arange(size) + np.random.shuffle(index) + + return [(data[index[i:i + batch_size]], label[index[i:i + batch_size]]) for i in range(0, size, batch_size)] + + +def numpy_run(): + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = NumpyLoss() + model.W1, model.W2, model.W3 = get_torch_initialization() + + train_loss = [] + + epoch_number = 10 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset): + y = one_hot(y) + + y_pred = model.forward(x) + loss = numpy_loss.get_loss(y_pred, y) + + model.backward(numpy_loss.backward()) + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset)[0] + accuracy = np.mean((model.forward(x).argmax(axis=1) == y)) + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + numpy_run() diff --git a/assignment-2/submission/18307130090/tester_demo.py b/assignment-2/submission/18307130090/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..504b3eef50a6df4d0aa433113136add50835e420 --- /dev/null +++ b/assignment-2/submission/18307130090/tester_demo.py @@ -0,0 +1,182 @@ +import numpy as np +import torch +from torch import matmul as torch_matmul, relu as torch_relu, softmax as torch_softmax, log as torch_log + +from numpy_fnn import Matmul, Relu, Softmax, Log, NumpyModel, NumpyLoss +from torch_mnist import TorchModel +from utils import get_torch_initialization, one_hot + +err_epsilon = 1e-6 +err_p = 0.4 + + +def check_result(numpy_result, torch_result=None): + if isinstance(numpy_result, list) and torch_result is None: + flag = True + for (n, t) in numpy_result: + flag = flag and check_result(n, t) + return flag + # print((torch.from_numpy(numpy_result) - torch_result).abs().mean().item()) + T = (torch_result * 
torch.from_numpy(numpy_result) < 0).sum().item() + direction = T / torch_result.numel() < err_p + return direction and ((torch.from_numpy(numpy_result) - torch_result).abs().mean() < err_epsilon).item() + + +def case_1(): + x = np.random.normal(size=[5, 6]) + W = np.random.normal(size=[6, 4]) + + numpy_matmul = Matmul() + numpy_out = numpy_matmul.forward(x, W) + numpy_x_grad, numpy_W_grad = numpy_matmul.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + torch_W = torch.from_numpy(W).clone().requires_grad_() + + torch_out = torch_matmul(torch_x, torch_W) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + (numpy_W_grad, torch_W.grad) + ]) + + +def case_2(): + x = np.random.normal(size=[5, 6]) + + numpy_relu = Relu() + numpy_out = numpy_relu.forward(x) + numpy_x_grad = numpy_relu.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_relu(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def case_3(): + x = np.random.uniform(low=0.0, high=1.0, size=[3, 4]) + + numpy_log = Log() + numpy_out = numpy_log.forward(x) + numpy_x_grad = numpy_log.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_log(torch_x) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + + (numpy_x_grad, torch_x.grad), + ]) + + +def case_4(): + x = np.random.normal(size=[4, 5]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + + return check_result(numpy_out, torch_out) + + +def case_5(): + x = np.random.normal(size=[20, 25]) + + numpy_softmax = Softmax() + numpy_out = numpy_softmax.forward(x) + numpy_x_grad = numpy_softmax.backward(np.ones_like(numpy_out)) + + torch_x = torch.from_numpy(x).clone().requires_grad_() + + torch_out = torch_softmax(torch_x, 1) + torch_out.sum().backward() + + return check_result([ + (numpy_out, torch_out), + (numpy_x_grad, torch_x.grad), + ]) + + +def test_model(): + try: + numpy_loss = NumpyLoss() + numpy_model = NumpyModel() + torch_model = TorchModel() + torch_model.W1.data, torch_model.W2.data, torch_model.W3.data = get_torch_initialization(numpy=False) + numpy_model.W1 = torch_model.W1.detach().clone().numpy() + numpy_model.W2 = torch_model.W2.detach().clone().numpy() + numpy_model.W3 = torch_model.W3.detach().clone().numpy() + + x = torch.randn((10000, 28, 28)) + y = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 0] * 1000) + + y = one_hot(y, numpy=False) + x2 = x.numpy() + y_pred = torch_model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + + y_pred_numpy = numpy_model.forward(x2) + numpy_loss.get_loss(y_pred_numpy, y.numpy()) + + check_flag_1 = check_result(y_pred_numpy, y_pred) + print("+ {:12} {}/{}".format("forward", 10 * check_flag_1, 10)) + except: + print("[Runtime Error in forward]") + print("+ {:12} {}/{}".format("forward", 0, 10)) + return 0 + + try: + + numpy_model.backward(numpy_loss.backward()) + + check_flag_2 = [ + check_result(numpy_model.log_grad, torch_model.log_input.grad), + check_result(numpy_model.softmax_grad, torch_model.softmax_input.grad), + check_result(numpy_model.W3_grad, torch_model.W3.grad), + check_result(numpy_model.W2_grad, torch_model.W2.grad), + check_result(numpy_model.W1_grad, 
torch_model.W1.grad) + ] + check_flag_2 = sum(check_flag_2) >= 4 + print("+ {:12} {}/{}".format("backward", 20 * check_flag_2, 20)) + except: + print("[Runtime Error in backward]") + print("+ {:12} {}/{}".format("backward", 0, 20)) + check_flag_2 = False + + return 10 * check_flag_1 + 20 * check_flag_2 + + +if __name__ == "__main__": + testcases = [ + ["matmul", case_1, 5], + ["relu", case_2, 5], + ["log", case_3, 5], + ["softmax_1", case_4, 5], + ["softmax_2", case_5, 10], + ] + score = 0 + for case in testcases: + try: + res = case[2] if case[1]() else 0 + except: + print("[Runtime Error in {}]".format(case[0])) + res = 0 + score += res + print("+ {:12} {}/{}".format(case[0], res, case[2])) + score += test_model() + print("{:14} {}/60".format("FINAL SCORE", score)) diff --git a/assignment-2/submission/18307130090/torch_mnist.py b/assignment-2/submission/18307130090/torch_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..6d3e214c7606e3d43dac4b94554f942508afffb3 --- /dev/null +++ b/assignment-2/submission/18307130090/torch_mnist.py @@ -0,0 +1,73 @@ +import torch +from utils import mini_batch, batch, download_mnist, get_torch_initialization, one_hot, plot_curve + + +class TorchModel: + + def __init__(self): + self.W1 = torch.randn((28 * 28, 256), requires_grad=True) + self.W2 = torch.randn((256, 64), requires_grad=True) + self.W3 = torch.randn((64, 10), requires_grad=True) + self.softmax_input = None + self.log_input = None + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + x = torch.relu(torch.matmul(x, self.W1)) + x = torch.relu(torch.matmul(x, self.W2)) + x = torch.matmul(x, self.W3) + + self.softmax_input = x + self.softmax_input.retain_grad() + + x = torch.softmax(x, 1) + + self.log_input = x + self.log_input.retain_grad() + + x = torch.log(x) + + return x + + def optimize(self, learning_rate): + with torch.no_grad(): + self.W1 -= learning_rate * self.W1.grad + self.W2 -= learning_rate * self.W2.grad + self.W3 -= learning_rate * self.W3.grad + + self.W1.grad = None + self.W2.grad = None + self.W3.grad = None + + +def torch_run(): + train_dataset, test_dataset = download_mnist() + + model = TorchModel() + model.W1.data, model.W2.data, model.W3.data = get_torch_initialization(numpy=False) + + train_loss = [] + + epoch_number = 3 + learning_rate = 0.1 + + for epoch in range(epoch_number): + for x, y in mini_batch(train_dataset, numpy=False): + y = one_hot(y, numpy=False) + + y_pred = model.forward(x) + loss = (-y_pred * y).sum(dim=1).mean() + loss.backward() + model.optimize(learning_rate) + + train_loss.append(loss.item()) + + x, y = batch(test_dataset, numpy=False)[0] + accuracy = model.forward(x).argmax(dim=1).eq(y).float().mean().item() + print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy)) + + plot_curve(train_loss) + + +if __name__ == "__main__": + torch_run() diff --git a/assignment-2/submission/18307130090/utils.py b/assignment-2/submission/18307130090/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..709220cfa7a924d914ec1c098c505f864bcd4cfc --- /dev/null +++ b/assignment-2/submission/18307130090/utils.py @@ -0,0 +1,71 @@ +import torch +import numpy as np +from matplotlib import pyplot as plt + + +def plot_curve(data): + plt.plot(range(len(data)), data, color='blue') + plt.legend(['loss_value'], loc='upper right') + plt.xlabel('step') + plt.ylabel('value') + plt.show() + + +def download_mnist(): + from torchvision import datasets, transforms + + transform = transforms.Compose([ + transforms.ToTensor(), + 
transforms.Normalize(mean=(0.1307,), std=(0.3081,)) + ]) + + train_dataset = datasets.MNIST(root="./data/", transform=transform, train=True, download=True) + test_dataset = datasets.MNIST(root="./data/", transform=transform, train=False, download=True) + + return train_dataset, test_dataset + + +def one_hot(y, numpy=True): + if numpy: + y_ = np.zeros((y.shape[0], 10)) + y_[np.arange(y.shape[0], dtype=np.int32), y] = 1 + return y_ + else: + y_ = torch.zeros((y.shape[0], 10)) + y_[torch.arange(y.shape[0], dtype=torch.long), y] = 1 + return y_ + + +def batch(dataset, numpy=True): + data = [] + label = [] + for each in dataset: + data.append(each[0]) + label.append(each[1]) + data = torch.stack(data) + label = torch.LongTensor(label) + if numpy: + return [(data.numpy(), label.numpy())] + else: + return [(data, label)] + + +def mini_batch(dataset, batch_size=128, numpy=False): + return torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True) + + +def get_torch_initialization(numpy=True): + fc1 = torch.nn.Linear(28 * 28, 256) + fc2 = torch.nn.Linear(256, 64) + fc3 = torch.nn.Linear(64, 10) + + if numpy: + W1 = fc1.weight.T.detach().clone().numpy() + W2 = fc2.weight.T.detach().clone().numpy() + W3 = fc3.weight.T.detach().clone().numpy() + else: + W1 = fc1.weight.T.detach().clone().data + W2 = fc2.weight.T.detach().clone().data + W3 = fc3.weight.T.detach().clone().data + + return W1, W2, W3