diff --git a/assignment-2/submission/17307130243/README.md b/assignment-2/submission/17307130243/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a0f392231337f491e92e7b3e730d9f93652612a5
--- /dev/null
+++ b/assignment-2/submission/17307130243/README.md
@@ -0,0 +1,337 @@
+# Assignment 2: Feedforward Neural Network
+
+------
+
+## Backward-Pass Derivation and Implementation of the Operators
+
+------
+
+Throughout, let $L$ denote the loss, let $Y$ denote the forward output of each operator, and let $\frac{\partial L}{\partial Y}$ denote the gradient of $L$ with respect to $Y$.
+
+### Matmul
+For each row $x$ of the input $X$, let $y$ be the corresponding row of $Y$, so that $xW=y$.
+The Jacobian of this vector-valued map is $\frac{\partial y}{\partial x}=W^T$, so the chain rule gives $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\frac{\partial y}{\partial x}=\frac{\partial L}{\partial y}W^T$, and hence $\frac{\partial L}{\partial X}=\frac{\partial L}{\partial Y}W^T$.
+Similarly, $\frac{\partial L}{\partial W}=X^T\frac{\partial L}{\partial Y}$.
+
+The code is as follows:
+
+
+```{python}
+    def backward(self, grad_y):
+        """
+        grad_y: shape(N, d')
+        """
+        grad_x = np.matmul(grad_y, self.memory['W'].T)
+        grad_W = np.matmul(self.memory['x'].T, grad_y)
+
+        return grad_x, grad_W
+```
+
+### ReLU
+
+$\frac{\partial y}{\partial x}=\begin{cases}0 \quad x \leq 0 \\\\ 1 \quad x > 0\end{cases}$ (the subgradient at $x=0$ is taken to be $0$, matching the implementation)
+Hence, by the chain rule, element-wise for each row, $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\odot\frac{\partial y}{\partial x}=\begin{cases}0 \quad x \leq 0 \\\\ \frac{\partial L}{\partial y}\quad x > 0\end{cases}$
+
+```{python}
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+        # pass the gradient through only where x > 0
+        x = self.memory['x']
+        grad_x = np.where(x > 0, grad_y, np.zeros_like(grad_y))
+
+        return grad_x
+```
+
+### Log
+
+$\frac{\partial y}{\partial x}=\frac{1}{x+\epsilon}$
+
+Hence, by the chain rule, element-wise for each row, $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\odot \frac{1}{x+\epsilon}$
+```{python}
+
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+        x = self.memory['x']
+        grad_x = (1 / (x + self.epsilon)) * grad_y
+
+        return grad_x
+```
+
+### Softmax
+
+- forward
+In the implementation, to avoid numerical overflow, the row-wise maximum is subtracted from each row of the input before exponentiation.
+
+```{python}
+    def forward(self, x):
+        """
+        x: shape(N, c)
+        """
+        self.memory['x'] = x
+        exp = np.exp(x - np.max(x, axis=1, keepdims=True))
+        out = exp / np.sum(exp, axis=1, keepdims=True)
+        self.memory['out'] = out
+
+        return out
+```
+
+- backward
+Following the textbook appendix, for each row:
+
+![softmax](./img/softmax.png)
+
+The implementation is as follows:
+
+```{python}
+
+    def backward(self, grad_y):
+        """
+        grad_y: same shape as x
+        """
+        out = self.memory['out']
+        J = np.array([np.diag(i) - np.outer(i, i) for i in out])
+
+        grad_y = grad_y[:, np.newaxis, :]
+        grad_x = np.matmul(grad_y, J).squeeze(axis=1)
+
+        return grad_x
+
+```
+(The submitted `numpy_fnn.py` computes the same product in a vectorised form, without materialising the per-row Jacobians.)
+
+## Experiments
+
+------
+The experiments mainly modify the `numpy_run` function in `numpy_mnist.py` so that it accepts five arguments (`learning_rate`, `epoch_number`, `batch_size`, `optimizer`, and `max_iter`), trains the network defined in `numpy_fnn.py`, and visualises the results. The later experiments pass different arguments to `numpy_run` to analyse and compare how each parameter affects the model's learning behaviour.
+
+### Building the Network
+
+Reading the `TorchModel` class in `torch_mnist.py` shows that it defines the feedforward network structure illustrated below:
+> Network architecture
+![network architecture](./img/net_structure.png)
+
+> Computation graph
+![computation graph](./img/net_structure-1.png)
+
+The same network is then built with the `numpy` operators implemented above, mirroring the `TorchModel` code:
+
+- forward
+
+```{python}
+    x = self.matmul_1.forward(x, self.W1)
+    x = self.relu_1.forward(x)
+
+    x = self.matmul_2.forward(x, self.W2)
+    x = self.relu_2.forward(x)
+
+    x = self.matmul_3.forward(x, self.W3)
+    x = self.softmax.forward(x)
+
+    x = self.log.forward(x)
+```
+
+- backward
+
+
+```{python}
+    self.log_grad = self.log.backward(y)
+
+    self.softmax_grad = self.softmax.backward(self.log_grad)
+    self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+
+    self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+    self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+
+    self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+    self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+```
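+
+As a quick sanity check on the hand-derived backward passes above, the analytic gradients can be compared against numerical (central-difference) gradients on small random inputs. The sketch below is for illustration only and is not part of the submitted code; it assumes the operator classes in `numpy_fnn.py` can be imported, and checks the `Matmul` operator (the other operators can be checked in the same way).
+
+```{python}
+import numpy as np
+
+from numpy_fnn import Matmul
+
+
+def numerical_grad(f, x, eps=1e-6):
+    """Central-difference gradient of the scalar-valued f with respect to x."""
+    grad = np.zeros_like(x)
+    it = np.nditer(x, flags=['multi_index'])
+    while not it.finished:
+        idx = it.multi_index
+        old = x[idx]
+        x[idx] = old + eps
+        f_plus = f(x)
+        x[idx] = old - eps
+        f_minus = f(x)
+        x[idx] = old
+        grad[idx] = (f_plus - f_minus) / (2 * eps)
+        it.iternext()
+    return grad
+
+
+np.random.seed(0)
+x, W = np.random.randn(4, 3), np.random.randn(3, 5)
+op = Matmul()
+
+op.forward(x, W)                               # populate op.memory
+grad_x, grad_W = op.backward(np.ones((4, 5)))  # analytic gradients of sum(Y)
+
+num_grad_x = numerical_grad(lambda x_: op.forward(x_, W).sum(), x)
+num_grad_W = numerical_grad(lambda W_: op.forward(x, W_).sum(), W)
+print(np.allclose(grad_x, num_grad_x), np.allclose(grad_W, num_grad_W))  # True True
+```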
+
+### A numpy Implementation of mini-batch
+
+The `numpy` mini-batch implementation is the `mini_batch` function in `numpy_mnist.py`: it shuffles the dataset and splits it into batches of the requested size (batch size), and each gradient-descent step then uses one batch in turn to compute the gradient. Below is a visual comparison between the original framework's mini-batching (built on PyTorch's `DataLoader`) and this `numpy` implementation. Both runs use the MNIST dataset and the network built above, trained with learning rate 0.01, batch size 128, and 5 epochs. The two loss curves are similar and the accuracies (measured on the test set, as in `numpy_run`) are close, indicating that the `numpy` mini-batch implementation is essentially correct.
+
+![mini-batch comparison](./img/mini_batch.png)
+
+> Left: results with the original mini-batch implementation
+
+> Right: results with the self-implemented mini-batch
+>
+|Accuracy|epoch 1|epoch 2|epoch 3|epoch 4|epoch 5|
+|--|--|--|--|--|--|
+|torch mini-batch|0.9510|0.9640|0.9729|0.9773|0.9773|
+|numpy mini-batch|0.9337|0.9636|0.9716|0.9748|0.9756|
+
+
+### Learning-Rate Experiment
+
+This experiment trains the model with several different learning rates and analyses how the learning rate affects model performance.
+
+Learning rates used:
+
+|learning_rate|0.0001|0.001|0.01|0.1|1|2|5|
+|--|--|--|--|--|--|--|--|
+
+Other parameters (held fixed):
+
+|epoch number|batch size|optimizer|
+|--|--|--|
+|15|128|SGD|
+
+#### Results
+
+> Accuracy vs. epoch for different learning rates
+![lr](./img/learning_rate.png)
+
+The figure shows that when the learning rate is too large (here, 1 or greater), accuracy stays flat at around 0.1 and the model fails to learn. Since the network uses ReLU activations, a plausible explanation is that the overly large updates push units into the dead-ReLU regime, after which their gradients vanish. For the remaining candidate values, accuracy improves over training, and the larger the learning rate, the faster the improvement. With a learning rate of 0.1 the model already reaches high accuracy after the first epoch and stays ahead of the other settings throughout training. With 0.0001 the accuracy does keep rising, but it starts low and improves slowly, so this is not a good choice.
+
+> Training loss (sampled every 50 iterations) under different learning rates:
+>
+|learning rate = 0.0001|learning rate = 0.001|
+|--|--|
+|![lr=0.0001](./img/lr_1.png)|![lr=0.001](./img/lr_2.png)|
+|**learning rate = 0.01**|**learning rate = 0.1**|
+|![lr=0.01](./img/lr_3.png)|![lr=0.1](./img/lr_4.png)|
+
+Except for the 0.1 case, the training loss under the other settings decreases at different speeds but still fluctuates considerably and is not close to convergence.
+
+|Accuracy|epoch 1|epoch 3|epoch 5|epoch 7|epoch 9|epoch 11|epoch 13|epoch 15|
+|--|--|--|--|--|--|--|--|--|
+|lr = 0.0001|0.1043|0.1578|0.2507|0.3498|0.4308|0.5035|0.5736|0.6268|
+|lr = 0.001|0.4540|0.7319|0.8226|0.8503|0.8675|0.8798|0.8883|0.8944|
+|lr = 0.01|0.8692|0.9148|0.9287|0.9379|0.9447|0.9509|0.9566|0.9602|
+|lr = 0.1|0.9417|0.9637|0.9746|0.9774|0.9787|0.9705|0.9781|0.9801|
+
+**Summary**: For both training speed and final performance, the learning rate should not be too small; but an overly large learning rate can cause numerical problems, dead units, or unstable training, so it must be avoided as well.
+
+### Batch-Size Experiment
+
+This experiment trains the model with different batch sizes and analyses their effect on performance.
+
+Batch sizes used:
+|batch-size|1|16|32|64|128|256|512|1024|
+|--|--|--|--|--|--|--|--|--|
+
+Other parameters:
+|learning rate|epoch number|optimizer|max_iter|
+|--|--|--|--|
+|0.01|5|SGD|5000|
+
+Because the amount of data is fixed, the batch size determines how many gradient-descent iterations are performed. To separate the effect of batch size from the benefit of simply taking more update steps, the code caps the iteration count with `max_iter` set to 5000: training stops once 5000 iterations or 5 epochs are reached.
+
+#### Results
+
+![batch size](./img/batch_size.png)
+|batch size=1|batch size=16|batch size=32|batch size=64|
+|--|--|--|--|
+|![b=1](./img/b1_1.png)|![b=16](./img/b16_1.png)|![b=32](./img/b32_1.png)|![b=64](./img/b64_1.png)|
+|**batch size=128**|**batch size=256**|**batch size=512**|**batch size=1024**|
+|![b=128](./img/b128_1.png)|![b=256](./img/b256_1.png)|![b=512](./img/b512_1.png)|![b=1024](./img/b1024_1.png)|
+
+As the batch size grows, each batch resembles the full dataset more closely and training becomes more stable; however, the loss falls more slowly and training takes longer, so the batch size should also be chosen moderately.
+
+
+### Optimizer Experiment
+
+------
+This part follows [CS231n](https://cs231n.github.io/neural-networks-3/#sgd).
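+
+Before going through each update rule, the following self-contained toy example (not part of the submission) applies all five updates to the simple quadratic $f(w)=\frac{1}{2}\|w\|^2$, whose gradient is just $w$. The helper functions below are purely illustrative; how the rules are actually implemented in `NumpyModel` is described afterwards.
+
+```{python}
+import numpy as np
+
+
+def run(update, steps=100, lr=0.1):
+    """Minimise f(w) = 0.5 * ||w||^2 with a given update rule."""
+    w, state = np.array([5.0, -3.0]), {}
+    for t in range(1, steps + 1):
+        g = w                                # gradient of 0.5 * ||w||^2
+        w = update(w, g, state, lr, t)
+    return w
+
+
+def sgd(w, g, s, lr, t):
+    return w - lr * g
+
+
+def momentum(w, g, s, lr, t, mu=0.9):
+    s['v'] = mu * s.get('v', 0) - lr * g     # integrate velocity
+    return w + s['v']
+
+
+def adagrad(w, g, s, lr, t, eps=1e-7):
+    s['c'] = s.get('c', 0) + g ** 2          # accumulated squared gradients
+    return w - lr * g / (np.sqrt(s['c']) + eps)
+
+
+def rmsprop(w, g, s, lr, t, decay=0.9, eps=1e-7):
+    s['c'] = decay * s.get('c', 0) + (1 - decay) * g ** 2
+    return w - lr * g / (np.sqrt(s['c']) + eps)
+
+
+def adam(w, g, s, lr, t, b1=0.9, b2=0.999, eps=1e-8):
+    s['m'] = b1 * s.get('m', 0) + (1 - b1) * g
+    s['v'] = b2 * s.get('v', 0) + (1 - b2) * g ** 2
+    m_hat, v_hat = s['m'] / (1 - b1 ** t), s['v'] / (1 - b2 ** t)
+    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
+
+
+for name, rule in [('SGD', sgd), ('momentum', momentum), ('AdaGrad', adagrad),
+                   ('RMSProp', rmsprop), ('Adam', adam)]:
+    print(name, run(rule))                   # w moves toward the minimum [0, 0] under every rule
+```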
+
+#### SGD
+
+SGD is the optimizer used in the experiments above: at every iteration it computes the gradient on one batch and updates the parameters by the product of a fixed learning rate and that gradient. As those experiments showed, choosing a suitable learning rate for SGD is not easy (too small converges slowly, too large can cause numerical trouble), and keeping the learning rate fixed throughout training is itself questionable. The following variants improve on plain gradient descent in different ways.
+
+#### SGD with momentum
+The momentum method borrows the notion of momentum from physics: it keeps a decaying running record of past gradients and uses it to adjust the current update direction, which makes training more stable.
+
+Illustrative code:
+```{python}
+# Momentum update
+v = mu * v - learning_rate * dx # integrate velocity
+x += v # integrate position
+```
+
+#### AdaGrad
+AdaGrad automatically adapts the learning rate to the gradient magnitudes during training, so that parameters with large gradients have their effective learning rate decay faster. Illustrative code:
+
+```{python}
+# Assume the gradient dx and parameter vector x
+cache += dx**2
+x += - learning_rate * dx / (np.sqrt(cache) + eps)
+```
+
+#### RMSprop
+
+A major drawback of AdaGrad is that the effective learning rate decreases monotonically, and often rather quickly. RMSProp tries to avoid this by scaling the learning rate with a moving average of the squared gradients instead of their cumulative sum.
+
+Illustrative code:
+```{python}
+cache = decay_rate * cache + (1 - decay_rate) * dx**2
+x += - learning_rate * dx / (np.sqrt(cache) + eps)
+```
+Here decay_rate is typically 0.9, 0.99 or 0.999.
+
+#### Adam
+Adam combines the advantages of several of the methods above. Illustrative code:
+
+```{python}
+# t is your iteration counter going from 1 to infinity
+m = beta1*m + (1-beta1)*dx
+mt = m / (1-beta1**t)
+v = beta2*v + (1-beta2)*(dx**2)
+vt = v / (1-beta2**t)
+x += - learning_rate * mt / (np.sqrt(vt) + eps)
+```
+Typically beta1 = 0.9, beta2 = 0.999, and eps = 1e-8.
+
+
+#### Experiment
+
+All of the above update rules are implemented with `numpy` as methods of the `NumpyModel` class in `numpy_fnn.py`.
+
+The model is trained with each optimizer in turn; the other parameters are:
+|learning rate|epoch number|batch size|
+|--|--|--|
+|0.002|10|128|
+
+
+#### Results
+
+![opt1](./img/opt_1.png)
+
+The figure shows that Adam makes the model converge quickly and reach high accuracy, a clear advantage. RMSProp runs into numerical problems, possibly because its hyperparameters do not suit this setting. Momentum and AdaGrad are both somewhat more stable than SGD, but perform similarly to each other on this model.
+
+
+### Weight Initialization
+
+In neural-network training, parameter initialization has a large influence on whether the iterations converge and on the quality of the solution they converge to. In the feedforward network used here, poorly initialized weights can lead to vanishing or exploding gradients. This part of the experiment examines how the original code initializes the weights.
+
+First, the `get_torch_initialization` function in `utils.py` shows that the weights are initialized by borrowing the initial weights of `torch.nn.Linear` layers of the same sizes.
+
+Next, the PyTorch documentation shows that `torch.nn.Linear` initializes its weight with `init.kaiming_uniform_(self.weight, a=math.sqrt(5))`, i.e. Kaiming uniform initialization: the weight entries are drawn uniformly from $(-bound, bound)$ with $bound=\sqrt{\frac{6}{(1+a^2)d}}$, where $d$ is the input (fan-in) size and $a$ is the negative-half-axis slope of the nonlinearity (leaky-ReLU). Since this experiment uses plain ReLU, $a$ is taken to be 0 in the `numpy` re-implementation, giving $bound=\sqrt{6/d}$.
+
+The weight initialization is then implemented with `numpy` in `numpy_mnist.py` as follows:
+
+```{python}
+def get_torch_initialization_numpy():
+    bound1 = np.sqrt(6 / (28 * 28))
+    bound2 = np.sqrt(6 / 256)
+    bound3 = np.sqrt(6 / 64)
+
+    W1 = np.random.uniform(-bound1, bound1, (28 * 28, 256))
+    W2 = np.random.uniform(-bound2, bound2, (256, 64))
+    W3 = np.random.uniform(-bound3, bound3, (64, 10))
+
+    return W1, W2, W3
+```
+
+#### Experiment
+Running the default `numpy_run()` gives the following results:
+![init](./img/init.png)
+```
+[0] Accuracy: 0.9553
+[1] Accuracy: 0.9595
+[2] Accuracy: 0.9694
+```
+The results are close to those obtained with the original initialization.
\ No newline at end of file
diff --git a/assignment-2/submission/17307130243/img/.keep b/assignment-2/submission/17307130243/img/.keep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/assignment-2/submission/17307130243/img/b1024_1.png b/assignment-2/submission/17307130243/img/b1024_1.png new file mode 100644 index 0000000000000000000000000000000000000000..b0ea36943a9311a5d806e52fa7e859d282c16c62 Binary files /dev/null and b/assignment-2/submission/17307130243/img/b1024_1.png differ diff --git a/assignment-2/submission/17307130243/img/b128_1.png b/assignment-2/submission/17307130243/img/b128_1.png new file mode 100644 index 0000000000000000000000000000000000000000..25166b8f1baacacfe69b47b93ed32e447e0f0bcd Binary files /dev/null and b/assignment-2/submission/17307130243/img/b128_1.png differ diff --git a/assignment-2/submission/17307130243/img/b16_1.png b/assignment-2/submission/17307130243/img/b16_1.png new file mode 
100644 index 0000000000000000000000000000000000000000..8b8e96367ed6ef28ad73eee03716d8f64e6e9f6f Binary files /dev/null and b/assignment-2/submission/17307130243/img/b16_1.png differ diff --git a/assignment-2/submission/17307130243/img/b1_1.png b/assignment-2/submission/17307130243/img/b1_1.png new file mode 100644 index 0000000000000000000000000000000000000000..2a6c10150b2c74608e89d5ebb7cfb1a05deaa26f Binary files /dev/null and b/assignment-2/submission/17307130243/img/b1_1.png differ diff --git a/assignment-2/submission/17307130243/img/b256_1.png b/assignment-2/submission/17307130243/img/b256_1.png new file mode 100644 index 0000000000000000000000000000000000000000..d830aeb3df9b99563c6259e151ba593c54c06cf6 Binary files /dev/null and b/assignment-2/submission/17307130243/img/b256_1.png differ diff --git a/assignment-2/submission/17307130243/img/b32_1.png b/assignment-2/submission/17307130243/img/b32_1.png new file mode 100644 index 0000000000000000000000000000000000000000..bc3d4b8915bc3d4e7186f29f3d241f9a5ba084cc Binary files /dev/null and b/assignment-2/submission/17307130243/img/b32_1.png differ diff --git a/assignment-2/submission/17307130243/img/b512_1.png b/assignment-2/submission/17307130243/img/b512_1.png new file mode 100644 index 0000000000000000000000000000000000000000..74169fcbf6e897d2ee78be6444187edd2d88331f Binary files /dev/null and b/assignment-2/submission/17307130243/img/b512_1.png differ diff --git a/assignment-2/submission/17307130243/img/b64_1.png b/assignment-2/submission/17307130243/img/b64_1.png new file mode 100644 index 0000000000000000000000000000000000000000..7863486a4371529af9f256b0c46e3849c6215a90 Binary files /dev/null and b/assignment-2/submission/17307130243/img/b64_1.png differ diff --git a/assignment-2/submission/17307130243/img/batch_size.png b/assignment-2/submission/17307130243/img/batch_size.png new file mode 100644 index 0000000000000000000000000000000000000000..ffc6954e0378aa70440fda924aea5116a9687287 Binary files /dev/null and b/assignment-2/submission/17307130243/img/batch_size.png differ diff --git a/assignment-2/submission/17307130243/img/init.png b/assignment-2/submission/17307130243/img/init.png new file mode 100644 index 0000000000000000000000000000000000000000..7d6feca8b26759afc49cc2c638b1d4d2a5c84888 Binary files /dev/null and b/assignment-2/submission/17307130243/img/init.png differ diff --git a/assignment-2/submission/17307130243/img/learning_rate.png b/assignment-2/submission/17307130243/img/learning_rate.png new file mode 100644 index 0000000000000000000000000000000000000000..120811f9c7b1e08106c7813ad1505620700496b8 Binary files /dev/null and b/assignment-2/submission/17307130243/img/learning_rate.png differ diff --git a/assignment-2/submission/17307130243/img/lr_1.png b/assignment-2/submission/17307130243/img/lr_1.png new file mode 100644 index 0000000000000000000000000000000000000000..4cfd56e69baff52ec48c30659a2606dedf5480ec Binary files /dev/null and b/assignment-2/submission/17307130243/img/lr_1.png differ diff --git a/assignment-2/submission/17307130243/img/lr_2.png b/assignment-2/submission/17307130243/img/lr_2.png new file mode 100644 index 0000000000000000000000000000000000000000..cdd23eec5b965b0ef9d9332f4fa6700538ad11d4 Binary files /dev/null and b/assignment-2/submission/17307130243/img/lr_2.png differ diff --git a/assignment-2/submission/17307130243/img/lr_3.png b/assignment-2/submission/17307130243/img/lr_3.png new file mode 100644 index 0000000000000000000000000000000000000000..54fb2768f6e9ae91ab2d8806fe13f2587595f732 Binary 
files /dev/null and b/assignment-2/submission/17307130243/img/lr_3.png differ diff --git a/assignment-2/submission/17307130243/img/lr_4.png b/assignment-2/submission/17307130243/img/lr_4.png new file mode 100644 index 0000000000000000000000000000000000000000..f751ecc7fceb471e8dc52867f3c2da0e44b5a312 Binary files /dev/null and b/assignment-2/submission/17307130243/img/lr_4.png differ diff --git a/assignment-2/submission/17307130243/img/mini_batch.png b/assignment-2/submission/17307130243/img/mini_batch.png new file mode 100644 index 0000000000000000000000000000000000000000..ac47c7991574a2e9729cf30bd9a18481e551ed06 Binary files /dev/null and b/assignment-2/submission/17307130243/img/mini_batch.png differ diff --git a/assignment-2/submission/17307130243/img/net_structure-1.png b/assignment-2/submission/17307130243/img/net_structure-1.png new file mode 100644 index 0000000000000000000000000000000000000000..3f96cf02e720739b574d103b2a2d467f798acac3 Binary files /dev/null and b/assignment-2/submission/17307130243/img/net_structure-1.png differ diff --git a/assignment-2/submission/17307130243/img/net_structure.png b/assignment-2/submission/17307130243/img/net_structure.png new file mode 100644 index 0000000000000000000000000000000000000000..3997bba001bf9d6175183c5f2eed24ed2930e2d6 Binary files /dev/null and b/assignment-2/submission/17307130243/img/net_structure.png differ diff --git a/assignment-2/submission/17307130243/img/opt_1.png b/assignment-2/submission/17307130243/img/opt_1.png new file mode 100644 index 0000000000000000000000000000000000000000..5305066801bd4ea98d0c6b171e2e7569f0cbfeb4 Binary files /dev/null and b/assignment-2/submission/17307130243/img/opt_1.png differ diff --git a/assignment-2/submission/17307130243/img/softmax.png b/assignment-2/submission/17307130243/img/softmax.png new file mode 100644 index 0000000000000000000000000000000000000000..71fa0c41f4df0c1f969353666a9430c57e7c252d Binary files /dev/null and b/assignment-2/submission/17307130243/img/softmax.png differ diff --git a/assignment-2/submission/17307130243/numpy_fnn.py b/assignment-2/submission/17307130243/numpy_fnn.py new file mode 100644 index 0000000000000000000000000000000000000000..e3829e4d8d24ce459ffbddcb8e705ab8029aa2da --- /dev/null +++ b/assignment-2/submission/17307130243/numpy_fnn.py @@ -0,0 +1,267 @@ +import numpy as np + + +class NumpyOp: + + def __init__(self): + self.memory = {} + self.epsilon = 1e-12 + + +class Matmul(NumpyOp): + + def forward(self, x, W): + """ + x: shape(N, d) + w: shape(d, d') + """ + self.memory['x'] = x + self.memory['W'] = W + h = np.matmul(x, W) + return h + + def backward(self, grad_y): + """ + grad_y: shape(N, d') + """ + + #################### + # code 1 # + #################### + + grad_x = np.matmul(grad_y, self.memory['W'].T) + grad_W = np.matmul(self.memory['x'].T, grad_y) + + return grad_x, grad_W + + +class Relu(NumpyOp): + + def forward(self, x): + self.memory['x'] = x + return np.where(x > 0, x, np.zeros_like(x)) + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 2 # + #################### + + x = self.memory['x'] + grad_x = np.where(x > 0, grad_y, np.zeros_like(grad_y)) + + return grad_x + + +class Log(NumpyOp): + + def forward(self, x): + """ + x: shape(N, c) + """ + + out = np.log(x + self.epsilon) + self.memory['x'] = x + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 3 # + #################### + + x = self.memory['x'] + grad_x = (1 / (x + 
self.epsilon)) * grad_y + + return grad_x + + +class Softmax(NumpyOp): + """ + softmax over last dimension + """ + + def forward(self, x): + """ + x: shape(N, c) + """ + + #################### + # code 4 # + #################### + + self.memory['x'] = x + + exp = np.exp(x - np.max(x, axis=1, keepdims=True)) + out = exp / np.sum(exp, axis=1, keepdims=True) + self.memory['out'] = out + + return out + + def backward(self, grad_y): + """ + grad_y: same shape as x + """ + + #################### + # code 5 # + #################### + + y = self.memory['out'] + temp = np.matmul(grad_y[:, np.newaxis], np.matmul(y[:, :, np.newaxis], y[:, np.newaxis, :])).squeeze(1) + grad_x = -temp + grad_y * y + + return grad_x + + +class NumpyLoss: + + def __init__(self): + self.target = None + + def get_loss(self, pred, target): + self.target = target + return (-pred * target).sum(axis=1).mean() + + def backward(self): + return -self.target / self.target.shape[0] + + +class NumpyModel: + def __init__(self): + self.W1 = np.random.normal(size=(28 * 28, 256)) + self.W2 = np.random.normal(size=(256, 64)) + self.W3 = np.random.normal(size=(64, 10)) + + # 以下算子会在 forward 和 backward 中使用 + self.matmul_1 = Matmul() + self.relu_1 = Relu() + self.matmul_2 = Matmul() + self.relu_2 = Relu() + self.matmul_3 = Matmul() + self.softmax = Softmax() + self.log = Log() + + # 以下变量需要在 backward 中更新。 softmax_grad, log_grad 等为算子反向传播的梯度( loss 关于算子输入的偏导) + self.x1_grad, self.W1_grad = None, None + self.relu_1_grad = None + self.x2_grad, self.W2_grad = None, None + self.relu_2_grad = None + self.x3_grad, self.W3_grad = None, None + self.softmax_grad = None + self.log_grad = None + + # 以下变量在 momentum 中使用 + self.v1 = np.zeros_like(self.W1) + self.v2 = np.zeros_like(self.W2) + self.v3 = np.zeros_like(self.W3) + + # 以下变量在 adam 中使用 + self.m1 = np.zeros_like(self.W1) + self.m2 = np.zeros_like(self.W2) + self.m3 = np.zeros_like(self.W3) + + # 迭代次数 + self.t = 0 + + def forward(self, x): + x = x.reshape(-1, 28 * 28) + + #################### + # code 6 # + #################### + + x = self.matmul_1.forward(x, self.W1) + x = self.relu_1.forward(x) + + x = self.matmul_2.forward(x, self.W2) + x = self.relu_2.forward(x) + + x = self.matmul_3.forward(x, self.W3) + x = self.softmax.forward(x) + + x = self.log.forward(x) + + return x + + def backward(self, y): + #################### + # code 7 # + #################### + + self.log_grad = self.log.backward(y) + + self.softmax_grad = self.softmax.backward(self.log_grad) + self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad) + + self.relu_2_grad = self.relu_2.backward(self.x3_grad) + self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad) + + self.relu_1_grad = self.relu_1.backward(self.x2_grad) + self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad) + + def optimize(self, learning_rate): + self.W1 -= learning_rate * self.W1_grad + self.W2 -= learning_rate * self.W2_grad + self.W3 -= learning_rate * self.W3_grad + + def momentum(self, learning_rate, gamma=0.95): + self.v1 = gamma * self.v1 + (1 - gamma) * self.W1_grad + self.v2 = gamma * self.v2 + (1 - gamma) * self.W2_grad + self.v3 = gamma * self.v3 + (1 - gamma) * self.W3_grad + + self.W1 -= learning_rate * self.v1 + self.W2 -= learning_rate * self.v2 + self.W3 -= learning_rate * self.v3 + + def AdaGrad(self, learning_rate): + eps = 1e-7 + + self.v1 += self.W1_grad ** 2 + self.v2 += self.W2_grad ** 2 + self.v3 += self.W3_grad ** 2 + + self.W1 -= learning_rate * self.W1_grad / (self.v1 ** 0.5 + eps) + self.W2 -= 
learning_rate * self.W2_grad / (self.v2 ** 0.5 + eps) + self.W3 -= learning_rate * self.W3_grad / (self.v3 ** 0.5 + eps) + + def RMSProp(self, learning_rate, decay_rate=0.999): + eps = 1e-7 + self.v1 = decay_rate * self.v1 + (1 - decay_rate) * np.square(self.W1_grad) + self.v2 = decay_rate * self.v2 + (1 - decay_rate) * np.square(self.W2_grad) + self.v3 = decay_rate * self.v3 + (1 - decay_rate) * np.square(self.W3_grad) + + self.W1 -= learning_rate * self.W1_grad / (np.sqrt(self.v1) + eps) + self.W2 -= learning_rate * self.W2_grad / (np.sqrt(self.v2) + eps) + self.W3 -= learning_rate * self.W3_grad / (np.sqrt(self.v3) + eps) + + def Adam(self, learning_rate, beta1=0.9, beta2=0.999): + self.t += 1 + eps = 1e-8 + self.m1 = beta1 * self.m1 + (1 - beta1) * self.W1_grad + self.m2 = beta1 * self.m2 + (1 - beta1) * self.W2_grad + self.m3 = beta1 * self.m3 + (1 - beta1) * self.W3_grad + + self.v1 = beta2 * self.v1 + (1 - beta2) * self.W1_grad ** 2 + self.v2 = beta2 * self.v2 + (1 - beta2) * self.W2_grad ** 2 + self.v3 = beta2 * self.v3 + (1 - beta2) * self.W3_grad ** 2 + + # 修正 + m1 = self.m1 / (1 - beta1 ** self.t) + m2 = self.m2 / (1 - beta1 ** self.t) + m3 = self.m3 / (1 - beta1 ** self.t) + + v1 = self.v1 / (1 - beta2 ** self.t) + v2 = self.v2 / (1 - beta2 ** self.t) + v3 = self.v3 / (1 - beta2 ** self.t) + + self.W1 -= learning_rate * m1 / (np.sqrt(v1) + eps) + self.W2 -= learning_rate * m2 / (np.sqrt(v2) + eps) + self.W3 -= learning_rate * m3 / (np.sqrt(v3) + eps) diff --git a/assignment-2/submission/17307130243/numpy_mnist.py b/assignment-2/submission/17307130243/numpy_mnist.py new file mode 100644 index 0000000000000000000000000000000000000000..334eb7fca8d126c4d0d4a468004b57f943f9b163 --- /dev/null +++ b/assignment-2/submission/17307130243/numpy_mnist.py @@ -0,0 +1,251 @@ +import matplotlib.pyplot as plt +import numpy as np +from numpy_fnn import NumpyModel, NumpyLoss +from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot +import time + +np.random.seed(16) + +colors = ['lightskyblue', 'sandybrown', 'mediumpurple', 'olivedrab', + 'gold', 'hotpink'] + + +def normalization(dataset): + """ + 对数据进行标准化 + :param dataset: shape (N, d) + :return: (dataset - mean) / std + """ + # 避免溢出 + eps = 1e-8 + temp = dataset - dataset.mean(axis=0) + return temp / (dataset.var(axis=0) + eps) ** 0.5 + + +def mini_batch(dataset, batch_size=128, shuffle=True, normal=False): + data = np.array([datum[0].numpy() for datum in dataset]) + label = np.array([datum[1] for datum in dataset]) + + num = data.shape[0] + idx = np.arange(num) + + # shuffle + if shuffle: + np.random.shuffle(idx) + + batches = [] + for i in range(0, num, batch_size): + batch_data = data[idx[i: i + batch_size]] + batch_label = label[idx[i: i + batch_size]] + + # batch normalization + if normal: + batch_data = normalization(batch_data) + + batches.append((batch_data, batch_label)) + + # return [(data[idx[i: i + batch_size]], label[idx[i: i + batch_size]]) for i in range(0, num, batch_size)] + return batches + + +def get_torch_initialization_numpy(): + bound1 = np.sqrt(6 / (28 * 28)) + bound2 = np.sqrt(6 / 256) + bound3 = np.sqrt(6 / 64) + + W1 = np.random.uniform(-bound1, bound1, (28 * 28, 256)) + W2 = np.random.uniform(-bound2, bound2, (256, 64)) + W3 = np.random.uniform(-bound3, bound3, (64, 10)) + + return W1, W2, W3 + + +def numpy_run(learning_rate=0.1, epoch_number=3, batch_size=128, optimizer='SGD', max_iter=None): + train_dataset, test_dataset = download_mnist() + + model = NumpyModel() + numpy_loss = 
NumpyLoss()
+    model.W1, model.W2, model.W3 = get_torch_initialization_numpy()
+
+    train_loss = []
+
+    epoch_number = epoch_number
+    learning_rate = learning_rate
+
+    begin = time.time()  # start time
+
+    acc = []
+
+    iter = 0
+    flag = False
+
+    for epoch in range(epoch_number):
+        for x, y in mini_batch(train_dataset, batch_size=batch_size):
+            y = one_hot(y)
+
+            y_pred = model.forward(x)
+            loss = numpy_loss.get_loss(y_pred, y)
+
+            model.backward(numpy_loss.backward())
+            if optimizer == 'SGD':
+                model.optimize(learning_rate)
+            elif optimizer == 'momentum':
+                model.momentum(learning_rate)
+            elif optimizer == 'AdaGrad':
+                model.AdaGrad(learning_rate)
+            elif optimizer == 'RMSProp':
+                model.RMSProp(learning_rate)
+            elif optimizer == 'Adam':
+                model.Adam(learning_rate)
+
+            train_loss.append(loss.item())
+            iter += 1
+            if max_iter and iter > max_iter:
+                flag = True
+                break
+
+        x, y = batch(test_dataset)[0]
+        accuracy = np.mean((model.forward(x).argmax(axis=1) == y))
+        acc.append(accuracy)
+        print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+        if flag:
+            break
+
+    # end time
+    end = time.time()
+    print('learning rate:', learning_rate)
+    print("time:{:.4f}".format(end - begin))
+
+    plot_curve(train_loss)
+
+    # keep every 50th loss value for a smoother curve
+    simple_train_loss = []
+    for i in range(len(train_loss)):
+        if i % 50 == 0:
+            simple_train_loss.append(train_loss[i])
+    plot_curve(simple_train_loss)
+    return acc, simple_train_loss
+
+
+""" Effect of the learning rate on model performance """
+alpha = [0.0001, 0.001, 0.01, 0.1, 1, 2, 5]
+alpha2 = [0.0001, 0.001, 0.01, 0.1]
+
+
+def learning_rate_expr():
+    x = np.arange(1, 16)
+    plt.grid()
+    accs = []
+    for i in range(len(alpha)):
+        acc, _ = numpy_run(learning_rate=alpha[i], epoch_number=15)
+        accs.append(acc)
+
+    # figure
+    for i in range(len(alpha)):
+        plt.plot(x, accs[i], label='alpha={}'.format(alpha[i]))
+
+    plt.legend()
+    plt.xlabel('epoch')
+    plt.ylabel('accuracy')
+
+    plt.savefig('./img/learning_rate.png', dpi=300, bbox_inches='tight')
+
+
+def learning_rate_expr2():
+    x = np.arange(1, 51)
+    plt.grid()
+    accs = []
+    for i in range(len(alpha)):
+        acc, _ = numpy_run(learning_rate=alpha[i], epoch_number=50)
+        accs.append(acc)
+
+    # figure
+    for i in range(len(alpha2)):
+        plt.plot(x, accs[i], label='alpha={}'.format(alpha[i]))
+
+    plt.legend()
+    plt.xlabel('epoch')
+    plt.ylabel('accuracy')
+
+    plt.savefig('./img/learning_rate2.png', dpi=300, bbox_inches='tight')
+    plt.show()
+
+
+"""Effect of the batch size on model performance"""
+
+batch_sizes = [1, 16, 32, 64, 128, 256, 512, 1024]
+
+
+def batch_size_expr():
+    plt.grid()
+    accs = []
+    for size in batch_sizes:
+        acc, _ = numpy_run(learning_rate=0.01, epoch_number=5, batch_size=size, optimizer='SGD', max_iter=5000)
+        accs.append(acc)
+
+    # figure
+    for i in range(len(batch_sizes)):
+        plt.plot(accs[i], label='size={}'.format(batch_sizes[i]))
+
+    plt.legend()
+    plt.xlabel('epoch')
+    plt.ylabel('accuracy')
+    plt.savefig('./img/batch_size.png', dpi=300, bbox_inches='tight')
+
+
+""" Effect of the optimizer on model performance """
+opts = ['SGD', 'momentum', 'AdaGrad', 'RMSProp', 'Adam']
+
+
+def opt_expr():
+    x = np.arange(1, 6)
+    accs = []
+    losses = []
+    for opt in opts:
+        acc, loss = 
numpy_run(learning_rate=0.002, epoch_number=5, batch_size=128, optimizer=opt) + accs.append(acc) + losses.append(loss) + + # figure accuracy + for i in range(5): + plt.plot(accs[i], label=opts[i]) + + plt.grid() + plt.legend() + plt.xlabel("step/50") + plt.ylabel("Accuracy") + plt.savefig("./img/opt_1.png", dpi=300, bbox_inches='tight') + plt.show() + + # figure loss + for i in range(5): + plt.plot(losses[i], label=opts[i]) + + plt.grid() + plt.legend() + plt.xlabel("epoch") + plt.ylabel("Loss Value") + plt.savefig("./img/opt_2.png", dpi=300, bbox_inches='tight') + plt.show() + + +if __name__ == "__main__": + # numpy_run() + # learning_rate_expr() + # learning_rate_expr2() + # batch_size_expr() + # opt_expr() + numpy_run()