diff --git a/assignment-2/submission/18300110042/README.md b/assignment-2/submission/18300110042/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2aedb1aa0ec1150261a4b0b8669fd7044588a416 --- /dev/null +++ b/assignment-2/submission/18300110042/README.md @@ -0,0 +1,606 @@ +# 课程报告 +这是`prml-21-spring/assignment-2` 中 `选题2`的课程报告,我的代码在 [source.py](source.py) 中。 + +报告的内容是理论和实验证明双层 $\text{ReLU}$ 神经网络能够拟合任意有界闭集上的连续函数。 + +这次报告的主要内容如下: + +1. 理论证明 +2. 实验证明 + - 神经网络模型 + - 实现方法1 (split into bumps) + - 实现方法2 (back propagation) + - 拟合三类函数 + - 多项式函数 + - 三角多项式函数 + - 其他函数 + + +## 理论证明 +首先,定义如下函数: + +$$\Psi_{Relu}(t) = \text{ReLU}(t + 0.5) - \text{ReLU}(t - 0.5).$$ + +其中 $t \in \mathcal{R}$; $\text{ReLU}(t) = \max(0, t)$, + +对于 $\Psi_{Relu}(t)$, 也可以看做一个分段线性的函数: + +$$\Psi_{Relu}(t) = \begin{cases} 0, && t \le -0.5 \\\\ t + 0.5, && -0.5 < t < 0.5 \\\\ 1, && t \ge 0.5 \end{cases} .$$ + +如图: + +A squashing function constructed with ReLU + +> lab_number = 1 + +$\Psi_{Relu}: \mathcal{R} \to [0,1]$ 满足条件: + +1. 连续(continuous)非减(non-decreasing)函数; +2. $\lim_{\lambda \to +\infty} \Psi_{Relu}(\lambda) = 1$; +3. $\lim_{\lambda \to -\infty} \Psi_{Relu}(\lambda) = 0$. + +根据通用近似定理 [Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators") 中的 `Definition 2.3`,$\Psi_{Relu}(\cdot)$ 是一个 `squashing function` (挤压函数) . + +由【[Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators")】`Theorem 2.4`,通用近似定理在有界闭集上成立,对于一个 `squashing function` $\Psi(\cdot)$ ,令 $\mathcal{I}_r$ 为 $\mathcal{R}^r$ 上的任一有界闭集, $\mathcal{C}(\mathcal{I}_r)$ 为 $\mathcal{I}_r$ 中连续函数的集合,则函数集合 + +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j), \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, N \in \mathcal{N_+} \\}$$ + +在 $\mathcal{C}(\mathcal{I}_r)$ 上是一致稠密的(uniformly dense)。 + +也就是说, + +对于 $\forall f \in \mathcal{C}(\mathcal{I}_r), \space \exists N \in \mathcal{N\_+}, \space \alpha_j, b_j \in \mathcal{R}, \space \boldsymbol{w}_j \in \mathcal{R}^r, (j = 1, \cdots, N)$ ,使得函数 + +$$G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)$$ + +可以以任意精度近似函数 $f$ ,即: + +对于 $\forall \epsilon > 0$ ,有 + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{N} {\alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r.$$ + +将 $\Psi_{Relu}(\boldsymbol{x}) = \text{ReLU}(\boldsymbol{x} + 0.5) - \text{ReLU}(\boldsymbol{x} - 0.5)$ 代入上式,即可得: + +$$\begin{align} |G(\boldsymbol{x}) - f(\boldsymbol{x})| &= |\sum_{j=1}^{N} {\alpha_{j} \Psi\_{Relu}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j)} - f(\boldsymbol{x})| \\\\ &= |\sum_{j=1}^{N}{\alpha_j [\text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5) - \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j - 0.5)]} - f(\boldsymbol{x})| \\\\ &=|\sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5)} - \sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}_j^T \boldsymbol{x} + b_j - 0.5)} - f(\boldsymbol{x})| \\\\ < \epsilon, \end{align}$$ + + +进一步调整,可以得到: + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{2N} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, $$ + + +其中 +$$\tilde{\alpha}_j = \begin{cases} \alpha_j, && 1 \le j 
\le N \\\\ -\alpha_j, && N < j \le 2N \end{cases}, $$ + +$$\tilde{\boldsymbol{w}}_j = \begin{cases} \boldsymbol{w}_j, && 1 \le j \le N \\\\ \boldsymbol{w}\_{j-N}, && N < j \le 2N \end{cases}, $$ + +$$\tilde{b}_j = \begin{cases} b_j + 0.5, && 1 \le j \le N \\\\ b_j - 0.5, && N < j \le 2N \end{cases}. $$ + +记 $M = 2N \in \mathcal{N_+} $ ,即可得: + +$$G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)},$$ + +于是, + +对于 $\forall \epsilon > 0, \space \forall f \in \mathcal{C}(\mathcal{I}_r), \space \exists M \in \mathcal{N\_+}, \space \alpha_j, b_j \in \mathcal{R}, \space \boldsymbol{w}_j \in \mathcal{R}^r, (j = 1, \cdots, N)$ ,使得 + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r. $$ + +即集合 + +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}$$ + +在 $\mathcal{C}(\mathcal{I}_r)$ 上是一致稠密的。 + +该集合正是两层 $\text{ReLU}$ 神经网络能够实现的所有函数的集合,其中 $M \in \mathcal{N_+}$ 即隐藏层的神经元数量。 + +也就是说,对于两层 $\text{ReLU}$ 神经网络,我们一定能够通过有限个数($M \in \mathcal{N_+}$)的隐藏层神经元,以任意精度近似任何一个定义在实数空间 $\mathcal{R}^r$ 中的有界闭集 $\mathcal{I}_r$ 上的连续函数。 + +

$\blacksquare$
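
证明中的改写可以用数值方式直观验证。下面是一个示意性的小例子(基于 NumPy,参数随机给定,维度取 $r=1$,仅作演示):把 $N$ 个 $\Psi_{Relu}$ 单元按上述对应关系 $\tilde{\alpha}_j, \tilde{\boldsymbol{w}}_j, \tilde{b}_j$ 改写成 $2N$ 个普通 $\text{ReLU}$ 单元,两种形式的输出完全一致。

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def psi_relu(t):
    # squashing function built from two ReLU units
    return relu(t + 0.5) - relu(t - 0.5)

rng = np.random.default_rng(0)
N = 5
alpha = rng.normal(size=N)   # alpha_j
w = rng.normal(size=N)       # w_j (input dimension r = 1)
b = rng.normal(size=N)       # b_j

x = np.linspace(-2, 2, 1001)

# G(x) written with the squashing function psi_relu
G_psi = sum(alpha[j] * psi_relu(w[j] * x + b[j]) for j in range(N))

# the same G(x) rewritten as 2N plain ReLU units, as in the proof
alpha_t = np.concatenate([alpha, -alpha])    # alpha_tilde
w_t = np.concatenate([w, w])                 # w_tilde
b_t = np.concatenate([b + 0.5, b - 0.5])     # b_tilde
G_relu = sum(alpha_t[j] * relu(w_t[j] * x + b_t[j]) for j in range(2 * N))

print(np.max(np.abs(G_psi - G_relu)))   # ~0 (up to floating point error)
```
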

+ + + +## 实验证明 +通过以上的理论证明,我们知道两层 $\text{ReLU}$ 神经网络理论上能够以任意精度近似每一个定义在实数空间 $\mathcal{R}^r$ 上有界闭集 $\mathcal{I}_r$ 上的连续函数。但在实践中,如何调整神经网络的参数来完成这个任务,还需要进一步的探讨。 + +我们在下面尝试采用两种不同的方法来拟合给定的函数: + +1. 根据理论证明,通过几个 $\text{ReLU}$ 的神经元,在每一个给定的区间上对函数值进行拟合; +2. 根据常用的神经网络调整参数的方法——梯度下降,拟合对应区间内给定的函数。 + +目标是要尽可能好地拟合给定的函数。 + +### 神经网络模型 +根据证明,我们设计如下神经网络: +1. 输入 $\boldsymbol{x} \in \mathcal{R}^D$ ,经过一个 `affine` 层,得到 $$\boldsymbol{z} = \boldsymbol{x} \boldsymbol{w} + \boldsymbol{b} \in \mathcal{R}^M; $$ + +2. 而后通过 `ReLU` 激活层,得到 $$\boldsymbol{h} = \text{ReLU} (\boldsymbol{z}) = \max(\boldsymbol{z}, \boldsymbol{0}) \in \mathcal{R}^M; $$ + +3. 最后再通过一个线性映射,得到最终的输出 $$y = \boldsymbol{h} \boldsymbol{\alpha} \in \mathcal{R}$$ + +其中,$D, M \in \mathcal{N_+}$ ,$D$ 为输入的维度, $M$ 为隐藏层神经元个数, $\boldsymbol{w_1}$ 为 $D \times M$ 的矩阵, $\boldsymbol{w_2}$ 为 $M$ 维的向量。 + +模型大致结构如下: + +Two-layer ReLU Neural Network model + +这个神经网络能够实现的函数集合就是证明中的 +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^D, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}.$$ + +我们的神经网络模型能够处理任意维度( $D$ )的输入,但在接下去的实验中,为了方便可视化,实现的是针对 $1$ 维输入的拟合,即 $\mathcal{R} \to \mathcal{R}$ 的函数。 + + +### 实现方法 1 —— split into bumps +为了拟合一个定义在闭区间 $[a, b]$ 上函数 $F(x)$ ,可以将闭区间 $[a, b]$ 分割为有限个子区间(subinterval),即在 $[a, b]$ 间找到有限个点: + +$$a = x_0 < x_1 < x_2 < \cdots < x_n = b, $$ + +形成 $N$ 个子区间 $\Delta = [x_i, x_{i+1}]$ ,对该区间上的函数值进行拟合。 + +记该区间长度为 $\delta$ ,当 $\delta \to 0$ 时,我们就可以精确的近似函数 $F(x)$ . + +这和 [Riemann integral](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral") 的概念类似。 + +具体而言,我们将给定的闭区间分割为等长的 $N$ 个子区间,令每一个子区间上的预测值等于该区间中点处的函数值。 + +我们定义如下的 `bump function` : +$$b(x_1, x_2, h) = \begin{cases} 0, && x \le x_1 \\\\ h, && x_1 < x < x_2 \\\\ 0, && x \ge x_2 \end{cases}, $$ + +其中 $x_1 < x_2$ . + +`bump` 的宽度是 $\delta = x_2 - x_1$ ,高度是 $h$ 。 + +根据理论证明,我们可以通过两个 $\text{ReLU}$ 函数(即两个隐藏层的神经元)来实现一个挤压函数(`squashing function`);而实际上,通过两个挤压函数就可以近似一个 `bump function` ,通过控制每一个的 `bump` 的高度和宽度,我们可以在任意小的区间 $\Delta$ 上通过 `bump` 来近似该区间对应的函数值。 + +近似 `bump function` 的方法: + +通过以下 $4$ 个 $\text{ReLU}$ 函数近似给定的 `bump function` $b(x_1, x_2, h)$ : + +$$\begin{align} y_1 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU}(x - x_1) \\\\ y_2 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_1 +\frac{x_2 - x_1}{d})] \\\\ y_3 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_2 - \frac{x_2 - x_1}{d})] \\\\ y_4 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} (x - x_2) \end{align}$$ + +最终得到 $y = \sum_{i=1}^{4} {y_i}$ 作为 `bump function` 的近似。 + +其中 $h$ 即 `bump function` 的高度; + +$d$ 决定了 $\text{ReLU}$ 函数的斜率: `bump` 的宽度为 $x_2 - x_1$ ,而在其两侧的间断点 $x_1, x_2$ 处,$\text{ReLU}$ 函数需要通过斜率较大的直线近似,我们令这个区间的长度为 $\frac{x_2 - x_1}{d}$ 。 + +于是,当 $d \to + \infty$ 时, $\text{ReLU}$ 函数的斜率趋近于 $+ \infty$ ,近乎垂直于横轴,能够无限接近间断点处的函数。 + + +我们实现 `bump` 的模型如下: + +bump model + +得到函数: +$$y = \sum_{i=1}^{4} {y_i} , $$ +$$y_i = a_i \cdot \text{ReLU} (w_i x + b_i), \space i = 1,2,3,4$$ + +其中, + +$$w_1 = w_2 = w_3 = w_4 = 1 ; $$ + +$$\begin{align} a_1 &= a_4 = \frac{d \times h}{x_2 - x_1} \\\\ a_2 &= a_3 = -\frac{d \times h}{x_2 - x_1} ; \end{align}$$ + +$$\begin{align} b_1 &= -x_1 \\\\ b_2 &= - (x_1 +\frac{x_2 - x_1}{d}) \\\\ b_3 &= - (x_2 - \frac{x_2 - x_1}{d}) \\\\ b_4 &= - x_2 . \end{align}$$ + +如图,为我们通过 $4$ 个 $\text{ReLU}$ 函数实现的一个 `bump` : + +bump + +> lab_number = 2 + +这样,通过 $4N$ 个隐藏层的神经元,拟合 $N$ 个 `bump` ,可以拟合任意有界闭区间上连续的函数 $F(x)$ ,当 $N \to \infty, \space \delta \to 0$ 时,神经网络实现的函数 $G(x)$ 可以无限接近 $F(x)$ . 
+ +***Reference*** + +该方法参考 [Michael Nielsen (2019). Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning") + + +### 实现方法 2 —— Backpropagation +在 ***实现方法1*** 中,我们通过计算的方式设置了神经网络的参数,这种方式很容易理解,但需要非常多的 `bump` ( $N \to \infty$ )来精确的拟合函数,对应的神经元数量是 `bump` 数量的 $4$ 倍( $4N \to \infty$ ),原因就在于没有充分利用全连接神经网络中的参数。 + +在这里,对于相同结构的神经网络模型,我们使用反向传播算法自动训练神经网络的参数。具体步骤为: + +1. 通过定义的函数产生样本数据; +2. 前向传播后得到损失(采用 `mini-batch` 的方式); +3. 根据损失计算梯度,通过反向传播调整模型参数. + +#### 参数初始化 +由于在 Backpropagation 的方法中,神经网络参数的初始化对最终结果有一定影响,考虑到神经网络的激活函数是 $\text{ReLU}$ 函数,采用 $\text{Kaiming}$ 初始化的方法(参考 [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"))。 + +方法如下: + +记第 $l$ 层的参数为 $\boldsymbol{w_l}$ ,该层的神经元个数为 $n_l$ ,令 +$$\boldsymbol{w_l} \sim \mathcal{N}(0, \sqrt{\frac{2}{n_l}}).$$ + +与论文中略微不同,实验中第一层网络中的 $\boldsymbol{b}$ 也通过这个方式进行了初始化。 + +具体的,我们通过 `np.random.randn` 生成服从 $\mathcal{N}(0, 1)$ 的数,而后乘以 $\sqrt{\frac{2}{n_l}}$ 来调整方差。 + +#### Loss Function +对于回归问题,常见的损失函数有 `MSE (Mean Squared Error)` 、 `MAE (Mean Absolute Error)` 和 `Huber Loss`. + +##### MSE (Mean Squared Error) +`MSE` 衡量的是模型预测值与真实值之间的距离平方的平均值。 +记 $\hat{y}$ 为模型预测值, $y$ 为函数真实值,对于 $N$ 个样本, +$$MSE = \frac{1}{N} \sum_{i=1}^{N}{(\hat{y}_i - y_i)^2}. $$ + +`MSE` 曲线光滑、连续、可导,且随着误差的减小,梯度也会减小,便于收敛;但是对于一些异常值较为敏感,可能会降低模型的性能。 + +##### MAE (Mean Absolute Error) +相对应的, +$$MAE = \frac{1}{N} \sum_{i=1}^{N}{|\hat{y}_i - y_i|}.$$ + +`MAE` 连续但在 $\hat{y}_i - y_i = 0$ 处不可导,且其他区域处处梯度相同,不利于函数收敛以及梯度下降的学习;但另一方面,由于不同大小的误差对应的梯度相同, `MAE` 对异常值不太敏感,可以弥补 `MSE` 这方面的问题. + +##### Huber Loss +`Huber Loss` 将 `MSE` 和 `MAE` 结合起来,公式为: +$$\mathcal{L}_{\lambda} (\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y}_i - y_i)^2, && |\hat{y}_i - y_i| \le \lambda \\\\ \lambda |\hat{y}_i - y_i| - \frac{1}{2} \lambda ^2, && |\hat{y}_i - y_i| > \lambda \end{cases}$$ + +其中,超参数 $\lambda$ 决定了对 `MSE` 和 `MAE` 的侧重性,当 $\lambda$ 较大时,我们更多的使用 `MSE` ,当 $\lambda$ 较小时,我们更多的使用 `MAE` . + +在实验中,我们采用 `Huber Loss` 作为损失函数,令 $\lambda = 1.0$ . + +#### 参数优化方法 +在实验中,尝试了两种优化参数的算法: `SGD` (随机梯度下降)和 `Adam` (参考 [Adam](https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c#:~:text=Adam%20is%20an%20adaptive%20learning%20rate%20method%2C%20which,rate%20for%20each%20weight%20of%20the%20neural%20network. "Adam") )。 + +对于 `Adam` 的参数,设定 $\beta_1 = 0.9, \space \beta_2 = 0.999.$ + +在实验中统一使用了 `Adam` 。 + +#### Grid Search +实现的代码可以对 `learning_rate` , `batch_size` , `hidden_size` 等超参数进行网格搜索。 + +搜索时验证集占整个训练集数据的 $20\%$ . + +#### 评价方式 +我们通过给定的函数先生成 $10,000$ 个样本数据,将其中的 $20\%$ 的数据划为测试集,而后对模型在测试集上的表现进行评价。(最终选定的模型会训练 `50` 个 `epoch` )。 + +选择 `R squared (coefficient of determination)` (也叫 `Proportion of Variance Explained` )对拟合结果进行评价,具体计算方法为: + +$$\begin{align}R^2 &= 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} \\\\ &= 1 - \frac{\sum_i{(\hat{y}_i - y_i})^2}{\sum_i{(y_i - \bar{y})^2}} \\\\ &= 1 - \frac{MSE}{Variance}\end{align}$$ + +其中, $\hat{y}_i$ 为模型的预测值; $y_i$ 为真实值; $\bar{y}$ 为真实值的均值。 + +$R^2 \in (-\infty, 1]$ , $R^2$ 的值越大,说明模型预测效果越好; + +当 $R^2 = 1$ 时,模型的预测值全都等于真实值;当 $R^2 = 0$ 时,一种可能的情况是“模型简单的预测所有的值都等于 $y$ 的平均值”; + +$R^2$ 一般在 $[0, 1]$ 之间,当 $R^2 < 0$ 时,模型预测非常差,可能用了错误模型或模型假设不合理。 + +$R^2$ 因为去除了真实值本身方差 $Variance$ 的因素,因而可以在不同的情况下进行比较。 + + + +### 拟合三类函数 +这里对以下三类函数进行拟合: +1. 多项式函数 +2. 三角多项式函数 +3. 
其他函数 + +选择这三类函数的原因是: + +根据我已有的数学知识,通过 $\text{Taylor}$ 级数(多项式函数) 和 $\text{Fourier}$ 级数(三角多项式函数)可以拟合大多数常见的函数; + +而其他函数中则包括了一些较为少见的 `pathological functions` ,来进一步测试神经网络的拟合能力。 + +实验中对每一个给定的函数观察在 $[-2, 2]$ 闭区间内的拟合情况。 + +对于 `实现方法1 (split into bumps)` ,统一规定将区间分割为 $30$ 个子区间。 + +#### 多项式函数 +##### Taylor 公式 +根据 [Taylor's Theorem](https://en.wikipedia.org/wiki/Taylor%27s_theorem "Taylor's Theorem") ,若函数 $f$ 在 $x_0$ 的某个邻域 $O(x_0, r)$ 上具有 $n+1$ 阶导数,且其 $n+1$ 阶余项收敛于 $0$ ,那么我们可以通过 $\text{Taylor}$ 级数的部分和,即 $n$ 次 $\text{Taylor}$ 多项式 +$$f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$$ +在 $O(x_0, r)$ 上近似 $f(x)$ 。 + +虽然 $\text{Taylor}$ 公式要求函数 $n+1$ 阶可导,且只是局部近似,但根据 [Weierstrass Approximation Theorem](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem") ,任意一个在有界闭区间 $[a, b]$ 上连续的实函数 $f(x)$ 都可以展开成在这个区间上一致收敛的多项式函数,即: +对于 $\forall \epsilon > 0$ ,存在一个多项式函数 $P(x)$ ,使得 +$$|P(x) - f(x)| < \epsilon, \space x \in [a, b].$$ + +因此,若两层 $\text{ReLU}$ 神经网络能够拟合任意的多项式函数,那么它就能够拟合任意有界闭区间上的连续函数。 + +##### 实验拟合函数 +我们选择以下两个多项式函数进行拟合: +1. $$f(x) = 1 - x + x^2 - x^3 + x^4 - x^5 + x^6 - x^7$$ + +***实现方法1*** + +lab03 + +> lab_number = 3 + +***实现方法2*** + +经过网格搜索(每组超参数训练 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 90 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab09_loss + +拟合函数情况如下: + +lab09 + +> lab_number = 9 + +最终的损失 +$$R^2 = 0.99984384$$ +非常接近于 $1$ ,可以判断模型拟合的非常好。 + + +2. $$f(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \frac{x^{11}}{11}$$ + +***实现方法1*** + +lab04 + +> lab_number = 4 + +***实现方法2*** + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 30 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab10_loss + +拟合函数情况如下: + +lab10 + +> lab_number = 10 + +最终的损失 +$$R^2 = 0.99880380$$ +非常接近于 $1$ 。 + + +#### 三角多项式函数 +##### Fourier 级数 +根据 [Fourier Series](https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118548448.app1 "Fourier Series") ,若函数 $f$ $\text{Riemann}$ 可积,且在给定的周期 $[-T,T]$ 内,满足下两个条件之一: +1. 分段单调且有界; +2. 分段可导, + +则 $f$ 的 $\text{Fourier}$ 展开 +$$f(x) \sim \frac{a_0}{2} + \sum_{n=1}^{\infty}{(a_n \cos nx + b_n \sin nx)}, $$ + +在每一点均收敛,具体的,在 $x$ 点处收敛于 +$$\frac{f(x-0) + f(x+0)}{2}.$$ + +也就是说,当 $x$ 是 $f$ 的连续点时,$\text{Fourier}$ 级数收敛于 $f(x)$ ;对于左右极限存在的间断点,级数收敛于左右极限的算术平均值。 + +因而可以通过 $\text{Fourier}$ 级数的部分和,近似函数 $f$ . + +若两层 $\text{ReLU}$ 神经网络能够拟合任意的三角多项式函数,它就能够以任意精度拟合有界闭区间上的连续函数。 + +##### 实验拟合函数 +我们在这里尝试拟合各种类型的三角级数($\text{Fourier}$ 级数)的部分和函数。 + +与 $\text{Taylor}$ 级数不同,我们选择一些本身有有限间断点、或不可导的函数展开形成的 $\text{Fourier}$ 级数。 + +1. $$f(x) = \frac{1}{2} - \frac{2}{\pi} \sum_{n=1}^{10} {\frac{\sin (2n+1)x}{2n+1}}$$ + +该函数是 $F(x) = \begin{cases} 1, && x \in [-\pi, 0) \\\\ 0, && x \in [0, \pi)\end{cases}$ 的 $\text{Fourier}$ 展开的部分和。 + +***实现方法1*** + +lab05 + +> lab_number = 5 + +***实现方法2*** + +经过网格搜索(每组超参数 `20` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 30 | +| learning_rate | 0.0001 | + +模型学习过程中的损失 `loss` 变化如下: + +lab11_loss + +拟合函数情况如下: + +lab11 + +> lab_number = 11 + +对于这个函数,一开始网格搜索(每组超参数 `6` 个 `epoch`)时,训练结果并不理想,原因在于 `6` 个 `epoch` 训练不足,较小的学习率(梯度下降的 `step` )收敛较慢,表现不好;(在之前的实验中,网格搜索也都选择了最大的学习率);而由于该函数波动比较大,学习率不能过大,否则很容易跳过最优点。 + +因而我们调整了网格搜索中每组超参数训练的 `epoch` 数量,最终得到了更好的结果: +$$R^2 = 0.99973780$$ +非常接近于 $1$ 。 + + +2. 
$$f(x) = \sum_{n=0}^{10} {\frac{1}{2^n} \cos (17^n \pi x)}$$ + +这是 [Weierstrass function](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function") 的部分和。 + +具体的,令 $a \in (0,1)$ , $b$ 为奇数,且 $ab > 1 + \frac{3 \pi}{2}$ ,则 $\text{Weierstrass}$ 函数 +$$F(x) = \sum_{n=0}^{\infty} {a^n \cos (b^n \pi x)}$$ +绝对收敛,且处处连续,处处不可导. + +这里选择 $a = \frac{1}{2}, \space b = 17, $ 考察神经网络对级数部分和的拟合情况: +$$f(x) = \sum_{n=0}^{10} {(\frac{1}{2})^n \cos (17^n \pi x)}$$ + + +***实现方法1*** + +lab06 + +> lab_number = 6 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 90 | +| learning_rate | 0.001 | + + +模型学习过程中的损失 `loss` 变化如下: + +lab12_loss + +拟合函数情况如下: + +lab12 + +> lab_number = 12 + +最终的损失 +$$R^2 = 0.10848555.$$ + +这个函数由于波动太剧烈,非常难学习。 + +对于 ***实现方法1*** ,通过加大 `bump` 的数量 $N$ ,可以更好的拟合这个函数; + +但对于 ***实现方法2*** ,我们改变了网格搜索时训练的 `epoch` 数量、样本数据大小、学习率等超参数,仍然难以很好的进行拟合,说明当遇到这种波动非常大,甚至在极端情况下处处不可导的函数时,通过神经网络进行拟合还是有一定困难的。 + + +#### 其他函数 +我们进一步挑选了一些 `pathological functions (病态函数)` 尝试拟合。 + +1. Mollifier function +$$f(x) = \begin{cases} \exp\\{- \frac{1}{1 - |x|^2}\\}, && |x| < 1 \\\\ 0, && |x| \ge 1 \end{cases}, $$ + +[Mollifier](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier") 函数无穷阶可导,但在 $|x| = 1$ 处导数消失,因而不解析,其展开的 $\text{Taylor}$ 级数不收敛于该函数。 + +***实现方法1*** + +lab07 + +> lab_number = 7 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 100 | +| batch_size | 60 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab13_loss + +拟合函数情况如下: + +lab13 + +> lab_number = 13 + +最终的损失 +$$R^2 = 0.99987666$$ + + +2. Non-analytic smooth function +$$f(x) = \begin{cases} \exp \\{- \frac{1}{x}\\}, && x > 0 \\\\ 0, && x \le 0 \end{cases}$$ + +[Non-analytic smooth function](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function") 在实数轴上处处无穷阶可导且导数连续,但不解析,由于其原点处的导数为 $0$ ,该点处的 $\text{Taylor}$ 级数收敛于 $0$ ,即: +$$\sum_{n=0}^{\infty} {\frac{f^{(n)}(0)}{n!} x^n} = \sum_{n=0}^{\infty}{\frac{0}{n!} x^n} = 0, \space x \in \mathcal{R}$$ + +***实现方法1*** + +lab08 + +> lab_number = 8 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 100 | +| batch_size | 30 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab14_loss + + +拟合函数情况如下: + +lab14 + +> lab_number = 14 + +最终的损失 +$$R^2 = 0.99988132$$ + +可以看到,对于这种光滑的函数,即使不解析,不能通过 $\text{Taylor}$ 级数近似,神经网络的拟合情况依然很好。 + + +## 总结 +- 第一部分通过理论证明了,双层 $\text{ReLU}$ 神经网络具有近似任意有界闭集上的连续函数的能力。 + +- 而后尝试了两种实现的方法: + - `split into bumps` ,即将区间分割成 $N$ 个子区间,而后对每个子区间上的函数值进行拟合,即拟合不同的 `bump` 。 + + - 拟合的具体方法为 $$4 \times \text{ReLU neuron} \longrightarrow 2 \times \text{squashing function} \longrightarrow 1 \times \text{bump}.$$ + + - 因而隐藏层的神经元( $4N$ )越多,对函数拟合得越好; + + - 这种方法便于理解,可解释性较强,但需要极大数量的隐藏层神经元来对函数进行拟合,没有充分利用全连接神经网络中的参数。 + + - `Backpropagation` ,即常用的训练神经网络模型的方法,通过标注数据以及给定的损失函数,自动训练模型的参数 + + - 这种方法更充分的利用了模型中的参数,通过较少的隐藏层神经元实现拟合; + + - 在训练过程中,参数的初始化、训练时的学习率等都会对模型的拟合产生很大的影响;如何更好的训练模型,使模型更高效的收敛,还有很大探究的空间; + + - 对于波动较为剧烈的函数,通过 `Backpropagation` 的方法较难拟合; + + - 通过这种方法训练时,不同的超参数、参数对模型表现的影响较难分析,模型可解释性较差。 + + +## Reference + +[1]. [Hornik et al., 1989. Multilayer Feedforward Networks are Universal Approximators.](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators") + +[2]. [Michael Nielsen, 2019. 
Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning") + +[3]. [K. He, X. Zhang, S. Ren and J. Sun, 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification") + +[4]. [Weierstrass Approximation Theorem.](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem") + +[5]. [Mollifier function.](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier") + +[6]. [Non-analytic smooth function.](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function") + +[7]. [Weierstrass function.](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function") + +[8]. [Riemann integral.](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral") + + +## 代码使用方法 +``` +python source.py lab_number #lab_number = 1, 2, ..., 14 + e.g., python source.py 1 +``` \ No newline at end of file diff --git a/assignment-2/submission/18300110042/img/bump_model.png b/assignment-2/submission/18300110042/img/bump_model.png new file mode 100644 index 0000000000000000000000000000000000000000..b163c5d1ac0ca1bcd383c1c2c220069d1f56446d Binary files /dev/null and b/assignment-2/submission/18300110042/img/bump_model.png differ diff --git a/assignment-2/submission/18300110042/img/lab01.png b/assignment-2/submission/18300110042/img/lab01.png new file mode 100644 index 0000000000000000000000000000000000000000..7b048b4ddaffa996adb312786b1f8adbea953f39 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab01.png differ diff --git a/assignment-2/submission/18300110042/img/lab02.png b/assignment-2/submission/18300110042/img/lab02.png new file mode 100644 index 0000000000000000000000000000000000000000..249c273af2d5b1322aaeb2b5455feaa318030f1e Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab02.png differ diff --git a/assignment-2/submission/18300110042/img/lab03.png b/assignment-2/submission/18300110042/img/lab03.png new file mode 100644 index 0000000000000000000000000000000000000000..8f1bb896a573b309a0d60b52172c6d87777d6ce4 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab03.png differ diff --git a/assignment-2/submission/18300110042/img/lab04.png b/assignment-2/submission/18300110042/img/lab04.png new file mode 100644 index 0000000000000000000000000000000000000000..3f2b420672a840f8396f1325ba0f559af9aba757 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab04.png differ diff --git a/assignment-2/submission/18300110042/img/lab05.png b/assignment-2/submission/18300110042/img/lab05.png new file mode 100644 index 0000000000000000000000000000000000000000..1922301cd722f868955da6882914b67d64c0c8ee Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab05.png differ diff --git a/assignment-2/submission/18300110042/img/lab06.png b/assignment-2/submission/18300110042/img/lab06.png new file mode 100644 index 0000000000000000000000000000000000000000..6806f3108bac9280a260da9824ee1a42a280de20 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab06.png differ diff --git a/assignment-2/submission/18300110042/img/lab07.png b/assignment-2/submission/18300110042/img/lab07.png new file mode 100644 index 
0000000000000000000000000000000000000000..dabb0e653c791eced71ca3cf466ec9a7e3fa736d Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab07.png differ diff --git a/assignment-2/submission/18300110042/img/lab08.png b/assignment-2/submission/18300110042/img/lab08.png new file mode 100644 index 0000000000000000000000000000000000000000..a2e9f57904639f48d29641eb6b9155bd39556dc7 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab08.png differ diff --git a/assignment-2/submission/18300110042/img/lab09.png b/assignment-2/submission/18300110042/img/lab09.png new file mode 100644 index 0000000000000000000000000000000000000000..1939508d083254cd8292fbd4ca705716ea222de2 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09.png differ diff --git a/assignment-2/submission/18300110042/img/lab09_loss.png b/assignment-2/submission/18300110042/img/lab09_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..c578d1c61a1fa6c2b3dd72dbb0ef98d59dc0558b Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab10.png b/assignment-2/submission/18300110042/img/lab10.png new file mode 100644 index 0000000000000000000000000000000000000000..f7cbd990e0eaf241702568e73861959594fc9146 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10.png differ diff --git a/assignment-2/submission/18300110042/img/lab10_loss.png b/assignment-2/submission/18300110042/img/lab10_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..74ddd6a1a0f93455f1a5e2235a37f7acae496408 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab11.png b/assignment-2/submission/18300110042/img/lab11.png new file mode 100644 index 0000000000000000000000000000000000000000..bac94610d4890e9d106c0a9d3c79bde4a358f694 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11.png differ diff --git a/assignment-2/submission/18300110042/img/lab11_loss.png b/assignment-2/submission/18300110042/img/lab11_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..eabb70779d0bba17a25f755a8395f8120b8c7b2c Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab12.png b/assignment-2/submission/18300110042/img/lab12.png new file mode 100644 index 0000000000000000000000000000000000000000..a5cdf8a04098f453dd1ae9053ada6c5bbb238780 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12.png differ diff --git a/assignment-2/submission/18300110042/img/lab12_loss.png b/assignment-2/submission/18300110042/img/lab12_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..672faf1adcdd6ff97d55a88cf03e43223f963f08 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab13.png b/assignment-2/submission/18300110042/img/lab13.png new file mode 100644 index 0000000000000000000000000000000000000000..4e418b9a656be8f0b8624067923fe33a25897cf7 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13.png differ diff --git a/assignment-2/submission/18300110042/img/lab13_loss.png b/assignment-2/submission/18300110042/img/lab13_loss.png new file mode 100644 index 
0000000000000000000000000000000000000000..479460dfb7d5262536277da01af1d182ef798fde Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab14.png b/assignment-2/submission/18300110042/img/lab14.png new file mode 100644 index 0000000000000000000000000000000000000000..5d5df09671f88a3ef56dbcec2aa824c479af4f30 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14.png differ diff --git a/assignment-2/submission/18300110042/img/lab14_loss.png b/assignment-2/submission/18300110042/img/lab14_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..dd9c4c3ef9c89dabb7de5274a8d3e477e5e32442 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14_loss.png differ diff --git a/assignment-2/submission/18300110042/img/nn_model.png b/assignment-2/submission/18300110042/img/nn_model.png new file mode 100644 index 0000000000000000000000000000000000000000..964f1d1b689cd46847c73bc6f683dad36b9c9fa9 Binary files /dev/null and b/assignment-2/submission/18300110042/img/nn_model.png differ diff --git a/assignment-2/submission/18300110042/img/squashing_function_ReLU.png b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png new file mode 100644 index 0000000000000000000000000000000000000000..482a57daecc727c979a033bef4d24328b800476a Binary files /dev/null and b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png differ diff --git a/assignment-2/submission/18300110042/source.py b/assignment-2/submission/18300110042/source.py new file mode 100644 index 0000000000000000000000000000000000000000..0977e20798c9de00510147bdd009ffe4efe63072 --- /dev/null +++ b/assignment-2/submission/18300110042/source.py @@ -0,0 +1,543 @@ +import sys, time +import numpy as np +import matplotlib.pyplot as plt +#from mpl_toolkits import mplot3d + +class MatMul: + def __init__(self, W): + self.params = [W] + self.grads = [np.zeros_like(W)] + self.x = None + + def forward(self, x): + W, = self.params + out = np.dot(x, W) + self.x = x + return out + + def backward(self, dout): + W, = self.params + dx = np.dot(dout, W.T) + dW = np.dot(self.x.T, dout) + self.grads[0][...] = dW + return dx + +class Affine: + def __init__(self, W, b): + self.params = [W, b] + self.grads = [np.zeros_like(W), np.zeros_like(b)] + self.x = None + + def forward(self, x): + W, b = self.params + out = np.dot(x, W) + b + self.x = x + return out + + def backward(self, dout): + W, b = self.params + dx = np.dot(dout, W.T) + dW = np.dot(self.x.T, dout) + db = np.sum(dout, axis=0) + self.grads[0][...] = dW + self.grads[1][...] 
= db + return dx + +class ReLU: + def __init__(self): + self.params, self.grads = [], [] + self.x = None + + def forward(self, x): + self.mask = (x <= 0) + out = x.copy() + out[self.mask] = 0 + return out + + def backward(self, dout): + dout[self.mask] = 0 + dx = dout + return dx + +class IdentityWithMSELoss: + ''' + identity neuron node with mean square error(MSE) loss + ''' + def __init__(self): + self.params, self.grads = [], [] + + def forward(self, x, t): + self.y = x + self.t = t + loss = 0.5 * np.mean(np.sum(np.square(x - t), axis=1)) + return loss + + def backward(self, dout=1): + batch_size = self.t.shape[0] + dx = (self.y - self.t) * dout / batch_size + return dx + +class IdentityWithHuberLoss: + ''' + identity neuron node with Huber error loss + ''' + def __init__(self, beta=1.0): + self.params, self.grads = [], [] + self.beta = beta + + def forward(self, x, t): + self.y = x + self.t = t + l1 = np.sum(self.beta * (np.abs(x - t) - 0.5 * self.beta), axis=1) + l2 = 0.5 * np.sum(np.square(x - t), axis=1) + l1_mask = (x - t >= self.beta).astype(int) + l2_mask = 1 - l1_mask + loss = np.mean(l1 * l1_mask + l2 * l2_mask) + return loss + + def backward(self, dout=1): + batch_size = self.t.shape[0] + d_l1 = self.beta * np.abs(self.y - self.t) * dout / batch_size + d_l2 = (self.y - self.t) * dout / batch_size + l1_mask = (self.y - self.t >= self.beta).astype(int) + l2_mask = 1 - l1_mask + dx = d_l1 * l1_mask + d_l2 * l2_mask + return dx + +class SGD: + ''' + stochastic gradient descent(SGD) optimization + ''' + def __init__(self, lr=0.001): + self.lr = lr + + def update(self, params, grads): + for i in range(len(params)): + params[i] -= self.lr * grads[i] + +class Adam: + ''' + adaptive moment(Adam) optimization + ''' + def __init__(self, lr=0.001, beta1=0.9, beta2=0.999): + self.lr = lr + self.beta1 = beta1 + self.beta2 = beta2 + self.iter = 0 + self.m = None + self.v = None + + def update(self, params, grads): + if self.m is None: + self.m, self.v = [], [] + for param in params: + self.m.append(np.zeros_like(param)) + self.v.append(np.zeros_like(param)) + self.iter += 1 + lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter) + for i in range(len(params)): + self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i]) + self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i]) + params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7) + +class TwoReLULayerNet: + def __init__(self, input_size=1, hidden_size=10, output_size=1, + weight_scale='he', loss_func='huber'): + ''' + weight_scale: scale value used to initialize weight. + there are 3 possible value 'he', 'xavier' and 'std' + 'he': set sqrt(2/n) as weight init scale, proposed by Kaiming He + 'xavier': set sqrt(1/n) as weight init scale, proposed by Xavier Glorot + 'std': set 0.01 as weight init scale + loss_func: loss function type. 
there are 2 possible value here, 'huber' and 'mse' + 'huber': Huber error loss + 'mse': MSE loss + ''' + self.input_size = input_size + self.hidden_size = hidden_size + self.output_size = output_size + self.weight_scale = weight_scale + self.loss_func = loss_func + # init w and b + scale = self.get_weight_scale(input_size) + W1 = scale * np.random.randn(input_size, hidden_size) + b1 = np.random.randn(hidden_size) + scale = self.get_weight_scale(hidden_size) + W2 = scale * np.random.randn(hidden_size, output_size) + # init layers + self.layers = [Affine(W1, b1), ReLU(), MatMul(W2)] + if loss_func == 'huber': + self.loss_layer = IdentityWithHuberLoss() + else: # mse loss function + self.loss_layer = IdentityWithMSELoss() + # init parameters and gradients + self.params, self.grads = [], [] + for layer in self.layers: + self.params += layer.params + self.grads += layer.grads + + def get_weight_scale(self, node_size): + scale = 0.01 + if self.weight_scale == 'he': + scale = np.sqrt(2.0 / node_size) + elif self.weight_scale == 'xavier': + scale = np.sqrt(2.0 / node_size) + return scale + + def predict(self, x): + for layer in self.layers: + x = layer.forward(x) + return x + + def forward(self, x, t): + y = self.predict(x) + loss = self.loss_layer.forward(y, t) + return loss + + def backward(self, dout=1): + dout = self.loss_layer.backward(dout) + for layer in reversed(self.layers): + dout = layer.backward(dout) + return dout + +class Trainer: + def __init__(self, model, optimizer): + self.model = model + self.optimizer = optimizer + self.loss_list = [] + self.eval_interval = None + + def fit(self, x, t, max_epoch=10, batch_size=36, eval_interval=20): + data_size = len(x) + max_iters = data_size // batch_size + self.eval_interval = eval_interval + model, optimizer = self.model, self.optimizer + epoch_loss, epoch_loss_count = 0, 0 + start_time = time.time() + for epoch in range(max_epoch): + idx = np.random.permutation(np.arange(data_size)) + x = x[idx] + t = t[idx] + iter_loss, iter_loss_count = 0, 0 + for iter in range(max_iters): + batch_x = x[iter*batch_size:(iter+1)*batch_size] + batch_t = t[iter*batch_size:(iter+1)*batch_size] + loss = model.forward(batch_x, batch_t) + model.backward() + optimizer.update(model.params, model.grads) + epoch_loss += loss + epoch_loss_count += 1 + iter_loss += loss + iter_loss_count += 1 + if (eval_interval is not None) and (iter % eval_interval) == 0: + iter_avg_loss = iter_loss / iter_loss_count + self.loss_list.append(float(iter_avg_loss)) + iter_loss, iter_loss_count = 0, 0 + epoch_avg_loss = epoch_loss / epoch_loss_count + elapsed_time = time.time() - start_time + print('| epoch %d | time %d[s] | loss %.2f' % (epoch + 1, elapsed_time, epoch_avg_loss)) + epoch_loss, epoch_loss_count = 0, 0 + + def plot(self, ylim=None): + x = np.arange(len(self.loss_list)) + if ylim is not None: + plt.ylim(*ylim) + plt.plot(x, self.loss_list, label='train') + plt.xlabel('iterations (x' + str(self.eval_interval) + ')') + plt.ylabel('loss') + plt.show() + +def grid_search_cv(x, t, param_grid, max_epoch=5): + input_size = x.shape[1] + output_size = t.shape[1] + hidden_size, weight_scale, loss_func, batch_size, learning_rate = np.meshgrid( + param_grid['hidden_size'], + param_grid['weight_scale'], + param_grid['loss_func'], + param_grid['batch_size'], + param_grid['learning_rate']) + params = [p for p in zip(hidden_size.flat, weight_scale.flat, + loss_func.flat, batch_size.flat, learning_rate.flat)] + min_score = 99999999 + best_param = {} + for param in params: + print(param) + 
# split train data and validation data + (x_train, t_train), (x_val, t_val) = split_dataset(x, t, 0.2) + # train model with train data + model = TwoReLULayerNet(input_size, param[0], output_size, param[1], param[2]) + optimizer = Adam(param[4]) + trainer = Trainer(model, optimizer) + trainer.fit(x_train, t_train, max_epoch, param[3], None) + # evaluate model with validation data + score = model.forward(x_val, t_val) + print('score: ', score) + if score < min_score: + min_score = score + best_param['hidden_size'] = param[0] + best_param['weight_scale'] = param[1] + best_param['loss_func'] = param[2] + best_param['batch_size'] = param[3] + best_param['learning_rate'] = param[4] + return best_param + +def split_dataset(x, t, rate=0.2): + N = x.shape[0] + idx = np.random.permutation(np.arange(N)) + x = x[idx] + t = t[idx] + split_idx = int(N * rate) + x_train = x[split_idx:] + t_train = t[split_idx:] + x_test = x[:split_idx] + t_test = t[:split_idx] + return (x_train, t_train), (x_test, t_test) + +def load_data(func, params): + # parse func parameters + x = params.get('x', np.array([])) + if len(x) == 1: + x = x[0] + else: + d_grid = np.meshgrid(*x) + x = np.array([p for p in zip(*[d.flat for d in d_grid])]) + kwargs = params.get('kwargs', np.array([])) + # execute function to get 't' + t = eval(func) + # reshape x and t + N = x.shape[0] + x = x.reshape(N, -1) + N = t.shape[0] + t = t.reshape(N, -1) + return x, t + +def relu_func(x): + y = np.zeros_like(x) + y[...] = x + y[x < 0] = 0 + return y + +def bump_func_1d(x, x1, x2, h): + y = np.ones_like(x) * h + y[x < x1] = 0 + y[x > x2] = 0 + return y + +def approx_bump_1d(x, x1, x2, h, d): + y1 = d * h * relu_func(x - x1) / (x2 - x1) + y2 = -d * h * relu_func(x - (x1 + (x2 - x1) / d)) / (x2 - x1) + y3 = -d * h * relu_func(x - (x2 - (x2 - x1) / d)) / (x2 - x1) + y4 = d * h * relu_func(x - x2) / (x2 - x1) + return y1 + y2 + y3 + y4 + +def rsqare(x, t): + mse = np.mean(np.sum(np.square(x - t), axis=1)) + loss = 1 - mse / np.var(t) + return loss + +def approx_func3(x): + sum = np.zeros_like(x) + for i in range(10): + t = np.sin((2 * i + 1) * x) / (2 * i + 1) + sum += t + return 0.5 - 2 * sum / np.pi + +def approx_func4(x): + sum = np.zeros_like(x) + for i in range(10): + t = np.cos((np.power(17,i) * np.pi * x) / np.power(2,i)) + sum += t + return sum + +def approx_func5(x): + t = np.exp(-1 / (1 - np.square(x))) + t[np.abs(x) >= 1] = 0 + return t + +def approx_func6(x): + t = np.exp(-1 / x) + t[x <= 0] = 0 + return t + +max_epoch = 50 +eval_interval = 100 +data_size = 10000 +x_begin, x_end = -2, 2 +param_grid = { + 'hidden_size': [10, 100, 1000], + 'batch_size': [30, 60, 90], + 'learning_rate': [1e-2, 1e-4, 1e-6], + 'weight_scale': ['he'], #['he', 'xavier', 'std'], + 'loss_func': ['huber'] #['huber', 'mse'] +} +l_config = [ + { # 1 + 'func': 'relu_func(x + 0.5) - relu_func(x - 0.5)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}}, + 'approx': 'draw', + 'title': 'A squashing function constructed with ReLU' + }, + { # 2 + 'func': 'approx_bump_1d(x, **kwargs)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}}, + 'approx': 'draw', + 'title': '' + }, + { # 3 + 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 4 + 'func': 'x - 
np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 5 + 'func': 'approx_func3(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 6 + 'func': 'approx_func4(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*10000)]}, + 'approx': 'split', + 'split_num': 300, + 'title': '' + }, + { # 7 + 'func': 'approx_func5(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 8 + 'func': 'approx_func6(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 9 + 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 10 + 'func': 'x - np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 11 + 'func': 'approx_func3(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 12 + 'func': 'approx_func4(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 13 + 'func': 'approx_func5(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 14 + 'func': 'approx_func6(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, +] + +if __name__ == '__main__': + max_lab = len(l_config) + + # parse user input + if len(sys.argv) > 1 and sys.argv[1].isnumeric(): + lab_num = int(sys.argv[1]) + else: + print('Usage: python source lab_num (1 ~ {0})'.format(max_lab)) + sys.exit(-1) + + if lab_num > max_lab or lab_num < 1: + print('Usage: python source lab_num (1 ~ {0})'.format(max_lab)) + sys.exit(-1) + + # prepare dataset + func = l_config[lab_num-1]['func'] + params = l_config[lab_num-1]['params'] + approx = l_config[lab_num-1]['approx'] + title = l_config[lab_num-1]['title'] + x_data, t_data = load_data(func, params) + + dim = x_data.shape[1] + if dim > 1: + #print('Cannot draw diagram with dimension higher tha 3.') + print('Only implement senario with dimension 1') + exit(-1) + + # approximate function + if approx == 'dnn': + (x_train, t_train), (x_test, t_test) = split_dataset(x_data, t_data) + # grid search best hyperparameters + best_param = grid_search_cv(x_train, t_train, param_grid, max_epoch=6) + # create neuron network to approximate target function + model = TwoReLULayerNet(x_train.shape[1], + best_param['hidden_size'], + t_train.shape[1], + best_param['weight_scale'], + best_param['loss_func']) + optimizer = Adam(best_param['learning_rate']) + trainer = Trainer(model, optimizer) + trainer.fit(x_train, t_train, max_epoch, best_param['batch_size'], eval_interval) + # evaluate model with test data + print('best param: ', best_param) + t_predict = model.predict(x_test) + test_loss = rsqare(t_predict, t_test) + print('| final test | loss %.8f' % (test_loss)) + # draw diagram + trainer.plot() + plt.plot(x_data, t_data) + plt.plot(x_data, model.predict(x_data), color='coral') + elif approx == 
'split': + # split x for each dimention + plt.plot(x_data, t_data) + split_num = l_config[lab_num-1]['split_num'] + x_split = np.linspace(x_begin, x_end, split_num+1) + x_sample = np.array([(x_split[i] + x_split[i+1]) / 2 for i in range(len(x_split)-1)]) + xs = np.array_split(x_data, split_num) + t_bumps = np.array([]) + for i in range(split_num): + # get h + params['x'] = [x_sample[i:i+1]] + _, h = load_data(func, params) + # get x1, x2 + x1 = x_split[i] + x2 = x_split[i+1] + # load bump function data + bump_params = {} + bump_params['x'] = [xs[i]] + bump_params['kwargs'] = {'x1': x1, 'x2': x2, 'h': h[0], 'd': 1000} + bump_func = 'approx_bump_1d(x, **kwargs)' + x_bump, t_bump = load_data(bump_func, bump_params) + t_bumps = np.concatenate((t_bumps, t_bump.flatten())) + plt.plot(x_bump, t_bump, color='coral') + else: + plt.plot(x_data, t_data, color='coral') + plt.title(title) + plt.grid() + plt.show() \ No newline at end of file