diff --git a/assignment-2/submission/18300110042/README.md b/assignment-2/submission/18300110042/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2aedb1aa0ec1150261a4b0b8669fd7044588a416
--- /dev/null
+++ b/assignment-2/submission/18300110042/README.md
@@ -0,0 +1,606 @@
+# Course Report
+This is the course report for Topic 2 of `prml-21-spring/assignment-2`; my code is in [source.py](source.py).
+
+The report shows, both theoretically and experimentally, that a two-layer $\text{ReLU}$ neural network can fit any continuous function on a bounded closed set.
+
+The main contents of the report are:
+
+1. Theoretical proof
+2. Experimental verification
+    - Neural network model
+    - Method 1 (split into bumps)
+    - Method 2 (back propagation)
+    - Fitting three classes of functions
+        - Polynomial functions
+        - Trigonometric polynomial functions
+        - Other functions
+
+
+## Theoretical Proof
+First, define the following function:
+
+$$\Psi_{Relu}(t) = \text{ReLU}(t + 0.5) - \text{ReLU}(t - 0.5),$$
+
+where $t \in \mathcal{R}$ and $\text{ReLU}(t) = \max(0, t)$.
+
+$\Psi_{Relu}(t)$ can also be viewed as a piecewise-linear function:
+
+$$\Psi_{Relu}(t) = \begin{cases} 0, && t \le -0.5 \\\\ t + 0.5, && -0.5 < t < 0.5 \\\\ 1, && t \ge 0.5 \end{cases} .$$
+
+As shown in the figure:
+
+
+
+> lab_number = 1
+
+$\Psi_{Relu}: \mathcal{R} \to [0,1]$ satisfies the following conditions:
+
+1. it is continuous and non-decreasing;
+2. $\lim_{\lambda \to +\infty} \Psi_{Relu}(\lambda) = 1$;
+3. $\lim_{\lambda \to -\infty} \Psi_{Relu}(\lambda) = 0$.
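+
+A minimal NumPy sketch of this squashing function (mirroring `relu_func` in [source.py](source.py); the sample points are chosen only for illustration):
+
+```python
+import numpy as np
+
+def relu(t):
+    # element-wise max(0, t)
+    return np.maximum(t, 0.0)
+
+def psi_relu(t):
+    # squashing function built from two ReLUs
+    return relu(t + 0.5) - relu(t - 0.5)
+
+t = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
+print(psi_relu(t))  # [0.  0.  0.5 1.  1. ]
+```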
+
+By `Definition 2.3` of the universal approximation paper [Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators"), $\Psi_{Relu}(\cdot)$ is a `squashing function`.
+
+By `Theorem 2.4` of [Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators"), the universal approximation theorem holds on bounded closed sets: for a `squashing function` $\Psi(\cdot)$, let $\mathcal{I}_r$ be any bounded closed set in $\mathcal{R}^r$ and let $\mathcal{C}(\mathcal{I}_r)$ be the set of continuous functions on $\mathcal{I}_r$; then the family of functions
+
+$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j), \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, N \in \mathcal{N_+} \\}$$
+
+is uniformly dense in $\mathcal{C}(\mathcal{I}_r)$.
+
+In other words,
+
+for every $f \in \mathcal{C}(\mathcal{I}_r)$ and every $\epsilon > 0$, there exist $N \in \mathcal{N\_+}, \space \alpha_j, b_j \in \mathcal{R}, \space \boldsymbol{w}_j \in \mathcal{R}^r \space (j = 1, \cdots, N)$ such that the function
+
+$$G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)$$
+
+approximates $f$ to precision $\epsilon$, i.e.
+
+$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{N} {\alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r.$$
+
+Substituting $\Psi_{Relu}(\boldsymbol{x}) = \text{ReLU}(\boldsymbol{x} + 0.5) - \text{ReLU}(\boldsymbol{x} - 0.5)$ into the expression above gives:
+
+$$\begin{align} |G(\boldsymbol{x}) - f(\boldsymbol{x})| &= |\sum_{j=1}^{N} {\alpha_{j} \Psi\_{Relu}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j)} - f(\boldsymbol{x})| \\\\ &= |\sum_{j=1}^{N}{\alpha_j [\text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5) - \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j - 0.5)]} - f(\boldsymbol{x})| \\\\ &= |\sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5)} - \sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j - 0.5)} - f(\boldsymbol{x})| \\\\ &< \epsilon. \end{align}$$
+
+
+Rearranging further, we obtain:
+
+$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{2N} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, $$
+
+where
+$$\tilde{\alpha}_j = \begin{cases} \alpha_j, && 1 \le j \le N \\\\ -\alpha\_{j-N}, && N < j \le 2N \end{cases}, $$
+
+$$\tilde{\boldsymbol{w}}_j = \begin{cases} \boldsymbol{w}_j, && 1 \le j \le N \\\\ \boldsymbol{w}\_{j-N}, && N < j \le 2N \end{cases}, $$
+
+$$\tilde{b}_j = \begin{cases} b_j + 0.5, && 1 \le j \le N \\\\ b\_{j-N} - 0.5, && N < j \le 2N \end{cases}. $$
+
+Writing $M = 2N \in \mathcal{N_+}$, we get:
+
+$$G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)}.$$
+
+Thus,
+
+for every $\epsilon > 0$ and every $f \in \mathcal{C}(\mathcal{I}_r)$, there exist $M \in \mathcal{N\_+}, \space \tilde{\alpha}_j, \tilde{b}_j \in \mathcal{R}, \space \tilde{\boldsymbol{w}}_j \in \mathcal{R}^r \space (j = 1, \cdots, M)$ such that
+
+$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r. $$
+
+That is, the set
+
+$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}$$
+
+is uniformly dense in $\mathcal{C}(\mathcal{I}_r)$.
+
+This set is exactly the set of functions a two-layer $\text{ReLU}$ neural network can realize, where $M \in \mathcal{N_+}$ is the number of hidden neurons.
+
+In other words, a two-layer $\text{ReLU}$ neural network with a finite number ($M \in \mathcal{N_+}$) of hidden neurons can approximate, to arbitrary precision, any continuous function defined on a bounded closed set $\mathcal{I}_r$ in $\mathcal{R}^r$.
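+
+As a quick numerical sanity check of the re-indexing step above (a minimal sketch with an arbitrarily chosen $N$, random weights and a random input; it is not part of the proof), the two forms of $G(\boldsymbol{x})$ agree:
+
+```python
+import numpy as np
+
+relu = lambda t: np.maximum(t, 0.0)
+rng = np.random.default_rng(0)
+
+N, r = 5, 3
+alpha, b = rng.normal(size=N), rng.normal(size=N)
+W = rng.normal(size=(N, r))   # rows are the w_j
+x = rng.normal(size=r)
+
+# form 1: sum_j alpha_j * Psi_Relu(w_j^T x + b_j)
+g1 = np.sum(alpha * (relu(W @ x + b + 0.5) - relu(W @ x + b - 0.5)))
+
+# form 2: 2N ReLU terms with alpha~, w~, b~ as defined above
+alpha_t = np.concatenate([alpha, -alpha])
+W_t = np.concatenate([W, W])
+b_t = np.concatenate([b + 0.5, b - 0.5])
+g2 = np.sum(alpha_t * relu(W_t @ x + b_t))
+
+print(np.isclose(g1, g2))  # True
+```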
+
+
+$\blacksquare$
+
+
+
+## Experimental Verification
+The proof above shows that, in theory, a two-layer $\text{ReLU}$ neural network can approximate, to arbitrary precision, any continuous function defined on a bounded closed set $\mathcal{I}_r$ in $\mathcal{R}^r$. How to actually adjust the network parameters to accomplish this in practice still needs further discussion.
+
+Below we try two different approaches to fit the given functions:
+
+1. Following the theoretical proof, use a few $\text{ReLU}$ neurons to fit the function values on each given subinterval;
+2. Use the usual way of tuning neural network parameters, gradient descent, to fit the given function on the corresponding interval.
+
+The goal is to fit the given functions as well as possible.
+
+### Neural Network Model
+Following the proof, we design the network as follows:
+1. The input $\boldsymbol{x} \in \mathcal{R}^D$ passes through an `affine` layer, giving $$\boldsymbol{z} = \boldsymbol{x} \boldsymbol{w} + \boldsymbol{b} \in \mathcal{R}^M; $$
+
+2. it then passes through a `ReLU` activation layer, giving $$\boldsymbol{h} = \text{ReLU} (\boldsymbol{z}) = \max(\boldsymbol{z}, \boldsymbol{0}) \in \mathcal{R}^M; $$
+
+3. finally, another linear map produces the output $$y = \boldsymbol{h} \boldsymbol{\alpha} \in \mathcal{R},$$
+
+where $D, M \in \mathcal{N_+}$, $D$ is the input dimension, $M$ is the number of hidden neurons, $\boldsymbol{w}$ is a $D \times M$ matrix, and $\boldsymbol{b}$ and $\boldsymbol{\alpha}$ are $M$-dimensional vectors.
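+
+A minimal NumPy sketch of this forward pass (shapes only; the trainable version is the `TwoReLULayerNet` class in [source.py](source.py), and the sizes below are illustrative):
+
+```python
+import numpy as np
+
+D, M = 1, 10                      # input dim, hidden size
+rng = np.random.default_rng(0)
+w = rng.normal(size=(D, M))       # first-layer weights
+b = rng.normal(size=M)            # first-layer bias
+alpha = rng.normal(size=(M, 1))   # output weights
+
+def forward(x):                   # x: (batch, D)
+    z = x @ w + b                 # affine layer, (batch, M)
+    h = np.maximum(z, 0.0)        # ReLU activation
+    return h @ alpha              # output, (batch, 1)
+
+print(forward(np.array([[0.3]])).shape)  # (1, 1)
+```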
+
+The rough structure of the model is as follows:
+
+
+
+The set of functions this network can realize is exactly the set in the proof,
+$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^D, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}.$$
+
+Our network can handle inputs of any dimension ($D$), but in the following experiments, for easier visualization, we fit $1$-dimensional inputs, i.e. functions $\mathcal{R} \to \mathcal{R}$.
+
+
+### Method 1 (split into bumps)
+To fit a function $F(x)$ defined on a closed interval $[a, b]$, we can split $[a, b]$ into finitely many subintervals, i.e. pick finitely many points in $[a, b]$:
+
+$$a = x_0 < x_1 < x_2 < \cdots < x_N = b, $$
+
+forming $N$ subintervals $\Delta = [x_i, x_{i+1}]$, and fit the function values on each subinterval.
+
+Let $\delta$ be the length of a subinterval; as $\delta \to 0$ we can approximate $F(x)$ exactly.
+
+This is similar in spirit to the [Riemann integral](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral").
+
+Concretely, we split the given closed interval into $N$ subintervals of equal length and let the predicted value on each subinterval equal the function value at its midpoint.
+
+We define the following `bump function`:
+$$b(x; x_1, x_2, h) = \begin{cases} 0, && x \le x_1 \\\\ h, && x_1 < x < x_2 \\\\ 0, && x \ge x_2 \end{cases}, $$
+
+where $x_1 < x_2$.
+
+The width of the `bump` is $\delta = x_2 - x_1$ and its height is $h$.
+
+According to the theoretical proof, two $\text{ReLU}$ functions (i.e. two hidden neurons) can realize one squashing function (`squashing function`), and two squashing functions can in turn approximate one `bump function`. By controlling the height and width of each `bump`, we can use a `bump` on an arbitrarily small subinterval $\Delta$ to approximate the function value on that subinterval.
+
+How the `bump function` is approximated:
+
+We approximate the given `bump function` $b(x; x_1, x_2, h)$ with the following $4$ $\text{ReLU}$ functions:
+
+$$\begin{align} y_1 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU}(x - x_1) \\\\ y_2 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_1 +\frac{x_2 - x_1}{d})] \\\\ y_3 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_2 - \frac{x_2 - x_1}{d})] \\\\ y_4 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} (x - x_2) \end{align}$$
+
+and finally take $y = \sum_{i=1}^{4} {y_i}$ as the approximation of the `bump function`,
+
+where $h$ is the height of the `bump function`;
+
+$d$ controls the slope of the $\text{ReLU}$ segments: the width of the `bump` is $x_2 - x_1$, and near the two jump points $x_1, x_2$ the approximation has to rise and fall along steep line segments; we let each of these transition segments have length $\frac{x_2 - x_1}{d}$.
+
+Thus, as $d \to + \infty$, the slope of these $\text{ReLU}$ segments tends to $+ \infty$, so they become almost vertical and can approach the jump points of the bump arbitrarily closely.
+
+
+Our model that implements a `bump` is as follows:
+
+
+
+This gives the function:
+$$y = \sum_{i=1}^{4} {y_i} , $$
+$$y_i = a_i \cdot \text{ReLU} (w_i x + b_i), \space i = 1,2,3,4,$$
+
+where
+
+$$w_1 = w_2 = w_3 = w_4 = 1 ; $$
+
+$$\begin{align} a_1 &= a_4 = \frac{d \times h}{x_2 - x_1} \\\\ a_2 &= a_3 = -\frac{d \times h}{x_2 - x_1} ; \end{align}$$
+
+$$\begin{align} b_1 &= -x_1 \\\\ b_2 &= - (x_1 +\frac{x_2 - x_1}{d}) \\\\ b_3 &= - (x_2 - \frac{x_2 - x_1}{d}) \\\\ b_4 &= - x_2 . \end{align}$$
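+
+A minimal NumPy sketch of this construction (the same idea as `approx_bump_1d` in [source.py](source.py); the interval, height and sample points below are chosen only for illustration):
+
+```python
+import numpy as np
+
+def relu(t):
+    return np.maximum(t, 0.0)
+
+def bump(x, x1, x2, h, d=1000):
+    # sum of 4 ReLU terms approximating a bump of height h on (x1, x2)
+    k = d * h / (x2 - x1)                     # slope of the steep segments
+    y1 =  k * relu(x - x1)
+    y2 = -k * relu(x - (x1 + (x2 - x1) / d))
+    y3 = -k * relu(x - (x2 - (x2 - x1) / d))
+    y4 =  k * relu(x - x2)
+    return y1 + y2 + y3 + y4
+
+x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
+print(bump(x, -1.0, 1.0, h=5.0))  # [0. 5. 5. 5. 0.]
+```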
+
+The figure below shows a `bump` that we realized with $4$ $\text{ReLU}$ functions:
+
+
+
+> lab_number = 2
+
+In this way, with $4N$ hidden neurons fitting $N$ `bump`s, we can fit any continuous function $F(x)$ on a bounded closed interval; as $N \to \infty$ and $\delta \to 0$, the function $G(x)$ realized by the network gets arbitrarily close to $F(x)$.
+
+***Reference***
+
+This method follows [Michael Nielsen (2019). Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning")
+
+
+### Method 2 (Backpropagation)
+In ***Method 1*** we set the network parameters by direct computation. That is easy to understand, but fitting a function accurately needs a very large number of `bump`s ($N \to \infty$), and the number of hidden neurons is $4$ times the number of `bump`s ($4N \to \infty$). The reason is that the parameters of the fully connected network are not used to their full capacity.
+
+Here, for a network with the same structure, we instead train the parameters automatically with the backpropagation algorithm. The steps are:
+
+1. Generate sample data from the given function;
+2. Run a forward pass to obtain the loss (using `mini-batch`es);
+3. Compute gradients from the loss and update the model parameters by backpropagation.
+
+#### Parameter Initialization
+In the backpropagation approach the initialization of the network parameters has some influence on the final result. Since the activation function is $\text{ReLU}$, we use $\text{Kaiming}$ initialization (see [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification")).
+
+The method is as follows:
+
+Let $\boldsymbol{w_l}$ be the parameters of layer $l$ and $n_l$ the number of inputs (fan-in) of that layer; then
+$$\boldsymbol{w_l} \sim \mathcal{N}(0, \frac{2}{n_l}),$$
+
+i.e. a zero-mean Gaussian with standard deviation $\sqrt{\frac{2}{n_l}}$.
+
+Slightly differently from the paper, in the experiments the bias $\boldsymbol{b}$ of the first layer is also initialized this way.
+
+Concretely, we draw numbers from $\mathcal{N}(0, 1)$ with `np.random.randn` and then multiply them by $\sqrt{\frac{2}{n_l}}$ to adjust the standard deviation.
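+
+A minimal sketch of this initialization (mirroring what `TwoReLULayerNet.__init__` in [source.py](source.py) does with `weight_scale='he'`; the layer sizes here are illustrative):
+
+```python
+import numpy as np
+
+def he_init(fan_in, fan_out):
+    # zero-mean Gaussian with standard deviation sqrt(2 / fan_in)
+    scale = np.sqrt(2.0 / fan_in)
+    return scale * np.random.randn(fan_in, fan_out)
+
+W1 = he_init(1, 1000)      # first affine layer, D x M
+W2 = he_init(1000, 1)      # output layer, M x 1
+print(W1.std(), W2.std())  # each roughly sqrt(2 / fan_in)
+```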
+
+#### Loss Function
+For regression, common loss functions are `MSE (Mean Squared Error)`, `MAE (Mean Absolute Error)` and the `Huber Loss`.
+
+##### MSE (Mean Squared Error)
+`MSE` measures the mean squared distance between the model predictions and the true values.
+Let $\hat{y}$ be the model prediction and $y$ the true function value; for $N$ samples,
+$$MSE = \frac{1}{N} \sum_{i=1}^{N}{(\hat{y}_i - y_i)^2}. $$
+
+The `MSE` curve is smooth, continuous and differentiable, and its gradient shrinks as the error shrinks, which helps convergence; however, it is rather sensitive to outliers, which may hurt model performance.
+
+##### MAE (Mean Absolute Error)
+Correspondingly,
+$$MAE = \frac{1}{N} \sum_{i=1}^{N}{|\hat{y}_i - y_i|}.$$
+
+`MAE` is continuous but not differentiable at $\hat{y}_i - y_i = 0$, and elsewhere its gradient has the same magnitude everywhere, which is less friendly to convergence and gradient-descent learning; on the other hand, precisely because errors of different sizes give gradients of the same magnitude, `MAE` is less sensitive to outliers and compensates for that weakness of `MSE`.
+
+##### Huber Loss
+The `Huber Loss` combines `MSE` and `MAE`:
+$$\mathcal{L}_{\lambda} (\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y}_i - y_i)^2, && |\hat{y}_i - y_i| \le \lambda \\\\ \lambda |\hat{y}_i - y_i| - \frac{1}{2} \lambda ^2, && |\hat{y}_i - y_i| > \lambda \end{cases}$$
+
+The hyperparameter $\lambda$ controls the balance between `MSE` and `MAE`: with a larger $\lambda$ the loss behaves more like `MSE`, with a smaller $\lambda$ more like `MAE`.
+
+In the experiments we use the `Huber Loss` with $\lambda = 1.0$.
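+
+A minimal NumPy sketch of this loss and its gradient (a simplified version of what `IdentityWithHuberLoss` in [source.py](source.py) computes; the sample values are made up):
+
+```python
+import numpy as np
+
+def huber_loss(y_pred, y_true, lam=1.0):
+    # quadratic for small errors, linear for large errors
+    err = y_pred - y_true
+    quad = 0.5 * err ** 2
+    lin = lam * (np.abs(err) - 0.5 * lam)
+    return np.mean(np.where(np.abs(err) <= lam, quad, lin))
+
+def huber_grad(y_pred, y_true, lam=1.0):
+    # gradient of the mean loss w.r.t. y_pred:
+    # err in the quadratic zone, lam * sign(err) in the linear zone
+    err = y_pred - y_true
+    return np.where(np.abs(err) <= lam, err, lam * np.sign(err)) / err.size
+
+y_pred = np.array([0.2, 1.5, -3.0])
+y_true = np.zeros(3)
+print(huber_loss(y_pred, y_true))  # ~1.173
+```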
+
+#### Parameter Optimization
+The experiments try two optimization algorithms: `SGD` (stochastic gradient descent) and `Adam` (see [Adam](https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c#:~:text=Adam%20is%20an%20adaptive%20learning%20rate%20method%2C%20which,rate%20for%20each%20weight%20of%20the%20neural%20network. "Adam")).
+
+For the `Adam` parameters we set $\beta_1 = 0.9, \space \beta_2 = 0.999.$
+
+`Adam` is used throughout the experiments.
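+
+A minimal sketch of the Adam-style update used here (the same form as the `Adam` class in [source.py](source.py), which folds the bias corrections of both moment estimates into the step size; the parameter and gradient values are made up):
+
+```python
+import numpy as np
+
+def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
+    # exponential moving averages of the gradient and its square
+    m = beta1 * m + (1 - beta1) * grad
+    v = beta2 * v + (1 - beta2) * grad ** 2
+    # fold both bias corrections into the learning rate, as in source.py
+    lr_t = lr * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
+    param = param - lr_t * m / (np.sqrt(v) + eps)
+    return param, m, v
+
+p, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
+for t in range(1, 4):              # a few dummy steps with gradient 1
+    p, m, v = adam_step(p, np.ones(3), m, v, t)
+print(p)
+```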
+
+#### Grid Search
+The code can run a grid search over hyperparameters such as the `learning_rate`, `batch_size` and `hidden_size`.
+
+During the search, the validation set takes up $20\%$ of the training data.
+
+#### Evaluation
+We first generate $10,000$ samples from the given function, hold out $20\%$ of them as a test set, and then evaluate the model on this test set. (The finally selected model is trained for `50` `epoch`s.)
+
+We use `R squared (coefficient of determination)` (also called the `Proportion of Variance Explained`) to evaluate the fit. It is computed as:
+
+$$\begin{align}R^2 &= 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} \\\\ &= 1 - \frac{\sum_i{(\hat{y}_i - y_i)^2}}{\sum_i{(y_i - \bar{y})^2}} \\\\ &= 1 - \frac{MSE}{Variance}\end{align}$$
+
+where $\hat{y}_i$ is the model prediction, $y_i$ the true value and $\bar{y}$ the mean of the true values.
+
+$R^2 \in (-\infty, 1]$; the larger $R^2$ is, the better the model predicts.
+
+When $R^2 = 1$, every prediction equals the true value; when $R^2 = 0$, one possible situation is that the model simply predicts the mean of $y$ for every input.
+
+$R^2$ usually lies in $[0, 1]$; $R^2 < 0$ means the predictions are very poor, possibly because a wrong model or unreasonable assumptions were used.
+
+Because $R^2$ factors out the variance of the true values themselves, it can be compared across different settings.
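+
+A minimal sketch of this metric (the same computation as `rsqare` in [source.py](source.py); the sample values are made up):
+
+```python
+import numpy as np
+
+def r_squared(y_pred, y_true):
+    # 1 - MSE / variance of the true values
+    mse = np.mean((y_pred - y_true) ** 2)
+    return 1.0 - mse / np.var(y_true)
+
+y_true = np.array([1.0, 2.0, 3.0, 4.0])
+print(r_squared(y_true, y_true))                     # 1.0 (perfect fit)
+print(r_squared(np.full(4, y_true.mean()), y_true))  # 0.0 (predicting the mean)
+```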
+
+
+
+### Fitting Three Classes of Functions
+We fit the following three classes of functions:
+1. polynomial functions
+2. trigonometric polynomial functions
+3. other functions
+
+The reason for choosing these three classes:
+
+from the mathematics I already know, $\text{Taylor}$ series (polynomial functions) and $\text{Fourier}$ series (trigonometric polynomial functions) can approximate most common functions;
+
+the "other functions" include some rarer `pathological functions`, to test the fitting ability of the network further.
+
+For every given function we look at the fit on the closed interval $[-2, 2]$.
+
+For `Method 1 (split into bumps)` the interval is split into $30$ subintervals.
+
+#### Polynomial Functions
+##### Taylor's Formula
+By [Taylor's Theorem](https://en.wikipedia.org/wiki/Taylor%27s_theorem "Taylor's Theorem"), if a function $f$ has derivatives up to order $n+1$ on some neighborhood $O(x_0, r)$ of $x_0$ and the remainder term converges to $0$, then $f(x)$ can be approximated on $O(x_0, r)$ by a partial sum of its $\text{Taylor}$ series, i.e. the degree-$n$ $\text{Taylor}$ polynomial
+$$f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n.$$
+
+Although $\text{Taylor}$'s formula requires $f$ to be $(n+1)$-times differentiable and only gives a local approximation, by the [Weierstrass Approximation Theorem](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem") every real function $f(x)$ continuous on a bounded closed interval $[a, b]$ can be uniformly approximated on that interval by polynomial functions, i.e.:
+for every $\epsilon > 0$ there exists a polynomial function $P(x)$ such that
+$$|P(x) - f(x)| < \epsilon, \space x \in [a, b].$$
+
+Therefore, if a two-layer $\text{ReLU}$ neural network can fit arbitrary polynomial functions, it can fit any continuous function on a bounded closed interval.
+
+##### Functions Fitted in the Experiments
+We fit the following two polynomial functions:
+1. $$f(x) = 1 - x + x^2 - x^3 + x^4 - x^5 + x^6 - x^7$$
+
+***Method 1***
+
+
+
+> lab_number = 3
+
+***Method 2***
+
+After a grid search (`6` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 1000 |
+| batch_size | 90 |
+| learning_rate | 0.01 |
+
+The training `loss` evolves as follows:
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 9
+
+The final evaluation score is
+$$R^2 = 0.99984384,$$
+which is very close to $1$, so the model fits very well.
+
+
+2. $$f(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \frac{x^{11}}{11}$$
+
+***Method 1***
+
+
+
+> lab_number = 4
+
+***Method 2***
+
+After a grid search (`6` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 1000 |
+| batch_size | 30 |
+| learning_rate | 0.01 |
+
+The training `loss` evolves as follows:
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 10
+
+The final evaluation score is
+$$R^2 = 0.99880380,$$
+which is very close to $1$.
+
+
+#### Trigonometric Polynomial Functions
+##### Fourier Series
+By the theory of [Fourier Series](https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118548448.app1 "Fourier Series"), if a function $f$ is $\text{Riemann}$ integrable and, on the given period $[-T,T]$, satisfies one of the following two conditions:
+1. piecewise monotone and bounded;
+2. piecewise differentiable,
+
+then the $\text{Fourier}$ expansion of $f$,
+$$f(x) \sim \frac{a_0}{2} + \sum_{n=1}^{\infty}{(a_n \cos nx + b_n \sin nx)}, $$
+
+converges at every point; specifically, at a point $x$ it converges to
+$$\frac{f(x-0) + f(x+0)}{2}.$$
+
+That is, when $x$ is a continuity point of $f$, the $\text{Fourier}$ series converges to $f(x)$; at a jump point where both one-sided limits exist, it converges to the arithmetic mean of the left and right limits.
+
+Therefore $f$ can be approximated by partial sums of its $\text{Fourier}$ series.
+
+If a two-layer $\text{ReLU}$ neural network can fit arbitrary trigonometric polynomials, it can fit continuous functions on bounded closed intervals to arbitrary precision.
+
+##### Functions Fitted in the Experiments
+Here we try to fit partial sums of several kinds of trigonometric ($\text{Fourier}$) series.
+
+Unlike the $\text{Taylor}$ case, we choose $\text{Fourier}$ series of functions that themselves have finitely many jump points or are not differentiable.
+
+1. $$f(x) = \frac{1}{2} - \frac{2}{\pi} \sum_{n=0}^{9} {\frac{\sin (2n+1)x}{2n+1}}$$
+
+This is a partial sum of the $\text{Fourier}$ expansion of $F(x) = \begin{cases} 1, && x \in [-\pi, 0) \\\\ 0, && x \in [0, \pi)\end{cases}$.
+
+***Method 1***
+
+
+
+> lab_number = 5
+
+***Method 2***
+
+After a grid search (`20` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 1000 |
+| batch_size | 30 |
+| learning_rate | 0.0001 |
+
+The training `loss` evolves as follows:
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 11
+
+For this function, the first grid search (`6` `epoch`s per setting) did not give good results: `6` `epoch`s were not enough training, so the smaller learning rates (the gradient-descent `step`s) converged slowly and performed poorly (in the earlier experiments the grid search also always picked the largest learning rate). On the other hand, because this function oscillates strongly, the learning rate cannot be too large either, otherwise the optimum is easily overshot.
+
+We therefore increased the number of `epoch`s per setting in the grid search and obtained a better result:
+$$R^2 = 0.99973780,$$
+which is very close to $1$.
+
+
+2. $$f(x) = \sum_{n=0}^{10} {\frac{1}{2^n} \cos (17^n \pi x)}$$
+
+This is a partial sum of a [Weierstrass function](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function").
+
+Specifically, let $a \in (0,1)$ and let $b$ be an odd integer with $ab > 1 + \frac{3 \pi}{2}$; then the $\text{Weierstrass}$ function
+$$F(x) = \sum_{n=0}^{\infty} {a^n \cos (b^n \pi x)}$$
+converges absolutely and is everywhere continuous but nowhere differentiable.
+
+Here we choose $a = \frac{1}{2}, \space b = 17$ and test how well the network fits the partial sum
+$$f(x) = \sum_{n=0}^{10} {(\frac{1}{2})^n \cos (17^n \pi x)}.$$
+
+
+***Method 1***
+
+
+
+> lab_number = 6
+
+***Method 2***
+
+After a grid search (`6` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 1000 |
+| batch_size | 90 |
+| learning_rate | 0.001 |
+
+The training `loss` evolves as follows:
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 12
+
+The final evaluation score is
+$$R^2 = 0.10848555.$$
+
+This function oscillates so violently that it is very hard to learn.
+
+For ***Method 1***, increasing the number of `bump`s $N$ fits this function better;
+
+but for ***Method 2***, even after changing the number of grid-search `epoch`s, the sample size, the learning rate and other hyperparameters, it was still hard to fit well. This suggests that fitting functions which oscillate this strongly, and in the extreme case are nowhere differentiable, remains difficult for a neural network.
+
+
+#### Other Functions
+We further picked some `pathological functions` to try to fit.
+
+1. Mollifier function
+$$f(x) = \begin{cases} \exp\\{- \frac{1}{1 - |x|^2}\\}, && |x| < 1 \\\\ 0, && |x| \ge 1 \end{cases}, $$
+
+The [Mollifier](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier") function is infinitely differentiable, but all of its derivatives vanish at $|x| = 1$, so it is not analytic there and its $\text{Taylor}$ expansion does not converge to the function.
+
+***Method 1***
+
+
+
+> lab_number = 7
+
+***Method 2***
+
+After a grid search (`6` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 100 |
+| batch_size | 60 |
+| learning_rate | 0.01 |
+
+The training `loss` evolves as follows:
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 13
+
+The final evaluation score is
+$$R^2 = 0.99987666.$$
+
+
+2. Non-analytic smooth function
+$$f(x) = \begin{cases} \exp \\{- \frac{1}{x}\\}, && x > 0 \\\\ 0, && x \le 0 \end{cases}$$
+
+This [Non-analytic smooth function](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function") is infinitely differentiable with continuous derivatives everywhere on the real line, but it is not analytic: all of its derivatives at the origin are $0$, so its $\text{Taylor}$ series at that point converges to $0$, i.e.:
+$$\sum_{n=0}^{\infty} {\frac{f^{(n)}(0)}{n!} x^n} = \sum_{n=0}^{\infty}{\frac{0}{n!} x^n} = 0, \space x \in \mathcal{R}.$$
+
+***Method 1***
+
+
+
+> lab_number = 8
+
+***Method 2***
+
+After a grid search (`6` `epoch`s per hyperparameter setting), the chosen hyperparameters are:
+
+| Hyperparameter | Value |
+| :--------: | :-----: |
+| hidden_size | 100 |
+| batch_size | 30 |
+| learning_rate | 0.01 |
+
+The training `loss` evolves as follows:
+
+
+
+
+The fitted function looks like this:
+
+
+
+> lab_number = 14
+
+The final evaluation score is
+$$R^2 = 0.99988132.$$
+
+We can see that the network still fits this kind of smooth function very well, even though it is not analytic and cannot be approximated by its $\text{Taylor}$ series.
+
+
+## Summary
+- The first part proves theoretically that a two-layer $\text{ReLU}$ neural network can approximate any continuous function on a bounded closed set.
+
+- We then tried two ways of realizing this:
+    - `split into bumps`: split the interval into $N$ subintervals and fit the function value on each subinterval, i.e. fit one `bump` per subinterval.
+
+        - The concrete recipe is $$4 \times \text{ReLU neuron} \longrightarrow 2 \times \text{squashing function} \longrightarrow 1 \times \text{bump}.$$
+
+        - Consequently, the more hidden neurons ($4N$), the better the function is fitted;
+
+        - This method is easy to understand and fairly interpretable, but it needs an extremely large number of hidden neurons and does not make full use of the parameters of a fully connected network.
+
+    - `Backpropagation`: the usual way of training a neural network, learning the model parameters automatically from labeled data and a given loss function.
+
+        - This method uses the model parameters more fully and fits the function with far fewer hidden neurons;
+
+        - During training, the parameter initialization, the learning rate and so on all have a large influence on the fit; how to train the model better and make it converge more efficiently still leaves much room for exploration;
+
+        - Functions that oscillate strongly are hard to fit with `Backpropagation`;
+
+        - With this method it is harder to analyze how individual hyperparameters and parameters affect the model, so interpretability is weaker.
+
+
+## Reference
+
+[1]. [Hornik et al., 1989. Multilayer Feedforward Networks are Universal Approximators.](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators")
+
+[2]. [Michael Nielsen, 2019. Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning")
+
+[3]. [K. He, X. Zhang, S. Ren and J. Sun, 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification")
+
+[4]. [Weierstrass Approximation Theorem.](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem")
+
+[5]. [Mollifier function.](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier")
+
+[6]. [Non-analytic smooth function.](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function")
+
+[7]. [Weierstrass function.](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function")
+
+[8]. [Riemann integral.](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral")
+
+
+## How to Run the Code
+```
+python source.py lab_number #lab_number = 1, 2, ..., 14
+ e.g., python source.py 1
+```
\ No newline at end of file
diff --git a/assignment-2/submission/18300110042/img/bump_model.png b/assignment-2/submission/18300110042/img/bump_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..b163c5d1ac0ca1bcd383c1c2c220069d1f56446d
Binary files /dev/null and b/assignment-2/submission/18300110042/img/bump_model.png differ
diff --git a/assignment-2/submission/18300110042/img/lab01.png b/assignment-2/submission/18300110042/img/lab01.png
new file mode 100644
index 0000000000000000000000000000000000000000..7b048b4ddaffa996adb312786b1f8adbea953f39
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab01.png differ
diff --git a/assignment-2/submission/18300110042/img/lab02.png b/assignment-2/submission/18300110042/img/lab02.png
new file mode 100644
index 0000000000000000000000000000000000000000..249c273af2d5b1322aaeb2b5455feaa318030f1e
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab02.png differ
diff --git a/assignment-2/submission/18300110042/img/lab03.png b/assignment-2/submission/18300110042/img/lab03.png
new file mode 100644
index 0000000000000000000000000000000000000000..8f1bb896a573b309a0d60b52172c6d87777d6ce4
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab03.png differ
diff --git a/assignment-2/submission/18300110042/img/lab04.png b/assignment-2/submission/18300110042/img/lab04.png
new file mode 100644
index 0000000000000000000000000000000000000000..3f2b420672a840f8396f1325ba0f559af9aba757
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab04.png differ
diff --git a/assignment-2/submission/18300110042/img/lab05.png b/assignment-2/submission/18300110042/img/lab05.png
new file mode 100644
index 0000000000000000000000000000000000000000..1922301cd722f868955da6882914b67d64c0c8ee
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab05.png differ
diff --git a/assignment-2/submission/18300110042/img/lab06.png b/assignment-2/submission/18300110042/img/lab06.png
new file mode 100644
index 0000000000000000000000000000000000000000..6806f3108bac9280a260da9824ee1a42a280de20
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab06.png differ
diff --git a/assignment-2/submission/18300110042/img/lab07.png b/assignment-2/submission/18300110042/img/lab07.png
new file mode 100644
index 0000000000000000000000000000000000000000..dabb0e653c791eced71ca3cf466ec9a7e3fa736d
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab07.png differ
diff --git a/assignment-2/submission/18300110042/img/lab08.png b/assignment-2/submission/18300110042/img/lab08.png
new file mode 100644
index 0000000000000000000000000000000000000000..a2e9f57904639f48d29641eb6b9155bd39556dc7
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab08.png differ
diff --git a/assignment-2/submission/18300110042/img/lab09.png b/assignment-2/submission/18300110042/img/lab09.png
new file mode 100644
index 0000000000000000000000000000000000000000..1939508d083254cd8292fbd4ca705716ea222de2
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09.png differ
diff --git a/assignment-2/submission/18300110042/img/lab09_loss.png b/assignment-2/submission/18300110042/img/lab09_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..c578d1c61a1fa6c2b3dd72dbb0ef98d59dc0558b
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/lab10.png b/assignment-2/submission/18300110042/img/lab10.png
new file mode 100644
index 0000000000000000000000000000000000000000..f7cbd990e0eaf241702568e73861959594fc9146
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10.png differ
diff --git a/assignment-2/submission/18300110042/img/lab10_loss.png b/assignment-2/submission/18300110042/img/lab10_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..74ddd6a1a0f93455f1a5e2235a37f7acae496408
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/lab11.png b/assignment-2/submission/18300110042/img/lab11.png
new file mode 100644
index 0000000000000000000000000000000000000000..bac94610d4890e9d106c0a9d3c79bde4a358f694
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11.png differ
diff --git a/assignment-2/submission/18300110042/img/lab11_loss.png b/assignment-2/submission/18300110042/img/lab11_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..eabb70779d0bba17a25f755a8395f8120b8c7b2c
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/lab12.png b/assignment-2/submission/18300110042/img/lab12.png
new file mode 100644
index 0000000000000000000000000000000000000000..a5cdf8a04098f453dd1ae9053ada6c5bbb238780
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12.png differ
diff --git a/assignment-2/submission/18300110042/img/lab12_loss.png b/assignment-2/submission/18300110042/img/lab12_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..672faf1adcdd6ff97d55a88cf03e43223f963f08
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/lab13.png b/assignment-2/submission/18300110042/img/lab13.png
new file mode 100644
index 0000000000000000000000000000000000000000..4e418b9a656be8f0b8624067923fe33a25897cf7
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13.png differ
diff --git a/assignment-2/submission/18300110042/img/lab13_loss.png b/assignment-2/submission/18300110042/img/lab13_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..479460dfb7d5262536277da01af1d182ef798fde
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/lab14.png b/assignment-2/submission/18300110042/img/lab14.png
new file mode 100644
index 0000000000000000000000000000000000000000..5d5df09671f88a3ef56dbcec2aa824c479af4f30
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14.png differ
diff --git a/assignment-2/submission/18300110042/img/lab14_loss.png b/assignment-2/submission/18300110042/img/lab14_loss.png
new file mode 100644
index 0000000000000000000000000000000000000000..dd9c4c3ef9c89dabb7de5274a8d3e477e5e32442
Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14_loss.png differ
diff --git a/assignment-2/submission/18300110042/img/nn_model.png b/assignment-2/submission/18300110042/img/nn_model.png
new file mode 100644
index 0000000000000000000000000000000000000000..964f1d1b689cd46847c73bc6f683dad36b9c9fa9
Binary files /dev/null and b/assignment-2/submission/18300110042/img/nn_model.png differ
diff --git a/assignment-2/submission/18300110042/img/squashing_function_ReLU.png b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png
new file mode 100644
index 0000000000000000000000000000000000000000..482a57daecc727c979a033bef4d24328b800476a
Binary files /dev/null and b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png differ
diff --git a/assignment-2/submission/18300110042/source.py b/assignment-2/submission/18300110042/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..0977e20798c9de00510147bdd009ffe4efe63072
--- /dev/null
+++ b/assignment-2/submission/18300110042/source.py
@@ -0,0 +1,543 @@
+import sys, time
+import numpy as np
+import matplotlib.pyplot as plt
+#from mpl_toolkits import mplot3d
+
+class MatMul:
+ def __init__(self, W):
+ self.params = [W]
+ self.grads = [np.zeros_like(W)]
+ self.x = None
+
+ def forward(self, x):
+ W, = self.params
+ out = np.dot(x, W)
+ self.x = x
+ return out
+
+ def backward(self, dout):
+ W, = self.params
+ dx = np.dot(dout, W.T)
+ dW = np.dot(self.x.T, dout)
+ self.grads[0][...] = dW
+ return dx
+
+class Affine:
+ def __init__(self, W, b):
+ self.params = [W, b]
+ self.grads = [np.zeros_like(W), np.zeros_like(b)]
+ self.x = None
+
+ def forward(self, x):
+ W, b = self.params
+ out = np.dot(x, W) + b
+ self.x = x
+ return out
+
+ def backward(self, dout):
+ W, b = self.params
+ dx = np.dot(dout, W.T)
+ dW = np.dot(self.x.T, dout)
+ db = np.sum(dout, axis=0)
+ self.grads[0][...] = dW
+ self.grads[1][...] = db
+ return dx
+
+class ReLU:
+ def __init__(self):
+ self.params, self.grads = [], []
+ self.x = None
+
+ def forward(self, x):
+ self.mask = (x <= 0)
+ out = x.copy()
+ out[self.mask] = 0
+ return out
+
+ def backward(self, dout):
+ dout[self.mask] = 0
+ dx = dout
+ return dx
+
+class IdentityWithMSELoss:
+ '''
+ identity neuron node with mean square error(MSE) loss
+ '''
+ def __init__(self):
+ self.params, self.grads = [], []
+
+ def forward(self, x, t):
+ self.y = x
+ self.t = t
+ loss = 0.5 * np.mean(np.sum(np.square(x - t), axis=1))
+ return loss
+
+ def backward(self, dout=1):
+ batch_size = self.t.shape[0]
+ dx = (self.y - self.t) * dout / batch_size
+ return dx
+
+class IdentityWithHuberLoss:
+    '''
+    identity neuron node with Huber error loss
+    '''
+    def __init__(self, beta=1.0):
+        self.params, self.grads = [], []
+        self.beta = beta
+
+    def forward(self, x, t):
+        self.y = x
+        self.t = t
+        # element-wise Huber: quadratic where |error| < beta, linear otherwise
+        err = np.abs(x - t)
+        l1 = self.beta * (err - 0.5 * self.beta)
+        l2 = 0.5 * np.square(x - t)
+        l1_mask = (err >= self.beta).astype(int)
+        l2_mask = 1 - l1_mask
+        loss = np.mean(np.sum(l1 * l1_mask + l2 * l2_mask, axis=1))
+        return loss
+
+    def backward(self, dout=1):
+        batch_size = self.t.shape[0]
+        # gradient w.r.t. the prediction: error in the quadratic zone,
+        # beta * sign(error) in the linear zone
+        d_l1 = self.beta * np.sign(self.y - self.t) * dout / batch_size
+        d_l2 = (self.y - self.t) * dout / batch_size
+        l1_mask = (np.abs(self.y - self.t) >= self.beta).astype(int)
+        l2_mask = 1 - l1_mask
+        dx = d_l1 * l1_mask + d_l2 * l2_mask
+        return dx
+
+class SGD:
+ '''
+ stochastic gradient descent(SGD) optimization
+ '''
+ def __init__(self, lr=0.001):
+ self.lr = lr
+
+ def update(self, params, grads):
+ for i in range(len(params)):
+ params[i] -= self.lr * grads[i]
+
+class Adam:
+ '''
+ adaptive moment(Adam) optimization
+ '''
+ def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
+ self.lr = lr
+ self.beta1 = beta1
+ self.beta2 = beta2
+ self.iter = 0
+ self.m = None
+ self.v = None
+
+ def update(self, params, grads):
+ if self.m is None:
+ self.m, self.v = [], []
+ for param in params:
+ self.m.append(np.zeros_like(param))
+ self.v.append(np.zeros_like(param))
+ self.iter += 1
+ lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
+ for i in range(len(params)):
+ self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
+ self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
+ params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
+
+class TwoReLULayerNet:
+ def __init__(self, input_size=1, hidden_size=10, output_size=1,
+ weight_scale='he', loss_func='huber'):
+ '''
+ weight_scale: scale value used to initialize weight.
+ there are 3 possible value 'he', 'xavier' and 'std'
+ 'he': set sqrt(2/n) as weight init scale, proposed by Kaiming He
+ 'xavier': set sqrt(1/n) as weight init scale, proposed by Xavier Glorot
+ 'std': set 0.01 as weight init scale
+ loss_func: loss function type. there are 2 possible value here, 'huber' and 'mse'
+ 'huber': Huber error loss
+ 'mse': MSE loss
+ '''
+ self.input_size = input_size
+ self.hidden_size = hidden_size
+ self.output_size = output_size
+ self.weight_scale = weight_scale
+ self.loss_func = loss_func
+ # init w and b
+ scale = self.get_weight_scale(input_size)
+ W1 = scale * np.random.randn(input_size, hidden_size)
+ b1 = np.random.randn(hidden_size)
+ scale = self.get_weight_scale(hidden_size)
+ W2 = scale * np.random.randn(hidden_size, output_size)
+ # init layers
+ self.layers = [Affine(W1, b1), ReLU(), MatMul(W2)]
+ if loss_func == 'huber':
+ self.loss_layer = IdentityWithHuberLoss()
+ else: # mse loss function
+ self.loss_layer = IdentityWithMSELoss()
+ # init parameters and gradients
+ self.params, self.grads = [], []
+ for layer in self.layers:
+ self.params += layer.params
+ self.grads += layer.grads
+
+    def get_weight_scale(self, node_size):
+        scale = 0.01
+        if self.weight_scale == 'he':
+            scale = np.sqrt(2.0 / node_size)
+        elif self.weight_scale == 'xavier':
+            # Xavier/Glorot scale: sqrt(1/n), as described in the docstring
+            scale = np.sqrt(1.0 / node_size)
+        return scale
+
+ def predict(self, x):
+ for layer in self.layers:
+ x = layer.forward(x)
+ return x
+
+ def forward(self, x, t):
+ y = self.predict(x)
+ loss = self.loss_layer.forward(y, t)
+ return loss
+
+ def backward(self, dout=1):
+ dout = self.loss_layer.backward(dout)
+ for layer in reversed(self.layers):
+ dout = layer.backward(dout)
+ return dout
+
+class Trainer:
+ def __init__(self, model, optimizer):
+ self.model = model
+ self.optimizer = optimizer
+ self.loss_list = []
+ self.eval_interval = None
+
+ def fit(self, x, t, max_epoch=10, batch_size=36, eval_interval=20):
+ data_size = len(x)
+ max_iters = data_size // batch_size
+ self.eval_interval = eval_interval
+ model, optimizer = self.model, self.optimizer
+ epoch_loss, epoch_loss_count = 0, 0
+ start_time = time.time()
+ for epoch in range(max_epoch):
+ idx = np.random.permutation(np.arange(data_size))
+ x = x[idx]
+ t = t[idx]
+ iter_loss, iter_loss_count = 0, 0
+ for iter in range(max_iters):
+ batch_x = x[iter*batch_size:(iter+1)*batch_size]
+ batch_t = t[iter*batch_size:(iter+1)*batch_size]
+ loss = model.forward(batch_x, batch_t)
+ model.backward()
+ optimizer.update(model.params, model.grads)
+ epoch_loss += loss
+ epoch_loss_count += 1
+ iter_loss += loss
+ iter_loss_count += 1
+ if (eval_interval is not None) and (iter % eval_interval) == 0:
+ iter_avg_loss = iter_loss / iter_loss_count
+ self.loss_list.append(float(iter_avg_loss))
+ iter_loss, iter_loss_count = 0, 0
+ epoch_avg_loss = epoch_loss / epoch_loss_count
+ elapsed_time = time.time() - start_time
+ print('| epoch %d | time %d[s] | loss %.2f' % (epoch + 1, elapsed_time, epoch_avg_loss))
+ epoch_loss, epoch_loss_count = 0, 0
+
+ def plot(self, ylim=None):
+ x = np.arange(len(self.loss_list))
+ if ylim is not None:
+ plt.ylim(*ylim)
+ plt.plot(x, self.loss_list, label='train')
+ plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
+ plt.ylabel('loss')
+ plt.show()
+
+def grid_search_cv(x, t, param_grid, max_epoch=5):
+ input_size = x.shape[1]
+ output_size = t.shape[1]
+ hidden_size, weight_scale, loss_func, batch_size, learning_rate = np.meshgrid(
+ param_grid['hidden_size'],
+ param_grid['weight_scale'],
+ param_grid['loss_func'],
+ param_grid['batch_size'],
+ param_grid['learning_rate'])
+ params = [p for p in zip(hidden_size.flat, weight_scale.flat,
+ loss_func.flat, batch_size.flat, learning_rate.flat)]
+ min_score = 99999999
+ best_param = {}
+ for param in params:
+ print(param)
+ # split train data and validation data
+ (x_train, t_train), (x_val, t_val) = split_dataset(x, t, 0.2)
+ # train model with train data
+ model = TwoReLULayerNet(input_size, param[0], output_size, param[1], param[2])
+ optimizer = Adam(param[4])
+ trainer = Trainer(model, optimizer)
+ trainer.fit(x_train, t_train, max_epoch, param[3], None)
+ # evaluate model with validation data
+ score = model.forward(x_val, t_val)
+ print('score: ', score)
+ if score < min_score:
+ min_score = score
+ best_param['hidden_size'] = param[0]
+ best_param['weight_scale'] = param[1]
+ best_param['loss_func'] = param[2]
+ best_param['batch_size'] = param[3]
+ best_param['learning_rate'] = param[4]
+ return best_param
+
+def split_dataset(x, t, rate=0.2):
+ N = x.shape[0]
+ idx = np.random.permutation(np.arange(N))
+ x = x[idx]
+ t = t[idx]
+ split_idx = int(N * rate)
+ x_train = x[split_idx:]
+ t_train = t[split_idx:]
+ x_test = x[:split_idx]
+ t_test = t[:split_idx]
+ return (x_train, t_train), (x_test, t_test)
+
+def load_data(func, params):
+ # parse func parameters
+ x = params.get('x', np.array([]))
+ if len(x) == 1:
+ x = x[0]
+ else:
+ d_grid = np.meshgrid(*x)
+ x = np.array([p for p in zip(*[d.flat for d in d_grid])])
+ kwargs = params.get('kwargs', np.array([]))
+ # execute function to get 't'
+ t = eval(func)
+ # reshape x and t
+ N = x.shape[0]
+ x = x.reshape(N, -1)
+ N = t.shape[0]
+ t = t.reshape(N, -1)
+ return x, t
+
+def relu_func(x):
+ y = np.zeros_like(x)
+ y[...] = x
+ y[x < 0] = 0
+ return y
+
+def bump_func_1d(x, x1, x2, h):
+ y = np.ones_like(x) * h
+ y[x < x1] = 0
+ y[x > x2] = 0
+ return y
+
+def approx_bump_1d(x, x1, x2, h, d):
+ y1 = d * h * relu_func(x - x1) / (x2 - x1)
+ y2 = -d * h * relu_func(x - (x1 + (x2 - x1) / d)) / (x2 - x1)
+ y3 = -d * h * relu_func(x - (x2 - (x2 - x1) / d)) / (x2 - x1)
+ y4 = d * h * relu_func(x - x2) / (x2 - x1)
+ return y1 + y2 + y3 + y4
+
+def rsqare(x, t):
+    # coefficient of determination: R^2 = 1 - MSE / Var(t)
+    mse = np.mean(np.sum(np.square(x - t), axis=1))
+    score = 1 - mse / np.var(t)
+    return score
+
+def approx_func3(x):
+ sum = np.zeros_like(x)
+ for i in range(10):
+ t = np.sin((2 * i + 1) * x) / (2 * i + 1)
+ sum += t
+ return 0.5 - 2 * sum / np.pi
+
+def approx_func4(x):
+    # partial sum of the Weierstrass series: sum_n (1/2)^n * cos(17^n * pi * x)
+    sum = np.zeros_like(x)
+    for i in range(10):
+        t = np.cos(np.power(17.0, i) * np.pi * x) / np.power(2.0, i)
+        sum += t
+    return sum
+
+def approx_func5(x):
+ t = np.exp(-1 / (1 - np.square(x)))
+ t[np.abs(x) >= 1] = 0
+ return t
+
+def approx_func6(x):
+ t = np.exp(-1 / x)
+ t[x <= 0] = 0
+ return t
+
+max_epoch = 50
+eval_interval = 100
+data_size = 10000
+x_begin, x_end = -2, 2
+param_grid = {
+ 'hidden_size': [10, 100, 1000],
+ 'batch_size': [30, 60, 90],
+ 'learning_rate': [1e-2, 1e-4, 1e-6],
+ 'weight_scale': ['he'], #['he', 'xavier', 'std'],
+ 'loss_func': ['huber'] #['huber', 'mse']
+}
+l_config = [
+ { # 1
+ 'func': 'relu_func(x + 0.5) - relu_func(x - 0.5)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}},
+ 'approx': 'draw',
+ 'title': 'A squashing function constructed with ReLU'
+ },
+ { # 2
+ 'func': 'approx_bump_1d(x, **kwargs)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}},
+ 'approx': 'draw',
+ 'title': ''
+ },
+ { # 3
+ 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]},
+ 'approx': 'split',
+ 'split_num': 30,
+ 'title': ''
+ },
+ { # 4
+ 'func': 'x - np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]},
+ 'approx': 'split',
+ 'split_num': 30,
+ 'title': ''
+ },
+ { # 5
+ 'func': 'approx_func3(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]},
+ 'approx': 'split',
+ 'split_num': 30,
+ 'title': ''
+ },
+ { # 6
+ 'func': 'approx_func4(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*10000)]},
+ 'approx': 'split',
+ 'split_num': 300,
+ 'title': ''
+ },
+ { # 7
+ 'func': 'approx_func5(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]},
+ 'approx': 'split',
+ 'split_num': 30,
+ 'title': ''
+ },
+ { # 8
+ 'func': 'approx_func6(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]},
+ 'approx': 'split',
+ 'split_num': 30,
+ 'title': ''
+ },
+ { # 9
+ 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+ { # 10
+ 'func': 'x - np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+ { # 11
+ 'func': 'approx_func3(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+ { # 12
+ 'func': 'approx_func4(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+ { # 13
+ 'func': 'approx_func5(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+ { # 14
+ 'func': 'approx_func6(x)',
+ 'params': {'x': [np.linspace(x_begin, x_end, data_size)]},
+ 'approx': 'dnn',
+ 'title': ''
+ },
+]
+
+if __name__ == '__main__':
+ max_lab = len(l_config)
+
+ # parse user input
+ if len(sys.argv) > 1 and sys.argv[1].isnumeric():
+ lab_num = int(sys.argv[1])
+ else:
+ print('Usage: python source lab_num (1 ~ {0})'.format(max_lab))
+ sys.exit(-1)
+
+ if lab_num > max_lab or lab_num < 1:
+ print('Usage: python source lab_num (1 ~ {0})'.format(max_lab))
+ sys.exit(-1)
+
+ # prepare dataset
+ func = l_config[lab_num-1]['func']
+ params = l_config[lab_num-1]['params']
+ approx = l_config[lab_num-1]['approx']
+ title = l_config[lab_num-1]['title']
+ x_data, t_data = load_data(func, params)
+
+ dim = x_data.shape[1]
+ if dim > 1:
+        #print('Cannot draw diagram with dimension higher than 3.')
+        print('Only the scenario with dimension 1 is implemented')
+ exit(-1)
+
+ # approximate function
+ if approx == 'dnn':
+ (x_train, t_train), (x_test, t_test) = split_dataset(x_data, t_data)
+ # grid search best hyperparameters
+ best_param = grid_search_cv(x_train, t_train, param_grid, max_epoch=6)
+ # create neuron network to approximate target function
+ model = TwoReLULayerNet(x_train.shape[1],
+ best_param['hidden_size'],
+ t_train.shape[1],
+ best_param['weight_scale'],
+ best_param['loss_func'])
+ optimizer = Adam(best_param['learning_rate'])
+ trainer = Trainer(model, optimizer)
+ trainer.fit(x_train, t_train, max_epoch, best_param['batch_size'], eval_interval)
+ # evaluate model with test data
+ print('best param: ', best_param)
+ t_predict = model.predict(x_test)
+ test_loss = rsqare(t_predict, t_test)
+ print('| final test | loss %.8f' % (test_loss))
+ # draw diagram
+ trainer.plot()
+ plt.plot(x_data, t_data)
+ plt.plot(x_data, model.predict(x_data), color='coral')
+ elif approx == 'split':
+ # split x for each dimention
+ plt.plot(x_data, t_data)
+ split_num = l_config[lab_num-1]['split_num']
+ x_split = np.linspace(x_begin, x_end, split_num+1)
+ x_sample = np.array([(x_split[i] + x_split[i+1]) / 2 for i in range(len(x_split)-1)])
+ xs = np.array_split(x_data, split_num)
+ t_bumps = np.array([])
+ for i in range(split_num):
+ # get h
+ params['x'] = [x_sample[i:i+1]]
+ _, h = load_data(func, params)
+ # get x1, x2
+ x1 = x_split[i]
+ x2 = x_split[i+1]
+ # load bump function data
+ bump_params = {}
+ bump_params['x'] = [xs[i]]
+ bump_params['kwargs'] = {'x1': x1, 'x2': x2, 'h': h[0], 'd': 1000}
+ bump_func = 'approx_bump_1d(x, **kwargs)'
+ x_bump, t_bump = load_data(bump_func, bump_params)
+ t_bumps = np.concatenate((t_bumps, t_bump.flatten()))
+ plt.plot(x_bump, t_bump, color='coral')
+ else:
+ plt.plot(x_data, t_data, color='coral')
+ plt.title(title)
+ plt.grid()
+ plt.show()
\ No newline at end of file