diff --git a/assignment-2/submission/18300110042/README.md b/assignment-2/submission/18300110042/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2aedb1aa0ec1150261a4b0b8669fd7044588a416 --- /dev/null +++ b/assignment-2/submission/18300110042/README.md @@ -0,0 +1,606 @@ +# 课程报告 +这是`prml-21-spring/assignment-2` 中 `选题2`的课程报告,我的代码在 [source.py](source.py) 中。 + +报告的内容是理论和实验证明双层 $\text{ReLU}$ 神经网络能够拟合任意有界闭集上的连续函数。 + +这次报告的主要内容如下: + +1. 理论证明 +2. 实验证明 + - 神经网络模型 + - 实现方法1 (split into bumps) + - 实现方法2 (back propagation) + - 拟合三类函数 + - 多项式函数 + - 三角多项式函数 + - 其他函数 + + +## 理论证明 +首先,定义如下函数: + +$$\Psi_{Relu}(t) = \text{ReLU}(t + 0.5) - \text{ReLU}(t - 0.5).$$ + +其中 $t \in \mathcal{R}$; $\text{ReLU}(t) = \max(0, t)$, + +对于 $\Psi_{Relu}(t)$, 也可以看做一个分段线性的函数: + +$$\Psi_{Relu}(t) = \begin{cases} 0, && t \le -0.5 \\\\ t + 0.5, && -0.5 < t < 0.5 \\\\ 1, && t \ge 0.5 \end{cases} .$$ + +如图: + +A squashing function constructed with ReLU + +> lab_number = 1 + +$\Psi_{Relu}: \mathcal{R} \to [0,1]$ 满足条件: + +1. 连续(continuous)非减(non-decreasing)函数; +2. $\lim_{\lambda \to +\infty} \Psi_{Relu}(\lambda) = 1$; +3. $\lim_{\lambda \to -\infty} \Psi_{Relu}(\lambda) = 0$. + +根据通用近似定理 [Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators") 中的 `Definition 2.3`,$\Psi_{Relu}(\cdot)$ 是一个 `squashing function` (挤压函数) . + +由【[Hornik et al., 1989](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators")】`Theorem 2.4`,通用近似定理在有界闭集上成立,对于一个 `squashing function` $\Psi(\cdot)$ ,令 $\mathcal{I}_r$ 为 $\mathcal{R}^r$ 上的任一有界闭集, $\mathcal{C}(\mathcal{I}_r)$ 为 $\mathcal{I}_r$ 中连续函数的集合,则函数集合 + +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j), \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, N \in \mathcal{N_+} \\}$$ + +在 $\mathcal{C}(\mathcal{I}_r)$ 上是一致稠密的(uniformly dense)。 + +也就是说, + +对于 $\forall f \in \mathcal{C}(\mathcal{I}_r), \space \exists N \in \mathcal{N\_+}, \space \alpha_j, b_j \in \mathcal{R}, \space \boldsymbol{w}_j \in \mathcal{R}^r, (j = 1, \cdots, N)$ ,使得函数 + +$$G(\boldsymbol{x}) = \sum_{j=1}^{N} \alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)$$ + +可以以任意精度近似函数 $f$ ,即: + +对于 $\forall \epsilon > 0$ ,有 + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{N} {\alpha_{j} \Psi(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r.$$ + +将 $\Psi_{Relu}(\boldsymbol{x}) = \text{ReLU}(\boldsymbol{x} + 0.5) - \text{ReLU}(\boldsymbol{x} - 0.5)$ 代入上式,即可得: + +$$\begin{align} |G(\boldsymbol{x}) - f(\boldsymbol{x})| &= |\sum_{j=1}^{N} {\alpha_{j} \Psi\_{Relu}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j)} - f(\boldsymbol{x})| \\\\ &= |\sum_{j=1}^{N}{\alpha_j [\text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5) - \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j - 0.5)]} - f(\boldsymbol{x})| \\\\ &=|\sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}\_j^T \boldsymbol{x} + b\_j + 0.5)} - \sum_{j=1}^{N}{\alpha_j \text{ReLU}(\boldsymbol{w}_j^T \boldsymbol{x} + b_j - 0.5)} - f(\boldsymbol{x})| \\\\ < \epsilon, \end{align}$$ + + +进一步调整,可以得到: + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{2N} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, $$ + + +其中 +$$\tilde{\alpha}_j = \begin{cases} \alpha_j, && 1 \le j 
\le N \\\\ -\alpha_j, && N < j \le 2N \end{cases}, $$ + +$$\tilde{\boldsymbol{w}}_j = \begin{cases} \boldsymbol{w}_j, && 1 \le j \le N \\\\ \boldsymbol{w}\_{j-N}, && N < j \le 2N \end{cases}, $$ + +$$\tilde{b}_j = \begin{cases} b_j + 0.5, && 1 \le j \le N \\\\ b_j - 0.5, && N < j \le 2N \end{cases}. $$ + +记 $M = 2N \in \mathcal{N_+} $ ,即可得: + +$$G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)},$$ + +于是, + +对于 $\forall \epsilon > 0, \space \forall f \in \mathcal{C}(\mathcal{I}_r), \space \exists M \in \mathcal{N\_+}, \space \alpha_j, b_j \in \mathcal{R}, \space \boldsymbol{w}_j \in \mathcal{R}^r, (j = 1, \cdots, N)$ ,使得 + +$$|G(\boldsymbol{x}) - f(\boldsymbol{x})| = |\sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}_j^T \boldsymbol{x} + \tilde{b}_j)} - f(\boldsymbol{x})| < \epsilon, \space \forall \boldsymbol{x} \in \mathcal{I}_r. $$ + +即集合 + +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^r, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}$$ + +在 $\mathcal{C}(\mathcal{I}_r)$ 上是一致稠密的。 + +该集合正是两层 $\text{ReLU}$ 神经网络能够实现的所有函数的集合,其中 $M \in \mathcal{N_+}$ 即隐藏层的神经元数量。 + +也就是说,对于两层 $\text{ReLU}$ 神经网络,我们一定能够通过有限个数($M \in \mathcal{N_+}$)的隐藏层神经元,以任意精度近似任何一个定义在实数空间 $\mathcal{R}^r$ 中的有界闭集 $\mathcal{I}_r$ 上的连续函数。 + +

$\blacksquare$
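
证明中的改写可以用数值方式直观验证。下面是一个示意性的小例子(基于 NumPy,参数随机给定,维度取 $r=1$,仅作演示):把 $N$ 个 $\Psi_{Relu}$ 单元按上述对应关系 $\tilde{\alpha}_j, \tilde{\boldsymbol{w}}_j, \tilde{b}_j$ 改写成 $2N$ 个普通 $\text{ReLU}$ 单元,两种形式的输出完全一致。

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def psi_relu(t):
    # squashing function built from two ReLU units
    return relu(t + 0.5) - relu(t - 0.5)

rng = np.random.default_rng(0)
N = 5
alpha = rng.normal(size=N)   # alpha_j
w = rng.normal(size=N)       # w_j (input dimension r = 1)
b = rng.normal(size=N)       # b_j

x = np.linspace(-2, 2, 1001)

# G(x) written with the squashing function psi_relu
G_psi = sum(alpha[j] * psi_relu(w[j] * x + b[j]) for j in range(N))

# the same G(x) rewritten as 2N plain ReLU units, as in the proof
alpha_t = np.concatenate([alpha, -alpha])    # alpha_tilde
w_t = np.concatenate([w, w])                 # w_tilde
b_t = np.concatenate([b + 0.5, b - 0.5])     # b_tilde
G_relu = sum(alpha_t[j] * relu(w_t[j] * x + b_t[j]) for j in range(2 * N))

print(np.max(np.abs(G_psi - G_relu)))   # ~0 (up to floating point error)
```
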

+ + + +## 实验证明 +通过以上的理论证明,我们知道两层 $\text{ReLU}$ 神经网络理论上能够以任意精度近似每一个定义在实数空间 $\mathcal{R}^r$ 上有界闭集 $\mathcal{I}_r$ 上的连续函数。但在实践中,如何调整神经网络的参数来完成这个任务,还需要进一步的探讨。 + +我们在下面尝试采用两种不同的方法来拟合给定的函数: + +1. 根据理论证明,通过几个 $\text{ReLU}$ 的神经元,在每一个给定的区间上对函数值进行拟合; +2. 根据常用的神经网络调整参数的方法——梯度下降,拟合对应区间内给定的函数。 + +目标是要尽可能好地拟合给定的函数。 + +### 神经网络模型 +根据证明,我们设计如下神经网络: +1. 输入 $\boldsymbol{x} \in \mathcal{R}^D$ ,经过一个 `affine` 层,得到 $$\boldsymbol{z} = \boldsymbol{x} \boldsymbol{w} + \boldsymbol{b} \in \mathcal{R}^M; $$ + +2. 而后通过 `ReLU` 激活层,得到 $$\boldsymbol{h} = \text{ReLU} (\boldsymbol{z}) = \max(\boldsymbol{z}, \boldsymbol{0}) \in \mathcal{R}^M; $$ + +3. 最后再通过一个线性映射,得到最终的输出 $$y = \boldsymbol{h} \boldsymbol{\alpha} \in \mathcal{R}$$ + +其中,$D, M \in \mathcal{N_+}$ ,$D$ 为输入的维度, $M$ 为隐藏层神经元个数, $\boldsymbol{w_1}$ 为 $D \times M$ 的矩阵, $\boldsymbol{w_2}$ 为 $M$ 维的向量。 + +模型大致结构如下: + +Two-layer ReLU Neural Network model + +这个神经网络能够实现的函数集合就是证明中的 +$$\\{G(\boldsymbol{x}) = \sum_{j=1}^{M} {\tilde{\alpha}\_j \text{ReLU}(\tilde{\boldsymbol{w}}\_j^T \boldsymbol{x} + \tilde{b}\_j)}, \space \boldsymbol{w} \in \mathcal{R}^D, \alpha\_j, b\_j \in \mathcal{R}, M \in \mathcal{N_+}\\}.$$ + +我们的神经网络模型能够处理任意维度( $D$ )的输入,但在接下去的实验中,为了方便可视化,实现的是针对 $1$ 维输入的拟合,即 $\mathcal{R} \to \mathcal{R}$ 的函数。 + + +### 实现方法 1 —— split into bumps +为了拟合一个定义在闭区间 $[a, b]$ 上函数 $F(x)$ ,可以将闭区间 $[a, b]$ 分割为有限个子区间(subinterval),即在 $[a, b]$ 间找到有限个点: + +$$a = x_0 < x_1 < x_2 < \cdots < x_n = b, $$ + +形成 $N$ 个子区间 $\Delta = [x_i, x_{i+1}]$ ,对该区间上的函数值进行拟合。 + +记该区间长度为 $\delta$ ,当 $\delta \to 0$ 时,我们就可以精确的近似函数 $F(x)$ . + +这和 [Riemann integral](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral") 的概念类似。 + +具体而言,我们将给定的闭区间分割为等长的 $N$ 个子区间,令每一个子区间上的预测值等于该区间中点处的函数值。 + +我们定义如下的 `bump function` : +$$b(x_1, x_2, h) = \begin{cases} 0, && x \le x_1 \\\\ h, && x_1 < x < x_2 \\\\ 0, && x \ge x_2 \end{cases}, $$ + +其中 $x_1 < x_2$ . + +`bump` 的宽度是 $\delta = x_2 - x_1$ ,高度是 $h$ 。 + +根据理论证明,我们可以通过两个 $\text{ReLU}$ 函数(即两个隐藏层的神经元)来实现一个挤压函数(`squashing function`);而实际上,通过两个挤压函数就可以近似一个 `bump function` ,通过控制每一个的 `bump` 的高度和宽度,我们可以在任意小的区间 $\Delta$ 上通过 `bump` 来近似该区间对应的函数值。 + +近似 `bump function` 的方法: + +通过以下 $4$ 个 $\text{ReLU}$ 函数近似给定的 `bump function` $b(x_1, x_2, h)$ : + +$$\begin{align} y_1 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU}(x - x_1) \\\\ y_2 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_1 +\frac{x_2 - x_1}{d})] \\\\ y_3 &= - \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} [x - (x_2 - \frac{x_2 - x_1}{d})] \\\\ y_4 &= \frac{d \times h}{x_2 - x_1} \cdot \text{ReLU} (x - x_2) \end{align}$$ + +最终得到 $y = \sum_{i=1}^{4} {y_i}$ 作为 `bump function` 的近似。 + +其中 $h$ 即 `bump function` 的高度; + +$d$ 决定了 $\text{ReLU}$ 函数的斜率: `bump` 的宽度为 $x_2 - x_1$ ,而在其两侧的间断点 $x_1, x_2$ 处,$\text{ReLU}$ 函数需要通过斜率较大的直线近似,我们令这个区间的长度为 $\frac{x_2 - x_1}{d}$ 。 + +于是,当 $d \to + \infty$ 时, $\text{ReLU}$ 函数的斜率趋近于 $+ \infty$ ,近乎垂直于横轴,能够无限接近间断点处的函数。 + + +我们实现 `bump` 的模型如下: + +bump model + +得到函数: +$$y = \sum_{i=1}^{4} {y_i} , $$ +$$y_i = a_i \cdot \text{ReLU} (w_i x + b_i), \space i = 1,2,3,4$$ + +其中, + +$$w_1 = w_2 = w_3 = w_4 = 1 ; $$ + +$$\begin{align} a_1 &= a_4 = \frac{d \times h}{x_2 - x_1} \\\\ a_2 &= a_3 = -\frac{d \times h}{x_2 - x_1} ; \end{align}$$ + +$$\begin{align} b_1 &= -x_1 \\\\ b_2 &= - (x_1 +\frac{x_2 - x_1}{d}) \\\\ b_3 &= - (x_2 - \frac{x_2 - x_1}{d}) \\\\ b_4 &= - x_2 . \end{align}$$ + +如图,为我们通过 $4$ 个 $\text{ReLU}$ 函数实现的一个 `bump` : + +bump + +> lab_number = 2 + +这样,通过 $4N$ 个隐藏层的神经元,拟合 $N$ 个 `bump` ,可以拟合任意有界闭区间上连续的函数 $F(x)$ ,当 $N \to \infty, \space \delta \to 0$ 时,神经网络实现的函数 $G(x)$ 可以无限接近 $F(x)$ . 
+ +***Reference*** + +该方法参考 [Michael Nielsen (2019). Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning") + + +### 实现方法 2 —— Backpropagation +在 ***实现方法1*** 中,我们通过计算的方式设置了神经网络的参数,这种方式很容易理解,但需要非常多的 `bump` ( $N \to \infty$ )来精确的拟合函数,对应的神经元数量是 `bump` 数量的 $4$ 倍( $4N \to \infty$ ),原因就在于没有充分利用全连接神经网络中的参数。 + +在这里,对于相同结构的神经网络模型,我们使用反向传播算法自动训练神经网络的参数。具体步骤为: + +1. 通过定义的函数产生样本数据; +2. 前向传播后得到损失(采用 `mini-batch` 的方式); +3. 根据损失计算梯度,通过反向传播调整模型参数. + +#### 参数初始化 +由于在 Backpropagation 的方法中,神经网络参数的初始化对最终结果有一定影响,考虑到神经网络的激活函数是 $\text{ReLU}$ 函数,采用 $\text{Kaiming}$ 初始化的方法(参考 [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification"))。 + +方法如下: + +记第 $l$ 层的参数为 $\boldsymbol{w_l}$ ,该层的神经元个数为 $n_l$ ,令 +$$\boldsymbol{w_l} \sim \mathcal{N}(0, \sqrt{\frac{2}{n_l}}).$$ + +与论文中略微不同,实验中第一层网络中的 $\boldsymbol{b}$ 也通过这个方式进行了初始化。 + +具体的,我们通过 `np.random.randn` 生成服从 $\mathcal{N}(0, 1)$ 的数,而后乘以 $\sqrt{\frac{2}{n_l}}$ 来调整方差。 + +#### Loss Function +对于回归问题,常见的损失函数有 `MSE (Mean Squared Error)` 、 `MAE (Mean Absolute Error)` 和 `Huber Loss`. + +##### MSE (Mean Squared Error) +`MSE` 衡量的是模型预测值与真实值之间的距离平方的平均值。 +记 $\hat{y}$ 为模型预测值, $y$ 为函数真实值,对于 $N$ 个样本, +$$MSE = \frac{1}{N} \sum_{i=1}^{N}{(\hat{y}_i - y_i)^2}. $$ + +`MSE` 曲线光滑、连续、可导,且随着误差的减小,梯度也会减小,便于收敛;但是对于一些异常值较为敏感,可能会降低模型的性能。 + +##### MAE (Mean Absolute Error) +相对应的, +$$MAE = \frac{1}{N} \sum_{i=1}^{N}{|\hat{y}_i - y_i|}.$$ + +`MAE` 连续但在 $\hat{y}_i - y_i = 0$ 处不可导,且其他区域处处梯度相同,不利于函数收敛以及梯度下降的学习;但另一方面,由于不同大小的误差对应的梯度相同, `MAE` 对异常值不太敏感,可以弥补 `MSE` 这方面的问题. + +##### Huber Loss +`Huber Loss` 将 `MSE` 和 `MAE` 结合起来,公式为: +$$\mathcal{L}_{\lambda} (\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y}_i - y_i)^2, && |\hat{y}_i - y_i| \le \lambda \\\\ \lambda |\hat{y}_i - y_i| - \frac{1}{2} \lambda ^2, && |\hat{y}_i - y_i| > \lambda \end{cases}$$ + +其中,超参数 $\lambda$ 决定了对 `MSE` 和 `MAE` 的侧重性,当 $\lambda$ 较大时,我们更多的使用 `MSE` ,当 $\lambda$ 较小时,我们更多的使用 `MAE` . + +在实验中,我们采用 `Huber Loss` 作为损失函数,令 $\lambda = 1.0$ . + +#### 参数优化方法 +在实验中,尝试了两种优化参数的算法: `SGD` (随机梯度下降)和 `Adam` (参考 [Adam](https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c#:~:text=Adam%20is%20an%20adaptive%20learning%20rate%20method%2C%20which,rate%20for%20each%20weight%20of%20the%20neural%20network. "Adam") )。 + +对于 `Adam` 的参数,设定 $\beta_1 = 0.9, \space \beta_2 = 0.999.$ + +在实验中统一使用了 `Adam` 。 + +#### Grid Search +实现的代码可以对 `learning_rate` , `batch_size` , `hidden_size` 等超参数进行网格搜索。 + +搜索时验证集占整个训练集数据的 $20\%$ . + +#### 评价方式 +我们通过给定的函数先生成 $10,000$ 个样本数据,将其中的 $20\%$ 的数据划为测试集,而后对模型在测试集上的表现进行评价。(最终选定的模型会训练 `50` 个 `epoch` )。 + +选择 `R squared (coefficient of determination)` (也叫 `Proportion of Variance Explained` )对拟合结果进行评价,具体计算方法为: + +$$\begin{align}R^2 &= 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} \\\\ &= 1 - \frac{\sum_i{(\hat{y}_i - y_i})^2}{\sum_i{(y_i - \bar{y})^2}} \\\\ &= 1 - \frac{MSE}{Variance}\end{align}$$ + +其中, $\hat{y}_i$ 为模型的预测值; $y_i$ 为真实值; $\bar{y}$ 为真实值的均值。 + +$R^2 \in (-\infty, 1]$ , $R^2$ 的值越大,说明模型预测效果越好; + +当 $R^2 = 1$ 时,模型的预测值全都等于真实值;当 $R^2 = 0$ 时,一种可能的情况是“模型简单的预测所有的值都等于 $y$ 的平均值”; + +$R^2$ 一般在 $[0, 1]$ 之间,当 $R^2 < 0$ 时,模型预测非常差,可能用了错误模型或模型假设不合理。 + +$R^2$ 因为去除了真实值本身方差 $Variance$ 的因素,因而可以在不同的情况下进行比较。 + + + +### 拟合三类函数 +这里对以下三类函数进行拟合: +1. 多项式函数 +2. 三角多项式函数 +3. 
其他函数 + +选择这三类函数的原因是: + +根据我已有的数学知识,通过 $\text{Taylor}$ 级数(多项式函数) 和 $\text{Fourier}$ 级数(三角多项式函数)可以拟合大多数常见的函数; + +而其他函数中则包括了一些较为少见的 `pathological functions` ,来进一步测试神经网络的拟合能力。 + +实验中对每一个给定的函数观察在 $[-2, 2]$ 闭区间内的拟合情况。 + +对于 `实现方法1 (split into bumps)` ,统一规定将区间分割为 $30$ 个子区间。 + +#### 多项式函数 +##### Taylor 公式 +根据 [Taylor's Theorem](https://en.wikipedia.org/wiki/Taylor%27s_theorem "Taylor's Theorem") ,若函数 $f$ 在 $x_0$ 的某个邻域 $O(x_0, r)$ 上具有 $n+1$ 阶导数,且其 $n+1$ 阶余项收敛于 $0$ ,那么我们可以通过 $\text{Taylor}$ 级数的部分和,即 $n$ 次 $\text{Taylor}$ 多项式 +$$f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n$$ +在 $O(x_0, r)$ 上近似 $f(x)$ 。 + +虽然 $\text{Taylor}$ 公式要求函数 $n+1$ 阶可导,且只是局部近似,但根据 [Weierstrass Approximation Theorem](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem") ,任意一个在有界闭区间 $[a, b]$ 上连续的实函数 $f(x)$ 都可以展开成在这个区间上一致收敛的多项式函数,即: +对于 $\forall \epsilon > 0$ ,存在一个多项式函数 $P(x)$ ,使得 +$$|P(x) - f(x)| < \epsilon, \space x \in [a, b].$$ + +因此,若两层 $\text{ReLU}$ 神经网络能够拟合任意的多项式函数,那么它就能够拟合任意有界闭区间上的连续函数。 + +##### 实验拟合函数 +我们选择以下两个多项式函数进行拟合: +1. $$f(x) = 1 - x + x^2 - x^3 + x^4 - x^5 + x^6 - x^7$$ + +***实现方法1*** + +lab03 + +> lab_number = 3 + +***实现方法2*** + +经过网格搜索(每组超参数训练 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 90 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab09_loss + +拟合函数情况如下: + +lab09 + +> lab_number = 9 + +最终的损失 +$$R^2 = 0.99984384$$ +非常接近于 $1$ ,可以判断模型拟合的非常好。 + + +2. $$f(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \frac{x^{11}}{11}$$ + +***实现方法1*** + +lab04 + +> lab_number = 4 + +***实现方法2*** + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 30 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab10_loss + +拟合函数情况如下: + +lab10 + +> lab_number = 10 + +最终的损失 +$$R^2 = 0.99880380$$ +非常接近于 $1$ 。 + + +#### 三角多项式函数 +##### Fourier 级数 +根据 [Fourier Series](https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118548448.app1 "Fourier Series") ,若函数 $f$ $\text{Riemann}$ 可积,且在给定的周期 $[-T,T]$ 内,满足下两个条件之一: +1. 分段单调且有界; +2. 分段可导, + +则 $f$ 的 $\text{Fourier}$ 展开 +$$f(x) \sim \frac{a_0}{2} + \sum_{n=1}^{\infty}{(a_n \cos nx + b_n \sin nx)}, $$ + +在每一点均收敛,具体的,在 $x$ 点处收敛于 +$$\frac{f(x-0) + f(x+0)}{2}.$$ + +也就是说,当 $x$ 是 $f$ 的连续点时,$\text{Fourier}$ 级数收敛于 $f(x)$ ;对于左右极限存在的间断点,级数收敛于左右极限的算术平均值。 + +因而可以通过 $\text{Fourier}$ 级数的部分和,近似函数 $f$ . + +若两层 $\text{ReLU}$ 神经网络能够拟合任意的三角多项式函数,它就能够以任意精度拟合有界闭区间上的连续函数。 + +##### 实验拟合函数 +我们在这里尝试拟合各种类型的三角级数($\text{Fourier}$ 级数)的部分和函数。 + +与 $\text{Taylor}$ 级数不同,我们选择一些本身有有限间断点、或不可导的函数展开形成的 $\text{Fourier}$ 级数。 + +1. $$f(x) = \frac{1}{2} - \frac{2}{\pi} \sum_{n=1}^{10} {\frac{\sin (2n+1)x}{2n+1}}$$ + +该函数是 $F(x) = \begin{cases} 1, && x \in [-\pi, 0) \\\\ 0, && x \in [0, \pi)\end{cases}$ 的 $\text{Fourier}$ 展开的部分和。 + +***实现方法1*** + +lab05 + +> lab_number = 5 + +***实现方法2*** + +经过网格搜索(每组超参数 `20` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 30 | +| learning_rate | 0.0001 | + +模型学习过程中的损失 `loss` 变化如下: + +lab11_loss + +拟合函数情况如下: + +lab11 + +> lab_number = 11 + +对于这个函数,一开始网格搜索(每组超参数 `6` 个 `epoch`)时,训练结果并不理想,原因在于 `6` 个 `epoch` 训练不足,较小的学习率(梯度下降的 `step` )收敛较慢,表现不好;(在之前的实验中,网格搜索也都选择了最大的学习率);而由于该函数波动比较大,学习率不能过大,否则很容易跳过最优点。 + +因而我们调整了网格搜索中每组超参数训练的 `epoch` 数量,最终得到了更好的结果: +$$R^2 = 0.99973780$$ +非常接近于 $1$ 。 + + +2. 
$$f(x) = \sum_{n=0}^{10} {\frac{1}{2^n} \cos (17^n \pi x)}$$ + +这是 [Weierstrass function](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function") 的部分和。 + +具体的,令 $a \in (0,1)$ , $b$ 为奇数,且 $ab > 1 + \frac{3 \pi}{2}$ ,则 $\text{Weierstrass}$ 函数 +$$F(x) = \sum_{n=0}^{\infty} {a^n \cos (b^n \pi x)}$$ +绝对收敛,且处处连续,处处不可导. + +这里选择 $a = \frac{1}{2}, \space b = 17, $ 考察神经网络对级数部分和的拟合情况: +$$f(x) = \sum_{n=0}^{10} {(\frac{1}{2})^n \cos (17^n \pi x)}$$ + + +***实现方法1*** + +lab06 + +> lab_number = 6 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 1000 | +| batch_size | 90 | +| learning_rate | 0.001 | + + +模型学习过程中的损失 `loss` 变化如下: + +lab12_loss + +拟合函数情况如下: + +lab12 + +> lab_number = 12 + +最终的损失 +$$R^2 = 0.10848555.$$ + +这个函数由于波动太剧烈,非常难学习。 + +对于 ***实现方法1*** ,通过加大 `bump` 的数量 $N$ ,可以更好的拟合这个函数; + +但对于 ***实现方法2*** ,我们改变了网格搜索时训练的 `epoch` 数量、样本数据大小、学习率等超参数,仍然难以很好的进行拟合,说明当遇到这种波动非常大,甚至在极端情况下处处不可导的函数时,通过神经网络进行拟合还是有一定困难的。 + + +#### 其他函数 +我们进一步挑选了一些 `pathological functions (病态函数)` 尝试拟合。 + +1. Mollifier function +$$f(x) = \begin{cases} \exp\\{- \frac{1}{1 - |x|^2}\\}, && |x| < 1 \\\\ 0, && |x| \ge 1 \end{cases}, $$ + +[Mollifier](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier") 函数无穷阶可导,但在 $|x| = 1$ 处导数消失,因而不解析,其展开的 $\text{Taylor}$ 级数不收敛于该函数。 + +***实现方法1*** + +lab07 + +> lab_number = 7 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 100 | +| batch_size | 60 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab13_loss + +拟合函数情况如下: + +lab13 + +> lab_number = 13 + +最终的损失 +$$R^2 = 0.99987666$$ + + +2. Non-analytic smooth function +$$f(x) = \begin{cases} \exp \\{- \frac{1}{x}\\}, && x > 0 \\\\ 0, && x \le 0 \end{cases}$$ + +[Non-analytic smooth function](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function") 在实数轴上处处无穷阶可导且导数连续,但不解析,由于其原点处的导数为 $0$ ,该点处的 $\text{Taylor}$ 级数收敛于 $0$ ,即: +$$\sum_{n=0}^{\infty} {\frac{f^{(n)}(0)}{n!} x^n} = \sum_{n=0}^{\infty}{\frac{0}{n!} x^n} = 0, \space x \in \mathcal{R}$$ + +***实现方法1*** + +lab08 + +> lab_number = 8 + +***实现方法2*** + + +经过网格搜索(每组超参数 `6` 个 `epoch`)后,选择的超参数为: + +| Hyperparameter | Value | +| :--------: | :-----: | +| hidden_size | 100 | +| batch_size | 30 | +| learning_rate | 0.01 | + +模型学习过程中的损失 `loss` 变化如下: + +lab14_loss + + +拟合函数情况如下: + +lab14 + +> lab_number = 14 + +最终的损失 +$$R^2 = 0.99988132$$ + +可以看到,对于这种光滑的函数,即使不解析,不能通过 $\text{Taylor}$ 级数近似,神经网络的拟合情况依然很好。 + + +## 总结 +- 第一部分通过理论证明了,双层 $\text{ReLU}$ 神经网络具有近似任意有界闭集上的连续函数的能力。 + +- 而后尝试了两种实现的方法: + - `split into bumps` ,即将区间分割成 $N$ 个子区间,而后对每个子区间上的函数值进行拟合,即拟合不同的 `bump` 。 + + - 拟合的具体方法为 $$4 \times \text{ReLU neuron} \longrightarrow 2 \times \text{squashing function} \longrightarrow 1 \times \text{bump}.$$ + + - 因而隐藏层的神经元( $4N$ )越多,对函数拟合得越好; + + - 这种方法便于理解,可解释性较强,但需要极大数量的隐藏层神经元来对函数进行拟合,没有充分利用全连接神经网络中的参数。 + + - `Backpropagation` ,即常用的训练神经网络模型的方法,通过标注数据以及给定的损失函数,自动训练模型的参数 + + - 这种方法更充分的利用了模型中的参数,通过较少的隐藏层神经元实现拟合; + + - 在训练过程中,参数的初始化、训练时的学习率等都会对模型的拟合产生很大的影响;如何更好的训练模型,使模型更高效的收敛,还有很大探究的空间; + + - 对于波动较为剧烈的函数,通过 `Backpropagation` 的方法较难拟合; + + - 通过这种方法训练时,不同的超参数、参数对模型表现的影响较难分析,模型可解释性较差。 + + +## Reference + +[1]. [Hornik et al., 1989. Multilayer Feedforward Networks are Universal Approximators.](https://www.cs.cmu.edu/~epxing/Class/10715/reading/Kornick_et_al.pdf "Multilayer Feedforward Networks are Universal Approximators") + +[2]. [Michael Nielsen, 2019. 
Neural Networks and Deep Learning.](http://neuralnetworksanddeeplearning.com/chap4.html "Neural Networks and Deep Learning") + +[3]. [K. He, X. Zhang, S. Ren and J. Sun, 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.](https://arxiv.org/pdf/1502.01852.pdf "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification") + +[4]. [Weierstrass Approximation Theorem.](https://en.wikipedia.org/wiki/Stone%E2%80%93Weierstrass_theorem#Weierstrass_approximation_theorem "Weierstrass Approximation Theorem") + +[5]. [Mollifier function.](https://en.formulasearchengine.com/wiki/Mollifier "Mollifier") + +[6]. [Non-analytic smooth function.](https://en.wikipedia.org/wiki/Non-analytic_smooth_function#A_smooth_function_which_is_nowhere_real_analytic "Non-analytic smooth function") + +[7]. [Weierstrass function.](https://en.wikipedia.org/wiki/Weierstrass_function "Weierstrass function") + +[8]. [Riemann integral.](https://en.wikipedia.org/wiki/Riemann_integral "Riemann integral") + + +## 代码使用方法 +``` +python source.py lab_number #lab_number = 1, 2, ..., 14 + e.g., python source.py 1 +``` \ No newline at end of file diff --git a/assignment-2/submission/18300110042/img/bump_model.png b/assignment-2/submission/18300110042/img/bump_model.png new file mode 100644 index 0000000000000000000000000000000000000000..b163c5d1ac0ca1bcd383c1c2c220069d1f56446d Binary files /dev/null and b/assignment-2/submission/18300110042/img/bump_model.png differ diff --git a/assignment-2/submission/18300110042/img/lab01.png b/assignment-2/submission/18300110042/img/lab01.png new file mode 100644 index 0000000000000000000000000000000000000000..7b048b4ddaffa996adb312786b1f8adbea953f39 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab01.png differ diff --git a/assignment-2/submission/18300110042/img/lab02.png b/assignment-2/submission/18300110042/img/lab02.png new file mode 100644 index 0000000000000000000000000000000000000000..249c273af2d5b1322aaeb2b5455feaa318030f1e Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab02.png differ diff --git a/assignment-2/submission/18300110042/img/lab03.png b/assignment-2/submission/18300110042/img/lab03.png new file mode 100644 index 0000000000000000000000000000000000000000..8f1bb896a573b309a0d60b52172c6d87777d6ce4 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab03.png differ diff --git a/assignment-2/submission/18300110042/img/lab04.png b/assignment-2/submission/18300110042/img/lab04.png new file mode 100644 index 0000000000000000000000000000000000000000..3f2b420672a840f8396f1325ba0f559af9aba757 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab04.png differ diff --git a/assignment-2/submission/18300110042/img/lab05.png b/assignment-2/submission/18300110042/img/lab05.png new file mode 100644 index 0000000000000000000000000000000000000000..1922301cd722f868955da6882914b67d64c0c8ee Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab05.png differ diff --git a/assignment-2/submission/18300110042/img/lab06.png b/assignment-2/submission/18300110042/img/lab06.png new file mode 100644 index 0000000000000000000000000000000000000000..6806f3108bac9280a260da9824ee1a42a280de20 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab06.png differ diff --git a/assignment-2/submission/18300110042/img/lab07.png b/assignment-2/submission/18300110042/img/lab07.png new file mode 100644 index 
0000000000000000000000000000000000000000..dabb0e653c791eced71ca3cf466ec9a7e3fa736d Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab07.png differ diff --git a/assignment-2/submission/18300110042/img/lab08.png b/assignment-2/submission/18300110042/img/lab08.png new file mode 100644 index 0000000000000000000000000000000000000000..a2e9f57904639f48d29641eb6b9155bd39556dc7 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab08.png differ diff --git a/assignment-2/submission/18300110042/img/lab09.png b/assignment-2/submission/18300110042/img/lab09.png new file mode 100644 index 0000000000000000000000000000000000000000..1939508d083254cd8292fbd4ca705716ea222de2 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09.png differ diff --git a/assignment-2/submission/18300110042/img/lab09_loss.png b/assignment-2/submission/18300110042/img/lab09_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..c578d1c61a1fa6c2b3dd72dbb0ef98d59dc0558b Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab09_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab10.png b/assignment-2/submission/18300110042/img/lab10.png new file mode 100644 index 0000000000000000000000000000000000000000..f7cbd990e0eaf241702568e73861959594fc9146 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10.png differ diff --git a/assignment-2/submission/18300110042/img/lab10_loss.png b/assignment-2/submission/18300110042/img/lab10_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..74ddd6a1a0f93455f1a5e2235a37f7acae496408 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab10_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab11.png b/assignment-2/submission/18300110042/img/lab11.png new file mode 100644 index 0000000000000000000000000000000000000000..bac94610d4890e9d106c0a9d3c79bde4a358f694 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11.png differ diff --git a/assignment-2/submission/18300110042/img/lab11_loss.png b/assignment-2/submission/18300110042/img/lab11_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..eabb70779d0bba17a25f755a8395f8120b8c7b2c Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab11_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab12.png b/assignment-2/submission/18300110042/img/lab12.png new file mode 100644 index 0000000000000000000000000000000000000000..a5cdf8a04098f453dd1ae9053ada6c5bbb238780 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12.png differ diff --git a/assignment-2/submission/18300110042/img/lab12_loss.png b/assignment-2/submission/18300110042/img/lab12_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..672faf1adcdd6ff97d55a88cf03e43223f963f08 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab12_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab13.png b/assignment-2/submission/18300110042/img/lab13.png new file mode 100644 index 0000000000000000000000000000000000000000..4e418b9a656be8f0b8624067923fe33a25897cf7 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13.png differ diff --git a/assignment-2/submission/18300110042/img/lab13_loss.png b/assignment-2/submission/18300110042/img/lab13_loss.png new file mode 100644 index 
0000000000000000000000000000000000000000..479460dfb7d5262536277da01af1d182ef798fde Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab13_loss.png differ diff --git a/assignment-2/submission/18300110042/img/lab14.png b/assignment-2/submission/18300110042/img/lab14.png new file mode 100644 index 0000000000000000000000000000000000000000..5d5df09671f88a3ef56dbcec2aa824c479af4f30 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14.png differ diff --git a/assignment-2/submission/18300110042/img/lab14_loss.png b/assignment-2/submission/18300110042/img/lab14_loss.png new file mode 100644 index 0000000000000000000000000000000000000000..dd9c4c3ef9c89dabb7de5274a8d3e477e5e32442 Binary files /dev/null and b/assignment-2/submission/18300110042/img/lab14_loss.png differ diff --git a/assignment-2/submission/18300110042/img/nn_model.png b/assignment-2/submission/18300110042/img/nn_model.png new file mode 100644 index 0000000000000000000000000000000000000000..964f1d1b689cd46847c73bc6f683dad36b9c9fa9 Binary files /dev/null and b/assignment-2/submission/18300110042/img/nn_model.png differ diff --git a/assignment-2/submission/18300110042/img/squashing_function_ReLU.png b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png new file mode 100644 index 0000000000000000000000000000000000000000..482a57daecc727c979a033bef4d24328b800476a Binary files /dev/null and b/assignment-2/submission/18300110042/img/squashing_function_ReLU.png differ diff --git a/assignment-2/submission/18300110042/source.py b/assignment-2/submission/18300110042/source.py new file mode 100644 index 0000000000000000000000000000000000000000..0977e20798c9de00510147bdd009ffe4efe63072 --- /dev/null +++ b/assignment-2/submission/18300110042/source.py @@ -0,0 +1,543 @@ +import sys, time +import numpy as np +import matplotlib.pyplot as plt +#from mpl_toolkits import mplot3d + +class MatMul: + def __init__(self, W): + self.params = [W] + self.grads = [np.zeros_like(W)] + self.x = None + + def forward(self, x): + W, = self.params + out = np.dot(x, W) + self.x = x + return out + + def backward(self, dout): + W, = self.params + dx = np.dot(dout, W.T) + dW = np.dot(self.x.T, dout) + self.grads[0][...] = dW + return dx + +class Affine: + def __init__(self, W, b): + self.params = [W, b] + self.grads = [np.zeros_like(W), np.zeros_like(b)] + self.x = None + + def forward(self, x): + W, b = self.params + out = np.dot(x, W) + b + self.x = x + return out + + def backward(self, dout): + W, b = self.params + dx = np.dot(dout, W.T) + dW = np.dot(self.x.T, dout) + db = np.sum(dout, axis=0) + self.grads[0][...] = dW + self.grads[1][...] 
= db + return dx + +class ReLU: + def __init__(self): + self.params, self.grads = [], [] + self.x = None + + def forward(self, x): + self.mask = (x <= 0) + out = x.copy() + out[self.mask] = 0 + return out + + def backward(self, dout): + dout[self.mask] = 0 + dx = dout + return dx + +class IdentityWithMSELoss: + ''' + identity neuron node with mean square error(MSE) loss + ''' + def __init__(self): + self.params, self.grads = [], [] + + def forward(self, x, t): + self.y = x + self.t = t + loss = 0.5 * np.mean(np.sum(np.square(x - t), axis=1)) + return loss + + def backward(self, dout=1): + batch_size = self.t.shape[0] + dx = (self.y - self.t) * dout / batch_size + return dx + +class IdentityWithHuberLoss: + ''' + identity neuron node with Huber error loss + ''' + def __init__(self, beta=1.0): + self.params, self.grads = [], [] + self.beta = beta + + def forward(self, x, t): + self.y = x + self.t = t + l1 = np.sum(self.beta * (np.abs(x - t) - 0.5 * self.beta), axis=1) + l2 = 0.5 * np.sum(np.square(x - t), axis=1) + l1_mask = (x - t >= self.beta).astype(int) + l2_mask = 1 - l1_mask + loss = np.mean(l1 * l1_mask + l2 * l2_mask) + return loss + + def backward(self, dout=1): + batch_size = self.t.shape[0] + d_l1 = self.beta * np.abs(self.y - self.t) * dout / batch_size + d_l2 = (self.y - self.t) * dout / batch_size + l1_mask = (self.y - self.t >= self.beta).astype(int) + l2_mask = 1 - l1_mask + dx = d_l1 * l1_mask + d_l2 * l2_mask + return dx + +class SGD: + ''' + stochastic gradient descent(SGD) optimization + ''' + def __init__(self, lr=0.001): + self.lr = lr + + def update(self, params, grads): + for i in range(len(params)): + params[i] -= self.lr * grads[i] + +class Adam: + ''' + adaptive moment(Adam) optimization + ''' + def __init__(self, lr=0.001, beta1=0.9, beta2=0.999): + self.lr = lr + self.beta1 = beta1 + self.beta2 = beta2 + self.iter = 0 + self.m = None + self.v = None + + def update(self, params, grads): + if self.m is None: + self.m, self.v = [], [] + for param in params: + self.m.append(np.zeros_like(param)) + self.v.append(np.zeros_like(param)) + self.iter += 1 + lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter) + for i in range(len(params)): + self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i]) + self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i]) + params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7) + +class TwoReLULayerNet: + def __init__(self, input_size=1, hidden_size=10, output_size=1, + weight_scale='he', loss_func='huber'): + ''' + weight_scale: scale value used to initialize weight. + there are 3 possible value 'he', 'xavier' and 'std' + 'he': set sqrt(2/n) as weight init scale, proposed by Kaiming He + 'xavier': set sqrt(1/n) as weight init scale, proposed by Xavier Glorot + 'std': set 0.01 as weight init scale + loss_func: loss function type. 
there are 2 possible value here, 'huber' and 'mse' + 'huber': Huber error loss + 'mse': MSE loss + ''' + self.input_size = input_size + self.hidden_size = hidden_size + self.output_size = output_size + self.weight_scale = weight_scale + self.loss_func = loss_func + # init w and b + scale = self.get_weight_scale(input_size) + W1 = scale * np.random.randn(input_size, hidden_size) + b1 = np.random.randn(hidden_size) + scale = self.get_weight_scale(hidden_size) + W2 = scale * np.random.randn(hidden_size, output_size) + # init layers + self.layers = [Affine(W1, b1), ReLU(), MatMul(W2)] + if loss_func == 'huber': + self.loss_layer = IdentityWithHuberLoss() + else: # mse loss function + self.loss_layer = IdentityWithMSELoss() + # init parameters and gradients + self.params, self.grads = [], [] + for layer in self.layers: + self.params += layer.params + self.grads += layer.grads + + def get_weight_scale(self, node_size): + scale = 0.01 + if self.weight_scale == 'he': + scale = np.sqrt(2.0 / node_size) + elif self.weight_scale == 'xavier': + scale = np.sqrt(2.0 / node_size) + return scale + + def predict(self, x): + for layer in self.layers: + x = layer.forward(x) + return x + + def forward(self, x, t): + y = self.predict(x) + loss = self.loss_layer.forward(y, t) + return loss + + def backward(self, dout=1): + dout = self.loss_layer.backward(dout) + for layer in reversed(self.layers): + dout = layer.backward(dout) + return dout + +class Trainer: + def __init__(self, model, optimizer): + self.model = model + self.optimizer = optimizer + self.loss_list = [] + self.eval_interval = None + + def fit(self, x, t, max_epoch=10, batch_size=36, eval_interval=20): + data_size = len(x) + max_iters = data_size // batch_size + self.eval_interval = eval_interval + model, optimizer = self.model, self.optimizer + epoch_loss, epoch_loss_count = 0, 0 + start_time = time.time() + for epoch in range(max_epoch): + idx = np.random.permutation(np.arange(data_size)) + x = x[idx] + t = t[idx] + iter_loss, iter_loss_count = 0, 0 + for iter in range(max_iters): + batch_x = x[iter*batch_size:(iter+1)*batch_size] + batch_t = t[iter*batch_size:(iter+1)*batch_size] + loss = model.forward(batch_x, batch_t) + model.backward() + optimizer.update(model.params, model.grads) + epoch_loss += loss + epoch_loss_count += 1 + iter_loss += loss + iter_loss_count += 1 + if (eval_interval is not None) and (iter % eval_interval) == 0: + iter_avg_loss = iter_loss / iter_loss_count + self.loss_list.append(float(iter_avg_loss)) + iter_loss, iter_loss_count = 0, 0 + epoch_avg_loss = epoch_loss / epoch_loss_count + elapsed_time = time.time() - start_time + print('| epoch %d | time %d[s] | loss %.2f' % (epoch + 1, elapsed_time, epoch_avg_loss)) + epoch_loss, epoch_loss_count = 0, 0 + + def plot(self, ylim=None): + x = np.arange(len(self.loss_list)) + if ylim is not None: + plt.ylim(*ylim) + plt.plot(x, self.loss_list, label='train') + plt.xlabel('iterations (x' + str(self.eval_interval) + ')') + plt.ylabel('loss') + plt.show() + +def grid_search_cv(x, t, param_grid, max_epoch=5): + input_size = x.shape[1] + output_size = t.shape[1] + hidden_size, weight_scale, loss_func, batch_size, learning_rate = np.meshgrid( + param_grid['hidden_size'], + param_grid['weight_scale'], + param_grid['loss_func'], + param_grid['batch_size'], + param_grid['learning_rate']) + params = [p for p in zip(hidden_size.flat, weight_scale.flat, + loss_func.flat, batch_size.flat, learning_rate.flat)] + min_score = 99999999 + best_param = {} + for param in params: + print(param) + 
# split train data and validation data + (x_train, t_train), (x_val, t_val) = split_dataset(x, t, 0.2) + # train model with train data + model = TwoReLULayerNet(input_size, param[0], output_size, param[1], param[2]) + optimizer = Adam(param[4]) + trainer = Trainer(model, optimizer) + trainer.fit(x_train, t_train, max_epoch, param[3], None) + # evaluate model with validation data + score = model.forward(x_val, t_val) + print('score: ', score) + if score < min_score: + min_score = score + best_param['hidden_size'] = param[0] + best_param['weight_scale'] = param[1] + best_param['loss_func'] = param[2] + best_param['batch_size'] = param[3] + best_param['learning_rate'] = param[4] + return best_param + +def split_dataset(x, t, rate=0.2): + N = x.shape[0] + idx = np.random.permutation(np.arange(N)) + x = x[idx] + t = t[idx] + split_idx = int(N * rate) + x_train = x[split_idx:] + t_train = t[split_idx:] + x_test = x[:split_idx] + t_test = t[:split_idx] + return (x_train, t_train), (x_test, t_test) + +def load_data(func, params): + # parse func parameters + x = params.get('x', np.array([])) + if len(x) == 1: + x = x[0] + else: + d_grid = np.meshgrid(*x) + x = np.array([p for p in zip(*[d.flat for d in d_grid])]) + kwargs = params.get('kwargs', np.array([])) + # execute function to get 't' + t = eval(func) + # reshape x and t + N = x.shape[0] + x = x.reshape(N, -1) + N = t.shape[0] + t = t.reshape(N, -1) + return x, t + +def relu_func(x): + y = np.zeros_like(x) + y[...] = x + y[x < 0] = 0 + return y + +def bump_func_1d(x, x1, x2, h): + y = np.ones_like(x) * h + y[x < x1] = 0 + y[x > x2] = 0 + return y + +def approx_bump_1d(x, x1, x2, h, d): + y1 = d * h * relu_func(x - x1) / (x2 - x1) + y2 = -d * h * relu_func(x - (x1 + (x2 - x1) / d)) / (x2 - x1) + y3 = -d * h * relu_func(x - (x2 - (x2 - x1) / d)) / (x2 - x1) + y4 = d * h * relu_func(x - x2) / (x2 - x1) + return y1 + y2 + y3 + y4 + +def rsqare(x, t): + mse = np.mean(np.sum(np.square(x - t), axis=1)) + loss = 1 - mse / np.var(t) + return loss + +def approx_func3(x): + sum = np.zeros_like(x) + for i in range(10): + t = np.sin((2 * i + 1) * x) / (2 * i + 1) + sum += t + return 0.5 - 2 * sum / np.pi + +def approx_func4(x): + sum = np.zeros_like(x) + for i in range(10): + t = np.cos((np.power(17,i) * np.pi * x) / np.power(2,i)) + sum += t + return sum + +def approx_func5(x): + t = np.exp(-1 / (1 - np.square(x))) + t[np.abs(x) >= 1] = 0 + return t + +def approx_func6(x): + t = np.exp(-1 / x) + t[x <= 0] = 0 + return t + +max_epoch = 50 +eval_interval = 100 +data_size = 10000 +x_begin, x_end = -2, 2 +param_grid = { + 'hidden_size': [10, 100, 1000], + 'batch_size': [30, 60, 90], + 'learning_rate': [1e-2, 1e-4, 1e-6], + 'weight_scale': ['he'], #['he', 'xavier', 'std'], + 'loss_func': ['huber'] #['huber', 'mse'] +} +l_config = [ + { # 1 + 'func': 'relu_func(x + 0.5) - relu_func(x - 0.5)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}}, + 'approx': 'draw', + 'title': 'A squashing function constructed with ReLU' + }, + { # 2 + 'func': 'approx_bump_1d(x, **kwargs)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)], 'kwargs': {'x1': -1, 'x2': 1, 'h': 5, 'd': 1000}}, + 'approx': 'draw', + 'title': '' + }, + { # 3 + 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 4 + 'func': 'x - 
np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 5 + 'func': 'approx_func3(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 6 + 'func': 'approx_func4(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*10000)]}, + 'approx': 'split', + 'split_num': 300, + 'title': '' + }, + { # 7 + 'func': 'approx_func5(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 8 + 'func': 'approx_func6(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size*100)]}, + 'approx': 'split', + 'split_num': 30, + 'title': '' + }, + { # 9 + 'func': '1 - x + np.power(x,2) - np.power(x,3) + np.power(x,4) - np.power(x,5)/5 + np.power(x,6)/6 - np.power(x,7)/7', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 10 + 'func': 'x - np.power(x,3)/3 + np.power(x,5)/5 - np.power(x,7)/7 + np.power(x,9)/9 - np.power(x,11)/11', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 11 + 'func': 'approx_func3(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 12 + 'func': 'approx_func4(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 13 + 'func': 'approx_func5(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, + { # 14 + 'func': 'approx_func6(x)', + 'params': {'x': [np.linspace(x_begin, x_end, data_size)]}, + 'approx': 'dnn', + 'title': '' + }, +] + +if __name__ == '__main__': + max_lab = len(l_config) + + # parse user input + if len(sys.argv) > 1 and sys.argv[1].isnumeric(): + lab_num = int(sys.argv[1]) + else: + print('Usage: python source lab_num (1 ~ {0})'.format(max_lab)) + sys.exit(-1) + + if lab_num > max_lab or lab_num < 1: + print('Usage: python source lab_num (1 ~ {0})'.format(max_lab)) + sys.exit(-1) + + # prepare dataset + func = l_config[lab_num-1]['func'] + params = l_config[lab_num-1]['params'] + approx = l_config[lab_num-1]['approx'] + title = l_config[lab_num-1]['title'] + x_data, t_data = load_data(func, params) + + dim = x_data.shape[1] + if dim > 1: + #print('Cannot draw diagram with dimension higher tha 3.') + print('Only implement senario with dimension 1') + exit(-1) + + # approximate function + if approx == 'dnn': + (x_train, t_train), (x_test, t_test) = split_dataset(x_data, t_data) + # grid search best hyperparameters + best_param = grid_search_cv(x_train, t_train, param_grid, max_epoch=6) + # create neuron network to approximate target function + model = TwoReLULayerNet(x_train.shape[1], + best_param['hidden_size'], + t_train.shape[1], + best_param['weight_scale'], + best_param['loss_func']) + optimizer = Adam(best_param['learning_rate']) + trainer = Trainer(model, optimizer) + trainer.fit(x_train, t_train, max_epoch, best_param['batch_size'], eval_interval) + # evaluate model with test data + print('best param: ', best_param) + t_predict = model.predict(x_test) + test_loss = rsqare(t_predict, t_test) + print('| final test | loss %.8f' % (test_loss)) + # draw diagram + trainer.plot() + plt.plot(x_data, t_data) + plt.plot(x_data, model.predict(x_data), color='coral') + elif approx == 
'split': + # split x for each dimention + plt.plot(x_data, t_data) + split_num = l_config[lab_num-1]['split_num'] + x_split = np.linspace(x_begin, x_end, split_num+1) + x_sample = np.array([(x_split[i] + x_split[i+1]) / 2 for i in range(len(x_split)-1)]) + xs = np.array_split(x_data, split_num) + t_bumps = np.array([]) + for i in range(split_num): + # get h + params['x'] = [x_sample[i:i+1]] + _, h = load_data(func, params) + # get x1, x2 + x1 = x_split[i] + x2 = x_split[i+1] + # load bump function data + bump_params = {} + bump_params['x'] = [xs[i]] + bump_params['kwargs'] = {'x1': x1, 'x2': x2, 'h': h[0], 'd': 1000} + bump_func = 'approx_bump_1d(x, **kwargs)' + x_bump, t_bump = load_data(bump_func, bump_params) + t_bumps = np.concatenate((t_bumps, t_bump.flatten())) + plt.plot(x_bump, t_bump, color='coral') + else: + plt.plot(x_data, t_data, color='coral') + plt.title(title) + plt.grid() + plt.show() \ No newline at end of file