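# DeepLearningFromScratch_2

**Repository Path**: onewayout/deep-learning-from-scratch-2

## Basic Information

- **Project Name**: DeepLearningFromScratch_2
- **Description**: Notes taken while studying the book *Deep Learning from Scratch* (《深度学习入门-基于Python的理论与实现》)
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 1
- **Created**: 2025-04-21
- **Last Updated**: 2025-04-21

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

Table of Contents

- [I. Preface](#0)
  - [Preparation](#0.1)
- [II. Contents](#0.2)
- [Chapter 1: Getting Started with Python](#1)
  - [1. Matplotlib](#1.1)
    - [1.1.1 pyplot](#1.1.1)
    - [1.1.2 Displaying images](#1.1.2)
- [Chapter 2: The Perceptron](#2)
  - [2.1 What is a perceptron?](#2.1)
  - [2.2 Implementing the perceptron](#2.2)
    - [2.2.1 Introducing weights and bias](#2.2.1)
    - [2.2.2 Implementing the AND, NAND, and OR gates](#2.2.2)
  - [2.3 Limitations of the perceptron](#2.3)
  - [2.4 Multi-layer perceptrons](#2.4)
- [Chapter 3: Neural Networks](#3)
  - [3.1 From perceptrons to neural networks](#3.1)
    - [3.1.1 Neural networks, perceptrons, and activation functions](#3.1.1)
  - [3.2 Activation functions](#3.2)
    - [3.2.1 The sigmoid function](#3.2.1)
    - [3.2.2 Plotting the step function and the sigmoid function](#3.2.2)
    - [3.2.3 The ReLU function](#3.2.3)
  - [3.3 Designing and implementing a three-layer network](#3.3)
  - [3.4 Designing the output layer](#3.4)
    - [3.4.1 The identity function and the softmax function](#3.4.1)
  - [3.5 Handwritten digit recognition](#3.5)
    - [3.5.1 The MNIST dataset](#3.5.1)
    - [3.5.2 Inference with a neural network](#3.5.2)
    - [3.5.3 Batch processing](#3.5.3)
- [Chapter 4: Neural Network Learning](#4)
  - [4.1 Learning from data](#4.1)
    - [4.1.1 Data-driven approaches](#4.1.1)
    - [4.1.2 Training data and test data](#4.1.2)
  - [4.2 Loss functions](#4.2)
    - [4.2.1 Mean squared error](#4.2.1)
    - [4.2.2 Cross-entropy error](#4.2.2)
    - [4.2.3 Mini-batch learning](#4.2.3)
    - [4.2.4 Implementing mini-batch cross-entropy](#4.2.4)
    - [4.2.5 Why a loss function is used](#4.2.5)
  - [4.3 Numerical differentiation](#4.3)
    - [4.3.1 Derivatives](#4.3.1)
  - [4.4 Gradients](#4.4)
    - [4.4.1 The gradient method](#4.4.1)
    - [4.4.2 Gradients of a neural network](#4.4.2)
  - [4.5 Implementing the learning algorithm](#4.5)
    - [4.5.1 A class for a 2-layer neural network](#4.5.1)
    - [4.5.2 Implementing mini-batch learning](#4.5.2)
    - [4.5.3 Evaluation on test data](#4.5.3)
- [Chapter 5: Error Backpropagation](#5)
  - [5.1 Computational graphs](#5.1)
  - [5.2 The chain rule](#5.2)
    - [5.2.1 Backpropagation on a computational graph](#5.2.1)
    - [5.2.2 The chain rule and computational graphs](#5.2.2)
  - [5.3 Backpropagation](#5.3)
    - [5.3.1 Backpropagation through an addition node](#5.3.1)
    - [5.3.2 Backpropagation through a multiplication node](#5.3.2)
  - [5.4 Implementing simple layers](#5.4)
    - [5.4.1 Implementing the multiplication and addition layers](#5.4.1)
  - [5.5 Implementing activation layers](#5.5)
    - [5.5.1 The ReLU layer](#5.5.1)
    - [5.5.2 The Sigmoid layer](#5.5.2)
  - [5.6 Implementing the Affine and Softmax layers](#5.6)
    - [5.6.1 The Affine layer](#5.6.1)
    - [5.6.2 The batch version of the Affine layer](#5.6.2)
    - [5.6.3 The Softmax-with-Loss layer](#5.6.3)
  - [5.7 Implementing error backpropagation](#5.7)
    - [5.7.1 The overall picture of neural network learning](#5.7.1)
    - [5.7.2 A neural network that supports error backpropagation](#5.7.2)
    - [5.7.3 Gradient check for error backpropagation](#5.7.3)
    - [5.7.4 Learning with error backpropagation](#5.7.4)

---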

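I. Preface

Preparation

These notes are based on the book *Deep Learning from Scratch* (《**深度学习入门-基于Python的理论与实现**》); the content departs from the original book in places. To run the book's code, install the following **Python** libraries first:

- `numpy # pip3 install numpy -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com`
- `matplotlib # pip3 install matplotlib`

---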

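II. Contents

Chapter 1: Getting Started with Python

1. Matplotlib

1.1.1 pyplot

> pyplot can be used to plot the graph of a function.

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate the data: x runs from 0 to 6 in steps of 0.1
x = np.arange(0, 6, 0.1)
y1 = np.sin(x)
y2 = np.cos(x)

# Draw both curves on the same axes
plt.plot(x, y1, label="sin")
plt.plot(x, y2, linestyle="--", label="cos")  # dashed line for cos
plt.xlabel("x")
plt.ylabel("y")
plt.title("sin&cos")
plt.legend()
plt.show()
```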

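1.1.2 Displaying images

```python
import matplotlib.pyplot as plt
from matplotlib.image import imread

img = imread('lena.png')  # read the image (adjust the path to your environment)
plt.imshow(img)
plt.show()
```

**To display more image formats**, install Pillow:

```python
# Run in a shell first: pip3 install pillow
import PIL
```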

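Chapter 2: The Perceptron

2.1 What is a perceptron?

> A perceptron receives several input signals and outputs a single signal. A perceptron signal takes only two values, "flow / no flow" (1/0). $x_1$ and $x_2$ are the input signals, and $w_1$ and $w_2$ are the weights. When the input signals are sent to the neuron, each is multiplied by its fixed weight. The neuron computes the sum of the incoming signals and outputs 1 only when that sum exceeds a certain limit; this is described as "**the neuron is activated**". The limit is called the **threshold** and is denoted by $\theta$. In mathematical notation:

$$ y = \begin{cases} 0 &(w_1x_1+w_2x_2 \leq \theta) \\ 1 &(w_1x_1+w_2x_2 > \theta) \end{cases} \tag{2.1} $$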

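2.2 Implementing the perceptron

2.2.1 Introducing weights and bias

> Replacing $\theta$ in equation (2.1) with $-b$ gives equation (2.2):

$$ y = \begin{cases} 0 &(b+w_1x_1+w_2x_2 \leq 0) \\ 1 &(b+w_1x_1+w_2x_2 > 0) \end{cases} \tag{2.2} $$

Here $b$ is called the **bias**.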

2.2.2 Implementing the AND, NAND, and OR gates

```python
import numpy as np

def AND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])    # weights
    b = -0.7                    # bias
    tmp = np.sum(w * x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

def NAND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])  # only the weights and bias differ from AND
    b = 0.7
    tmp = np.sum(w * x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])    # only the bias differs from AND
    b = -0.2
    tmp = np.sum(w * x) + b
    if tmp <= 0:
        return 0
    else:
        return 1

if __name__ == '__main__':
    print(NAND(0, 1))
```

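2.3 Limitations of the perceptron

> A single-layer perceptron cannot implement the XOR gate; this is a limitation of the perceptron. To see why, first visualize the behavior of the OR gate. Suppose the weight parameters are $(b,w_1,w_2)=(-0.5,1.0,1.0)$. The perceptron is then described by:

$$ y = \begin{cases} 0 &(-0.5+x_1+x_2 \leq 0) \\ 1 &(-0.5+x_1+x_2 > 0) \end{cases} \tag{2.3} $$

The perceptron in equation (2.3) produces two regions separated by the straight line $-0.5+x_1+x_2=0$: one region outputs 1 and the other outputs 0, as shown in Figure 2.1.

![image](https://note.youdao.com/yws/api/personal/file/WEBc35a15e6400aa99160843eb6161c5924?method=download&shareKey=074dc3b1ea4228e2e290bced9a8fea4f)

**The outputs of the OR gate can be separated by a single straight line.**

For the XOR gate:

![image](https://note.youdao.com/yws/api/personal/file/WEBc07ec021c6b1406f8b3c7abb17f09234?method=download&shareKey=1e231d79f3932938b17bf243ce7c30f0)

the outputs cannot be separated by a single straight line.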

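2.4 Multi-layer perceptrons

> A single perceptron cannot represent the XOR gate, but the beauty of perceptrons is that their **layers can be stacked**: the XOR gate can be built by combining the AND, NAND, and OR gates.

```python
def XOR(x1, x2):
    s1 = NAND(x1, x2)
    s2 = OR(x1, x2)
    y = AND(s1, s2)   # the second layer combines the outputs of the first
    return y
```

The XOR gate is a **neural network** with a **multi-layer structure**. A perceptron with several stacked layers is also called a **multi-layer perceptron**.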
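As a quick check, the sketch below prints the truth table of the stacked gates. To keep the block runnable on its own, the gates are restated in plain Python with the same weights and biases as in section 2.2.2; the helper `_perceptron` is a hypothetical name introduced only for this illustration.

```python
def _perceptron(x1, x2, w1, w2, b):
    # Hypothetical helper: returns 1 when the weighted sum plus bias is positive.
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

def AND(x1, x2):
    return _perceptron(x1, x2, 0.5, 0.5, -0.7)

def NAND(x1, x2):
    return _perceptron(x1, x2, -0.5, -0.5, 0.7)

def OR(x1, x2):
    return _perceptron(x1, x2, 0.5, 0.5, -0.2)

def XOR(x1, x2):
    # Two layers: NAND and OR feed into AND.
    return AND(NAND(x1, x2), OR(x1, x2))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_table = [XOR(x1, x2) for x1, x2 in inputs]
for (x1, x2), out in zip(inputs, xor_table):
    print(x1, x2, "->", out)
```

The output is 1 only when exactly one input is 1, which no single perceptron of the form (2.2) can produce.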

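Chapter 3: Neural Networks

3.1 From perceptrons to neural networks

3.1.1 Neural networks, perceptrons, and activation functions

In a neural network, the leftmost column is called the **input layer**, the middle column the **hidden (intermediate) layer**, and the rightmost column the **output layer**.

![image](https://note.youdao.com/yws/api/personal/file/WEBadfb7d80a48bf238aef51fe727743057?method=download&shareKey=6d18d89002f8d5c6bd1da1fef1866808)

We rewrite equation (2.2) in the following form:

$$ y =h(b+w_1x_1+w_2x_2) \tag{3.1} $$

$$ h(x) = \begin{cases} 0 &(x \leq 0) \\ 1 &(x > 0) \end{cases} \tag{3.2} $$

In equation (3.1) the input signals are transformed by the function $h(x)$, and the transformed value is the output $y$. A function such as $h(x)$, which converts the sum of the input signals into an output signal, is generally called an **activation function**. Its role is to decide how the sum of the input signals is activated. Going one step further, equation (3.1) can be split into two stages: first compute the weighted sum of the input signals, then transform that sum with the activation function.

$$ a = b+w_1x_1+w_2x_2 \tag{3.3} $$

$$ y = h(a) \tag{3.4} $$

The whole process is shown in Figure 3.2.

![image](https://note.youdao.com/yws/api/personal/file/WEB7c828accde4cfbd98eac0da4c8034c41?method=download&shareKey=2bfbf99d495de393f75ae5c9b16f346d)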

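3.2 Activation functions

3.2.1 The sigmoid function

> One activation function that is frequently used in neural networks is the **sigmoid function** in equation (3.5).

$$ h(x) = \frac{1}{1+\exp(-x)} \tag{3.5} $$

Here $\exp(-x)$ means $e^{-x}$. A neural network uses the sigmoid function as its activation function to transform signals, and the transformed signal is passed on to the next neuron. In fact, the main difference between a perceptron and a neural network lies in the activation function.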

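3.2.2 Plotting the step function and the sigmoid function

> The step function outputs 1 when the input exceeds 0 and outputs 0 otherwise. The version below accepts NumPy arrays.

```python
import numpy as np

def step_function(x):
    y = x > 0               # produces a boolean NumPy array
    return y.astype(int)    # convert bool to int (np.int was removed in NumPy 1.24)
```

Plotting the step function and the sigmoid function:

```python
# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def step_function(x):
    return np.array(x > 0, dtype=int)

x = np.arange(-5.0, 5.0, 0.1)
y1 = sigmoid(x)
y2 = step_function(x)

plt.plot(x, y1)
plt.plot(x, y2, 'k--')   # black dashed line for the step function
plt.ylim(-0.1, 1.1)      # set the range of the y axis
plt.show()
```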

3.2.3 The ReLU function

The ReLU function can be written as:

$$ h(x) = \begin{cases} x &(x > 0) \\ 0 &(x \leq 0) \end{cases} \tag{3.6} $$

```python
import numpy as np

def relu(x):
    # element-wise maximum of 0 and x
    return np.maximum(0, x)
```

3.3 Designing and implementing a three-layer network

The implementation is as follows:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def identity_function(x):
    return x

def init_network():
    # Weights and biases of the three layers, stored in a dictionary
    network = {}
    network['W1'] = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])
    network['b1'] = np.array([0.1, 0.2, 0.3])
    network['W2'] = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])
    network['b2'] = np.array([0.1, 0.2])
    network['W3'] = np.array([[0.1, 0.3], [0.2, 0.4]])
    network['b3'] = np.array([0.1, 0.2])
    return network

def forward(network, x):
    W1, W2, W3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    a2 = np.dot(z1, W2) + b2
    z2 = sigmoid(a2)
    a3 = np.dot(z2, W3) + b3
    y = identity_function(a3)   # output layer uses the identity function
    return y

network = init_network()
x = np.array([1.0, 0.5])
y = forward(network, x)
print(y)
```

> The code defines the functions `init_network()` and `forward()`. `init_network()` **initializes the weights and biases** and stores them in the dictionary variable `network`, which holds the parameters (weights and biases) needed by each layer. `forward()` wraps the process of **converting the input signal into the output signal**.

> The activation function used in the output layer depends on the nature of the problem. In general, **regression** problems can use the **identity function**, **binary classification** problems can use the **sigmoid function**, and **multi-class classification** problems can use the **softmax function**.

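3.4 Designing the output layer

3.4.1 The identity function and the softmax function

> The identity function outputs its input unchanged. For **softmax**, suppose the output layer has $n$ neurons; the output $y_k$ of the $k$-th neuron is

$$ y_k = \frac{\exp(a_k)}{ \sum^n_{i = 1}\exp(a_i)} \tag{3.7} $$

Computed directly on a computer, this formula has a weakness: overflow. Equation (3.7) is therefore rewritten as follows:

$$ \begin{aligned} y_k = \frac{\exp(a_k)}{ \sum^n_{i = 1}\exp(a_i)} &= \frac{C\exp(a_k)}{ C\sum^n_{i = 1}\exp(a_i)} \\ &= \frac{\exp(a_k + \log C)}{ \sum^n_{i = 1}\exp(a_i + \log C)} \\ &= \frac{\exp(a_k + C')}{ \sum^n_{i = 1}\exp(a_i + C')} \end{aligned} \tag{3.8} $$

Here $C'$ can be any value; to guard against overflow, the **maximum** of the **input values** is normally subtracted, that is, $C' = -\max_i a_i$.

```python
import numpy as np

def softmax(a):
    c = np.max(a)
    exp_a = np.exp(a - c)   # guard against overflow
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y
```

> The outputs of the softmax function are real numbers between 0.0 and 1.0, and **they sum to 1**. This is an important property of softmax: because of it, the outputs can be interpreted as "probabilities".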
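To make the overflow issue concrete, here is a minimal sketch that compares a direct implementation of equation (3.7) with the shifted version of equation (3.8). The name `softmax_naive` and the input values are assumptions chosen for illustration.

```python
import numpy as np

def softmax_naive(a):
    # Direct translation of equation (3.7); overflows for large inputs.
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a)

def softmax(a):
    # Equation (3.8) with C' = -max(a): subtract the maximum before exponentiating.
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

a = np.array([1010.0, 1000.0, 990.0])   # illustrative large inputs

# exp(1010) overflows to inf, so inf / inf yields nan; warnings are silenced here.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(a)

stable = softmax(a)
print(naive)                 # every entry is nan
print(stable, stable.sum())  # valid probabilities
```

Shifting all inputs by the same constant leaves the result unchanged mathematically, so the stable version returns the same probabilities the naive one would return in exact arithmetic.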

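3.5 Handwritten digit recognition

3.5.1 The MNIST dataset

> MNIST is one of the best-known datasets in machine learning. Its images are 28 pixel $\times$ 28 pixel grayscale images (1 channel). The following script displays an MNIST image.

```python
# ch03/mnist_show.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
from dataset.mnist import load_mnist
from PIL import Image

def img_show(img):
    pil_img = Image.fromarray(np.uint8(img))
    pil_img.show()

(x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)

img = x_train[0]
label = t_train[0]
print(label)  # 5

print(img.shape)  # (784,)
img = img.reshape(28, 28)  # restore the original image shape
print(img.shape)  # (28, 28)

img_show(img)
```

> The `load_mnist` function returns the loaded MNIST data in the form "**(training images, training labels), (test images, test labels)**". `load_mnist(normalize=True, flatten=True, one_hot_label=False)` takes three parameters:

- **normalize**: whether to normalize the input images to 0.0~1.0. With False the values stay in 0~255.
- **flatten**: whether to flatten the input images into one-dimensional arrays. With False each image is kept as a 1$\times$28$\times$28 three-dimensional array; with True it becomes a one-dimensional array of 784 elements.
- **one_hot_label**: whether to store the labels in one-hot form (an array in which only the correct label is 1 and all others are 0), for example [0,0,1,0,0,0,0,0,0,0]. With False the correct label is simply stored as a number such as 7 or 2.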

3.5.2 Inference with a neural network

```python
# ch03/neuralnet_mnist.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import pickle
from dataset.mnist import load_mnist
from common.functions import sigmoid, softmax

def get_data():
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, flatten=True, one_hot_label=False)
    return x_test, t_test

def init_network():
    # Load the pre-trained weights
    with open("sample_weight.pkl", 'rb') as f:
        network = pickle.load(f)
    return network

def predict(network, x):
    W1, W2, W3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    a2 = np.dot(z1, W2) + b2
    z2 = sigmoid(a2)
    a3 = np.dot(z2, W3) + b3
    y = softmax(a3)
    return y

x, t = get_data()
network = init_network()
accuracy_cnt = 0
for i in range(len(x)):
    y = predict(network, x[i])
    p = np.argmax(y)  # index of the element with the highest probability
    if p == t[i]:
        accuracy_cnt += 1

print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
```

> **init_network()** reads the learned weight parameters stored in the pickle file sample_weight.pkl. **predict()** outputs the probability of each label as a NumPy array. Restricting data to a fixed range is called **normalization**. Applying a predetermined transformation to the input data of a neural network is called **pre-processing**.

3.5.3 Batch processing

> When several images are packed together and fed in at once, the packed input data is called a **batch**. Batch processing can greatly reduce the processing time per image. The batched version of the inference code:

```python
# ch03/neuralnet_mnist_batch.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import pickle
from dataset.mnist import load_mnist
from common.functions import sigmoid, softmax

def get_data():
    (x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, flatten=True, one_hot_label=False)
    return x_test, t_test

def init_network():
    with open("sample_weight.pkl", 'rb') as f:
        network = pickle.load(f)
    return network

def predict(network, x):
    w1, w2, w3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    a1 = np.dot(x, w1) + b1
    z1 = sigmoid(a1)
    a2 = np.dot(z1, w2) + b2
    z2 = sigmoid(a2)
    a3 = np.dot(z2, w3) + b3
    y = softmax(a3)
    return y

x, t = get_data()
network = init_network()

batch_size = 100  # number of images per batch
accuracy_cnt = 0

for i in range(0, len(x), batch_size):
    x_batch = x[i:i+batch_size]
    y_batch = predict(network, x_batch)
    p = np.argmax(y_batch, axis=1)  # most probable label for each row
    accuracy_cnt += np.sum(p == t[i:i+batch_size])

print("Accuracy:" + str(float(accuracy_cnt) / len(x)))
```

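Chapter 4: Neural Network Learning

4.1 Learning from data

> The defining feature of neural networks is that they **learn from data**. Learning from data means that the values of the **weight parameters** are determined automatically by the data.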

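4.1.1 Data-driven approaches

> To design a program that recognizes the digit "5", one approach is to extract **features** from the image and then use machine learning to learn the patterns in those features. A "feature" here means a transformer that accurately extracts the essential (important) data from the input data (the input image). In **computer vision**, commonly used features include **SIFT, SURF, and HOG**. These features convert image data into **vectors**, and a classifier such as **SVM or KNN** is then trained on the resulting vectors.

> In this **machine learning** approach, the machine finds regularities in the **collected** data, which solves the problem more efficiently, but the features are still designed by humans. Different problems require different features.

> A **neural network (deep learning)** learns directly from the images themselves, and the features are learned by the machine as well. The advantage of neural networks is that every problem can be handled with the same procedure.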

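4.1.2 Training data and test data

> In machine learning, the data is generally split into **training data (supervised data)** and **test data** for learning and experiments. The reason for the split is the pursuit of **generalization ability**, which is **the ability to handle data that has not been observed before (data not included in the training data)**. Achieving generalization is the ultimate goal of machine learning. A state in which a model fits one particular dataset too closely is called **overfitting**, and avoiding overfitting is an important topic in machine learning.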

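4.2 Loss functions

> Neural network learning expresses the current state through a metric and then searches for the optimal weight parameters using that metric as the reference. This metric is called the **loss function**. Any function can serve as a loss function, but **mean squared error** and **cross-entropy error** are the usual choices.

> The loss function measures how "bad" the performance of the neural network is, that is, to what extent the current network fails to fit the supervised data. Multiplying the loss function by a negative value turns it into a measure of "how not bad", that is, "how good", the performance is.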

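4.2.1 Mean squared error

The mean squared error is given by

$$ E=\frac{1}{2}\sum_k{(y_k-t_k)^2} \tag{4.1} $$

> Here $y_k$ is the output of the neural network, $t_k$ is the supervised data (the correct label), and $k$ indexes the dimensions of the data. The smaller $E$ is, the smaller the error.

```python
import numpy as np

def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t)**2)
```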

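4.2.2 Cross-entropy error

The cross-entropy error is given by

$$ E=-\sum_k{t_k\log \, y_k} \tag{4.2} $$

> $\log$ denotes the natural logarithm $\log_e$. $y_k$ is the output of the neural network, and $t_k$ is the correct label in one-hot representation. The higher the probability assigned to the correct label, the closer $E$ is to 0.

```python
import numpy as np

def cross_entropy_error(y, t):
    delta = 1e-7   # small constant that prevents log(0) = -inf
    return -np.sum(t * np.log(y + delta))
```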

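4.2.3 Mini-batch learning

> Machine learning learns from training data, and strictly speaking the loss function must be computed over all of the training data. If there are 100 training samples, the sum of those 100 losses is the metric used for learning. Taking the cross-entropy error as an example:

$$ E = -\frac{1}{N}\sum_n\sum_kt_{nk}\,\log\,y_{nk} \tag{4.3} $$

> $t_{nk}$ is the value of the $k$-th element of the $n$-th sample, and $N$ is the number of samples, so dividing by $N$ gives the average loss per sample. In practice, neural network learning selects a batch of samples from the training data (called a mini-batch) and learns from that batch; this is called **mini-batch learning**. To pick samples at random from the training data, use `np.random.choice(n, k)`, which randomly selects $k$ numbers between $0$ and $n-1$ and returns them as a NumPy array.

```python
# x_train, t_train: training data and labels loaded with load_mnist
train_size = x_train.shape[0]
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size)  # random sample indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]
```

> The mini-batch loss uses a subset of the samples to approximate the loss over the whole dataset. In other words, a randomly selected **mini-batch** serves as an approximation of the full training data.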

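4.2.4 Implementing mini-batch cross-entropy

When the supervised data is in one-hot form:

```python
import numpy as np

def cross_entropy(y, t):
    if y.ndim == 1:   # a single sample: reshape to a batch of size 1
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
```

When the supervised data is in label form (integer class indices):

```python
import numpy as np

def cross_entropy(y, t):
    if y.ndim == 1:   # a single sample: reshape to a batch of size 1
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    batch_size = y.shape[0]
    # y[np.arange(batch_size), t] picks the predicted probability of the correct class
    # for each sample, so no multiplication by t is needed here.
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
```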
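The two forms should give the same loss for the same predictions. The following sketch checks this on a tiny hand-made batch; the function names `cross_entropy_one_hot` and `cross_entropy_labels` and the numbers are assumptions used only for this illustration.

```python
import numpy as np

def cross_entropy_one_hot(y, t):
    # t is one-hot with the same shape as y: (batch_size, num_classes)
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size

def cross_entropy_labels(y, t):
    # t holds integer class indices with shape (batch_size,)
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

# Two samples, three classes (illustrative predicted probabilities)
y = np.array([[0.1, 0.7, 0.2],
              [0.8, 0.1, 0.1]])
t_labels = np.array([1, 0])          # correct classes
t_one_hot = np.eye(3)[t_labels]      # the same labels in one-hot form

loss_a = cross_entropy_one_hot(y, t_one_hot)
loss_b = cross_entropy_labels(y, t_labels)
print(loss_a, loss_b)
```

Both calls average $-\log 0.7$ and $-\log 0.8$, because only the probability assigned to the correct class contributes to the sum.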

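4.2.5 Why a loss function is used

> Suppose a neural network correctly recognizes 32 out of 100 samples, so its recognition accuracy is 32%. If accuracy were the metric, a slight change in the weight parameters would usually leave the accuracy at 32%, and when it did change it would jump discontinuously (to 33% or 34%). The derivative of accuracy with respect to the parameters is therefore 0 almost everywhere and gives no direction for an update. The value of the loss function, by contrast, is continuous rather than discrete: a small change in the parameters produces a small change in the loss, which is why the loss function is used as the metric for learning.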

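4.3 Numerical differentiation

4.3.1 Derivatives

> To reduce the error, the derivative is computed with a **central difference**.

```python
def numerical_diff(f, x):
    h = 1e-4  # 0.0001
    return (f(x + h) - f(x - h)) / (2 * h)
```

> As shown above, computing a derivative from a small difference is called **numerical differentiation**. Computing a derivative by manipulating the mathematical expression is called an "**analytic solution**". For example, $y=x^2$ can be differentiated analytically to give $\frac{dy}{dx}=2x$. A derivative obtained this way is the "true derivative" and contains no error.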
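The following sketch illustrates why the central difference is preferred. It compares the central difference with a one-sided (forward) difference against the analytic derivative of $f(x)=x^3$; the test function, the evaluation point, and the name `forward_diff` are assumptions chosen for illustration.

```python
def numerical_diff(f, x, h=1e-4):
    # Central difference: the error shrinks proportionally to h**2.
    return (f(x + h) - f(x - h)) / (2 * h)

def forward_diff(f, x, h=1e-4):
    # Forward difference: the error shrinks only proportionally to h.
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 3

x = 2.0
exact = 3 * x ** 2                      # analytic derivative of x**3
central_err = abs(numerical_diff(f, x) - exact)
forward_err = abs(forward_diff(f, x) - exact)
print(central_err, forward_err)
```

With the same step `h`, the central difference is several orders of magnitude closer to the analytic value in this example.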

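4.4 Gradients

> The derivative of a function of several variables with respect to one of those variables is called a **partial derivative**, written $\frac{\partial f}{\partial x_0}$, $\frac{\partial f}{\partial x_1}$. The vector that collects all the partial derivatives, such as $(\frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1})$, is called the **gradient**. At each point the gradient points in the direction in which the function value increases fastest, so **the negative gradient is the direction in which the function value decreases the most.**

```python
import numpy as np

def numerical_gradient(f, x):
    # x must be a 1-D float array; for a matrix, iterate over all indices instead
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)

    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)

        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)

        x[idx] = tmp_val  # restore the original value

    return grad
```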

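4.4.1 The gradient method

> The main task of machine learning and neural networks is to find the optimal parameters, that is, the parameters at which the loss function takes its minimum. The method that uses the gradient to look for the minimum of a function (or a value as small as possible) is the **gradient method**. The negative gradient gives the direction in which the function value decreases the most at each point. There is therefore no guarantee that this direction leads to the minimum of the function or is the direction one should really move in. In fact, for complicated functions the direction given by the gradient is usually not the direction of the minimum.

> Although the gradient direction does not necessarily point toward the minimum, moving along it reduces the function value as much as possible locally. In the gradient method, we move a certain distance from the current position in the direction given by the gradient, recompute the gradient at the new position, move again, and so on. Repeatedly moving in this way to gradually reduce the function value is the **gradient method**. Searching for a minimum is called **gradient descent**; searching for a maximum is called **gradient ascent**. In neural networks (deep learning), the gradient method generally means gradient descent. In mathematical notation:

$$ \begin{aligned} x_0 &= x_0-\eta\frac{\partial f}{\partial x_0} \\ x_1 &= x_1-\eta\frac{\partial f}{\partial x_1} \end{aligned} \tag{4.4} $$

$\eta$ determines the size of the update; in neural networks it is called the **learning rate**.

```python
import numpy as np

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    # numerical_gradient is the function defined in section 4.4
    x = init_x   # note: init_x is updated in place
    x_history = []

    for i in range(step_num):
        x_history.append(x.copy())   # record the path for plotting

        grad = numerical_gradient(f, x)
        x -= lr * grad               # step against the gradient

    return x, np.array(x_history)
```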
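As a small usage sketch, gradient descent can be applied to $f(x_0, x_1) = x_0^2 + x_1^2$, whose minimum is at $(0, 0)$. The function name `function_2`, the starting point, and the learning rate are assumptions for illustration; the block restates compact versions of the two functions so that it runs on its own.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Central-difference gradient for a 1-D float array x.
    grad = np.zeros_like(x)
    for idx in range(x.size):
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)
        x[idx] = tmp - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp   # restore the original value
    return grad

def gradient_descent(f, init_x, lr=0.1, step_num=100):
    x = init_x.copy()  # copy so the caller's array is not modified
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def function_2(x):
    return x[0] ** 2 + x[1] ** 2

x_min = gradient_descent(function_2, np.array([-3.0, 4.0]))
print(x_min)   # both components end up very close to 0
```

With `lr=0.1` each step scales the current point by about 0.8 for this function, so 100 steps bring it very close to the minimum. A learning rate that is much too large makes the iterates diverge, and one that is much too small leaves them almost unchanged after the same number of steps.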

4.4.2 Gradients of a neural network

> Neural network learning also needs gradients; here the gradient means the gradient of the loss function with respect to the weight parameters. For example, for a neural network with a weight $\textbf{W}$ of shape $2\times 3$ and a loss function $L$, the gradient can be written $\frac{\partial L}{\partial \textbf{W}}$. In mathematical notation:

$$ \begin{aligned} \textbf{W} &= \begin{pmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{pmatrix} \\ \frac{\partial L}{\partial \textbf{W}} &= \begin{pmatrix} \frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{13}} \\ \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{23}} \end{pmatrix} \end{aligned} \tag{4.5} $$

Next we use a simple neural network as an example and implement the code that computes the gradient. For this we implement a class named `simpleNet`:

```python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2, 3)  # initialize with a Gaussian distribution

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)
        return loss
```

Let's try it out:

```python
>>> net = simpleNet()
>>> print(net.W)
[[0.47355232 0.9977393  0.84668094]
 [0.85557411 0.03563661 0.69422093]]
>>> x = np.array([0.6, 0.9])
>>> p = net.predict(x)
>>> print(p)
[1.05414809 0.63071653 1.1328071 ]
>>> np.argmax(p)
2
>>> t = np.array([0, 0, 1])
>>> net.loss(x, t)
0.92806853663411326
```

Next, compute the gradient. (The parameter `W` of the function `f(W)` defined here is a dummy argument: `numerical_gradient(f, x)` calls `f(x)` internally, so `f(W)` is defined to be compatible with that.)

```python
>>> def f(W):
...     return net.loss(x, t)
...
>>> dW = numerical_gradient(f, net.W)
>>> print(dW)
[[ 0.21924763  0.14356247 -0.36281009]
 [ 0.32887144  0.2153437  -0.54421514]]
```

The result of `numerical_gradient(f, net.W)` is `dW`, a two-dimensional array of shape $2\times 3$. Looking at its contents, $\frac{\partial L}{\partial w_{11}}$ in $\frac{\partial L}{\partial \textbf{W}}$ is about 0.2, which means that increasing $w_{11}$ by $h$ increases the loss by about $0.2h$. Likewise, $\frac{\partial L}{\partial w_{23}}$ is about -0.5, which means that increasing $w_{23}$ by $h$ decreases the loss by about $0.5h$. From the viewpoint of reducing the loss, $w_{23}$ should therefore be updated in the positive direction and $w_{11}$ in the negative direction. As for the size of the update, $w_{23}$ contributes more than $w_{11}$. The code above uses the form `def f(x): ...`; for a simple function like this, Python also allows a `lambda` expression:

```python
>>> f = lambda w: net.loss(x, t)
>>> dW = numerical_gradient(f, net.W)
```

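4.5 Implementing the learning algorithm

The steps of neural network learning are as follows.

> **Premise**
> A neural network has adjustable weights and biases, and the process of adjusting them so that they fit the training data is called "learning". Neural network learning consists of the following 4 steps.
>
> **Step 1 (mini-batch)**
> Randomly select a subset of the training data. This subset is called a mini-batch. The goal is to reduce the value of the loss function on the mini-batch.
>
> **Step 2 (compute the gradient)**
> To reduce the loss on the mini-batch, compute the gradient with respect to each weight parameter. The negative gradient gives the direction in which the loss decreases the most.
>
> **Step 3 (update the parameters)**
> Update the weight parameters by a small amount in the direction that reduces the loss (against the gradient).
>
> **Step 4 (repeat)**
> Repeat steps 1, 2, and 3.

Next we implement a neural network for handwritten digit recognition. It is a 2-layer neural network (a network with one hidden layer) and it learns from the **MNIST** dataset.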

4.5.1 A class for a 2-layer neural network

First we implement the 2-layer neural network as a class named **TwoLayerNet**:

```python
# ch04/two_layer_net.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient

class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']

        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)

        return y

    # x: input data, t: supervised data
    def loss(self, x, t):
        y = self.predict(x)

        return cross_entropy_error(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # x: input data, t: supervised data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)

        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])

        return grads

    def gradient(self, x, t):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {}

        batch_num = x.shape[0]

        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)

        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)

        da1 = np.dot(dy, W2.T)
        dz1 = sigmoid_grad(a1) * da1
        grads['W1'] = np.dot(x.T, dz1)
        grads['b1'] = np.sum(dz1, axis=0)

        return grads
```

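Table 4-1: Variables used in the TwoLayerNet class

| Variable | Description |
| --- | --- |
| `params` | Dictionary (instance variable) that holds the parameters of the neural network. `params['W1']` is the weight of layer 1 and `params['b1']` is the bias of layer 1. `params['W2']` is the weight of layer 2 and `params['b2']` is the bias of layer 2. |
| `grads` | Dictionary that holds the gradients (the return value of the `numerical_gradient()` method). `grads['W1']` is the gradient of the layer 1 weight and `grads['b1']` is the gradient of the layer 1 bias. `grads['W2']` is the gradient of the layer 2 weight and `grads['b2']` is the gradient of the layer 2 bias. |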

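Table 4-2: Methods of the TwoLayerNet class

| Method | Description |
| --- | --- |
| `__init__(self, input_size, hidden_size, output_size)` | Performs initialization. The arguments are, in order, the number of neurons in the input layer, the hidden layer, and the output layer. |
| `predict(self, x)` | Performs recognition (inference). The argument `x` is the image data. |
| `loss(self, x, t)` | Computes the value of the loss function. The argument `x` is the image data and `t` is the correct label (the same holds for the three methods below). |
| `accuracy(self, x, t)` | Computes the recognition accuracy. |
| `numerical_gradient(self, x, t)` | Computes the gradient with respect to the weight parameters. |
| `gradient(self, x, t)` | Computes the gradient with respect to the weight parameters. A fast version of `numerical_gradient()`, implemented in the next chapter. |

> `numerical_gradient(self, x, t)` computes the gradient of the parameters by numerical differentiation. The next chapter introduces a fast way to compute the gradient, called **error backpropagation**.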

4.5.2 Implementing mini-batch learning

> The learning procedure uses the **mini-batch** learning described earlier: a subset of the training data is selected at random, and the parameters are updated with the gradient method on that subset.

```python
# ch04/train_neuralnet.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 10000  # number of iterations, set as appropriate
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []

for i in range(iters_num):
    # Pick a random mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradient
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the learning progress
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)
```

> The parameters are updated here with stochastic gradient descent (SGD).

4.5.3 Evaluation on test data

> The results of the previous code confirm that repeated learning gradually reduces the value of the loss function. However, that loss is measured on mini-batches of the training data, and it does not guarantee similar results on other datasets. In neural network learning we must confirm that the network can correctly recognize data other than the training data, that is, confirm whether overfitting occurs. To evaluate the generalization ability of the network, we must therefore use data that is not included in the training data.

> An **epoch** is a unit: one epoch is the number of updates after which all of the training data has been used once. For example, with 10000 training samples and a mini-batch size of 100, repeating stochastic gradient descent 100 times means that all of the training data has been "seen". In this case, 100 iterations make one epoch. In practice, the usual approach is to shuffle all the training data, then generate mini-batches in order with the specified batch size, so that every mini-batch has an index. Iterating over the mini-batches by index covers all the data, and one such pass is one epoch. The code that also computes the accuracy for every epoch is as follows:

```python
# ch04/train_neuralnet.py
import sys, os
sys.path.append(os.pardir)  # allow imports from the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 10000  # number of iterations, set as appropriate
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []

# Number of iterations per epoch
iter_per_epoch = max(train_size // batch_size, 1)

for i in range(iters_num):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradient
    #grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Compute the accuracy once per epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print("train acc, test acc | " + str(train_acc) + ", " + str(test_acc))

# Plot the results
markers = {'train': 'o', 'test': 's'}
x = np.arange(len(train_acc_list))
plt.plot(x, train_acc_list, label='train acc')
plt.plot(x, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
```

> In the example above, the recognition accuracy on all the training data and all the test data is computed and recorded once per **epoch**.

Figure 4-1: Recognition accuracy on the training data and the test data over the epochs.

![image](https://note.youdao.com/yws/api/personal/file/WEB3224d7af00f5b5013d61a281a2dcc505?method=download&shareKey=79aed883779fbff8cf125df70a68ca7f)

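Chapter 5: Error Backpropagation

> In this chapter we study a method that computes the gradient of the weight parameters efficiently: **error backpropagation**.

5.1 Computational graphs

> A computational graph lets us concentrate on **local computations**, and it also makes it possible to **compute derivatives efficiently through backpropagation**.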

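5.2 The chain rule

> Backpropagation passes local derivatives in the direction opposite to the forward direction (from right to left). The principle behind passing these derivatives is the **chain rule**.

5.2.1 Backpropagation on a computational graph

Suppose there is a computation $y=f(x)$. Its backpropagation is shown in Figure 5-1:

![image](https://note.youdao.com/yws/api/personal/file/WEB1f5b58c0021a720cc0e67a7addf7bb09?method=download&shareKey=7794a280aaabc41de9d6be07aae5d5d7)

> As the figure shows, backpropagation multiplies the signal $E$ by the local derivative of the node, $\frac{\partial y}{\partial x}$, and passes the result to the next node. The **local derivative** here is the derivative of the forward computation $y=f(x)$, that is, the derivative of $y$ with respect to $x$, $\frac{\partial y}{\partial x}$.

> The chain rule is a property of the derivatives of **composite functions**. It is defined as follows: if a function is expressed as a composite function, its derivative can be written as the product of the derivatives of the functions that make up the composite. Take $z=(x+y)^2$, which is composed of $z=t^2$ and $t = x+y$:

$$ \frac{\partial z}{\partial x}=\frac{\partial z}{\partial t}\frac{\partial t}{\partial x} \tag{5.1} $$

Using the chain rule, the derivative $\frac{\partial z}{\partial x}$ in equation (5.1) is computed as follows:

$$ \begin{aligned} \frac{\partial z}{\partial t} &= 2t \\ \frac{\partial t}{\partial x} &= 1 \\ \frac{\partial z}{\partial x} &= \frac{\partial z}{\partial t}\frac{\partial t}{\partial x} =2t \cdot 1 = 2(x+y) \end{aligned} \tag{5.2} $$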
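The result of equation (5.2) can be checked numerically. The sketch below computes $\frac{\partial z}{\partial x}$ for $z=(x+y)^2$ once with the chain rule and once with a central difference; the function names and the evaluation point are assumptions chosen for illustration.

```python
def z_of(x, y):
    t = x + y        # inner function t = x + y
    return t ** 2    # outer function z = t**2

def dz_dx_chain(x, y):
    t = x + y
    dz_dt = 2 * t    # local derivative of the squaring node
    dt_dx = 1.0      # local derivative of the addition node
    return dz_dt * dt_dx

def dz_dx_numeric(x, y, h=1e-4):
    # Central difference with respect to x only
    return (z_of(x + h, y) - z_of(x - h, y)) / (2 * h)

x, y = 1.5, 2.0
analytic = dz_dx_chain(x, y)     # 2 * (1.5 + 2.0)
numeric = dz_dx_numeric(x, y)
print(analytic, numeric)
```

The two values agree up to rounding error, which is the same kind of comparison a gradient check performs between backpropagation and numerical differentiation.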

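5.2.2 The chain rule and computational graphs

The `**2` node represents the squaring operation.

![image](https://note.youdao.com/yws/api/personal/file/WEB08da59982f58d7c513bdbf32ea31cbc9?method=download&shareKey=1af04d96db21691533a1478b5ac8f172)

> Backpropagation first multiplies the input signal of a node by the local derivative (partial derivative) of that node and then passes the result to the next node. For example, during backpropagation the input to the `**2` node is $\frac{\partial z}{\partial z}$. It is multiplied by the local derivative $\frac{\partial z}{\partial t}$ (the forward input is $t$ and the forward output is $z$, so the local derivative of this node is $\frac{\partial z}{\partial t}$) and then passed to the next node. Substituting the result of equation (5.2) into Figure 5.2 gives Figure 5.3:

![image](https://note.youdao.com/yws/api/personal/file/WEBd4a7bbd3e45d636f8944ed91bbcdf72a?method=download&shareKey=8213045def8a222cfbc22b7ca72b9453)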

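5.3 Backpropagation

> This section uses operations such as $+$ and $\times$ as examples to explain how backpropagation works.

5.3.1 Backpropagation through an addition node

For an addition node, consider the backpropagation of $z=x+y$:

$$ \begin{aligned} \frac{\partial z}{\partial x} &= 1 \\ \frac{\partial z}{\partial y} &= 1 \end{aligned} \tag{5.3} $$

Both derivatives are equal to 1.

> Backpropagation through an addition node therefore simply passes the upstream signal unchanged to the next node.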

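5.3.2 Backpropagation through a multiplication node

Taking $z=xy$ as an example, the derivatives are

$$ \begin{aligned} \frac{\partial z}{\partial x} &= y \\ \frac{\partial z}{\partial y} &= x \end{aligned} \tag{5.4} $$

> Backpropagation through a multiplication node multiplies the upstream value by the "**flipped value**" of the forward input and passes the result downstream. The flipped value expresses a swap: if the forward signal on a branch is $x$, the factor used in backpropagation on that branch is $y$, and vice versa.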

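5.4 Implementing simple layers

> We call the multiplication node of the computational graph the "**multiplication layer**" (`MulLayer`) and the addition node the "**addition layer**" (`AddLayer`). In the next section, the "layers" that make up a neural network are each implemented as a class. A "layer" here is a functional unit of a neural network. For example, **Sigmoid**, which handles the `Sigmoid` function, and **Affine**, which handles the matrix product, are implemented as layers.

5.4.1 Implementing the multiplication and addition layers

> Every layer implements two common methods (its interface), `forward()` and `backward()`. `forward()` corresponds to forward propagation and `backward()` to backpropagation. The multiplication layer is implemented as the `MulLayer` class:

```python
# ch05/layer_naive.py
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        # Keep the inputs: they are needed in backward()
        self.x = x
        self.y = y
        out = x * y

        return out

    def backward(self, dout):
        dx = dout * self.y  # swap x and y
        dy = dout * self.x

        return dx, dy
```

The addition layer is implemented as follows:

```python
class AddLayer:
    def __init__(self):
        pass  # no state is needed

    def forward(self, x, y):
        out = x + y
        return out

    def backward(self, dout):
        # Pass the upstream derivative through unchanged
        dx = dout * 1
        dy = dout * 1
        return dx, dy
```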
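A brief usage sketch: two multiplication layers are chained to compute a total price including tax, and `backward()` is then called in the reverse order to obtain the derivative of the price with respect to each input. The quantities (unit price 100, count 2, tax factor 1.1) and the variable names are assumed values for illustration; `MulLayer` is restated compactly so that the block runs on its own.

```python
class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x, self.y = x, y      # remember the inputs for backward()
        return x * y

    def backward(self, dout):
        return dout * self.y, dout * self.x   # swapped inputs

# Assumed example values
apple, apple_num, tax = 100, 2, 1.1

mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# Forward pass: (apple * apple_num) * tax
apple_price = mul_apple_layer.forward(apple, apple_num)
price = mul_tax_layer.forward(apple_price, tax)

# Backward pass: call backward() in the reverse order of forward()
dprice = 1
dapple_price, dtax = mul_tax_layer.backward(dprice)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)

print(price, dapple, dapple_num, dtax)
```

Each derivative states how much the final price changes when the corresponding input increases by one unit, which is why the layers must be traversed in reverse order with each layer receiving the output of the layer after it.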

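5.5 Implementing activation layers

> In this section we apply the idea of computational graphs to neural networks. Each layer that makes up a neural network is implemented as a class.

5.5.1 The ReLU layer

The activation function ReLU (Rectified Linear Unit) is given by

$$ y=\begin{cases} x & (x>0) \\ 0 & (x \leq 0) \end{cases} \tag{5.5} $$

From equation (5.5), the derivative of $y$ with respect to $x$ is given by equation (5.6):

$$ \frac{\partial y}{\partial x} = \begin{cases} 1 & (x > 0) \\ 0 & (x \leq 0) \end{cases} \tag{5.6} $$

According to equation (5.6), if the forward input $x$ is greater than 0, backpropagation passes the upstream value downstream unchanged. Conversely, if the forward $x$ is less than or equal to 0, the signal passed downstream stops there. Now we implement the **ReLU** layer. In the implementation of neural network layers, the arguments of `forward()` and `backward()` are generally assumed to be NumPy arrays.

```python
# common/layers.py
class Relu:
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)   # True where the input is <= 0
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        dout[self.mask] = 0    # block the gradient where the input was <= 0 (modifies dout in place)
        dx = dout

        return dx
```

The `Relu` class has an instance variable `mask`. It is a NumPy array of `True`/`False` values that is `True` where the element of the forward input `x` is less than or equal to 0 and `False` elsewhere (where the element is greater than 0). `out[self.mask] = 0` sets the positions where `mask` is `True` to 0 and leaves the rest unchanged.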

5.5.2 The Sigmoid layer

$$ y=\frac{1}{1+\exp(-x)} \tag{5.7} $$

Forward and backward paths compared:

> $$ \begin{matrix} x &\rightarrow& -x &\rightarrow& \exp(-x) &\rightarrow& 1+\exp(-x) &\rightarrow& y \\ \frac{\partial L}{\partial y} y^2 \exp(-x) &\leftarrow& -\frac{\partial L}{\partial y} y^2 \exp(-x) &\leftarrow& -\frac{\partial L}{\partial y} y^2 &\leftarrow& -\frac{\partial L}{\partial y} y^2 &\leftarrow& \frac{\partial L}{\partial y} \end{matrix} $$

After simplification:

> $$ \begin{matrix} x &\rightarrow& y \\ \frac{\partial L}{\partial y} y^2 \exp(-x) &\leftarrow& \frac{\partial L}{\partial y} \end{matrix} $$

The final result can be tidied up further:

$$ \begin{aligned} \frac{\partial L}{\partial y}y^2\exp(-x) &= \frac{\partial L}{\partial y}\frac{1}{(1+\exp(-x))^2}\exp(-x) \\ &= \frac{\partial L}{\partial y}\frac{1}{1+\exp(-x)}\frac{\exp(-x)}{1+\exp(-x)} \\ &= \frac{\partial L}{\partial y}y(1-y) \end{aligned} \tag{5.8} $$

The final result is therefore

> $$ \begin{matrix} x &\rightarrow& y \\ \frac{\partial L}{\partial y}y(1-y) &\leftarrow& \frac{\partial L}{\partial y} \end{matrix} $$

The implementation is as follows:

```python
# common/layers.py
from common.functions import sigmoid

class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = sigmoid(x)
        self.out = out   # keep the output: backward() only needs y
        return out

    def backward(self, dout):
        dx = dout * (1.0 - self.out) * self.out

        return dx
```
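Equation (5.8) states that the local derivative of the sigmoid is $y(1-y)$. The sketch below compares that expression with a central difference at a single point; the evaluation point $x = 0.5$ is an arbitrary choice for illustration, and only the standard library is used.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.5
y = sigmoid(x)

# Analytic derivative from equation (5.8), with upstream gradient dL/dy = 1
analytic = y * (1.0 - y)

# Central difference approximation of dy/dx
h = 1e-4
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(analytic, numeric)
```

Because the derivative depends only on the output $y$, the `Sigmoid` layer needs to store only its forward output, not its input.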

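5.6 Implementing the Affine and Softmax layers

5.6.1 The Affine layer

> The matrix product performed in the forward propagation of a neural network is called an "**affine transformation**" in geometry. The layer that performs the affine transformation is therefore called the "**Affine layer**".

For `np.dot(X, W) + B`, that is $\textbf{Y} = \textbf{X} \cdot \textbf{W} + \textbf{B}$, the backpropagation takes the form

$$ \begin{aligned} \frac{\partial L}{\partial \textbf{X}} &= \frac{\partial L}{\partial \textbf{Y}} \cdot \textbf{W}^T \\ \frac{\partial L}{\partial \textbf{W}} &= \textbf{X}^T \cdot \frac{\partial L}{\partial \textbf{Y}} \end{aligned} \tag{5.9} $$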

5.6.2 The batch version of the Affine layer

> Now consider propagating N samples forward together, that is, the batch version of the Affine layer. Suppose $X$ has shape $(N,2)$ and $W$ has shape $(2,3)$. Then

$$ \begin{matrix} \frac{\partial L}{\partial \textbf{X}} &=& \frac{\partial L}{\partial \textbf{Y}} \cdot \textbf{W}^T \quad &((N,2)=(N,3)*(3,2)) \\ \frac{\partial L}{\partial \textbf{W}} &=& \textbf{X}^T \cdot \frac{\partial L}{\partial \textbf{Y}} \quad &((2,3)=(2,N)*(N,3)) \\ \frac{\partial L}{\partial \textbf{B}} &=& \sum_{n=1}^{N}\left(\frac{\partial L}{\partial \textbf{Y}}\right)_{n} \quad &((3)\leftarrow(N,3)\ \text{summed over the first axis}) \end{matrix} $$

The difference from before is that the input $X$ now has shape $(N,2)$. In forward propagation, the bias is added to each sample of $X\cdot W$. A concrete example:

```python
>>> x_dot_w = np.array([[0, 0, 0], [10, 10, 10]])
>>> B = np.array([1, 2, 3])
>>> x_dot_w
array([[ 0,  0,  0],
       [10, 10, 10]])
>>> x_dot_w + B
array([[ 1,  2,  3],
       [11, 12, 13]])
```

In forward propagation the bias is added to every sample. In backpropagation, the backpropagated values of all samples therefore have to be summed into the elements of the bias.

```python
>>> dY = np.array([[1, 2, 3], [4, 5, 6]])
>>> dY
array([[1, 2, 3],
       [4, 5, 6]])
>>> dB = np.sum(dY, axis=0)
>>> dB
array([5, 7, 9])
```

This example assumes two samples $(N=2)$. The backpropagation of the bias sums the derivatives of these 2 samples element by element, which is why `np.sum()` is used along axis 0 (the sample axis, `axis=0`). Putting it all together, the Affine layer is implemented as follows:

```python
# common/layers.py
import numpy as np

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b

        self.x = None
        self.original_x_shape = None
        # Derivatives of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        # Support tensor inputs
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x

        out = np.dot(self.x, self.W) + self.b

        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        dx = dx.reshape(*self.original_x_shape)  # restore the input shape (tensor support)
        return dx
```

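5.6.3 The Softmax-with-Loss layer

> As mentioned earlier, the softmax function normalizes its inputs before outputting them. The processing in a neural network has two phases, **inference** and **learning**. Inference usually does not use the Softmax layer. The unnormalized outputs of a neural network are sometimes called "scores". When inference only has to give a single answer, only the largest score matters, so the Softmax layer is not needed. Below we implement the Softmax layer. Because the implementation also includes the cross-entropy error that serves as the loss function, it is called the "Softmax-with-Loss layer". The computational graph is shown below.

![image](https://note.youdao.com/yws/api/personal/file/WEB8a917a03fdd7c96f77feb2656bc907ac?method=download&shareKey=a9cffe116ee763758667f22b3eaa1461)

> The result of backpropagation is the "clean" expression $(y_1-t_1, y_2-t_2,y_3-t_3)$. Since $(y_1,y_2,y_3)$ is the output of the Softmax layer and $(t_1,t_2,t_3)$ is the supervised data, $(y_1-t_1, y_2-t_2,y_3-t_3)$ is the difference between the output of the Softmax layer and the supervised labels. Backpropagation passes the error expressed by this difference to the preceding layers, which is an important property of neural network learning. The **goal** of learning is to adjust the weight parameters so that the output of the network approaches the supervised labels, so the error between the output and the labels must be passed efficiently to the preceding layers. The cross-entropy error is designed so that, combined with softmax, it yields this "clean" result. The implementation is as follows:

```python
# common/layers.py
import numpy as np
from common.functions import softmax, cross_entropy_error

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None  # output of softmax
        self.t = None  # supervised data

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        if self.t.size == self.y.size:  # supervised data is a one-hot vector
            dx = (self.y - self.t) / batch_size
        else:                           # supervised data holds class indices
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size

        return dx
```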
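The "clean" gradient $(y - t)$ can be verified against numerical differentiation. The sketch below is self-contained and uses its own small `softmax` and `loss_fn` helpers rather than the repository's `common.functions`; the helper names, the random inputs, and the omission of the small `1e-7` constant inside the logarithm are assumptions made for this illustration.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the overflow guard from equation (3.8)
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def loss_fn(x, t):
    # Mean cross-entropy over the batch for one-hot labels t
    y = softmax(x)
    return -np.sum(t * np.log(y)) / x.shape[0]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))      # scores for 2 samples and 3 classes
t = np.eye(3)[[2, 0]]                # one-hot labels

# Analytic gradient from backpropagation: (y - t) / batch_size
analytic = (softmax(x) - t) / x.shape[0]

# Numerical gradient with central differences over every element of x
numeric = np.zeros_like(x)
h = 1e-4
for idx in np.ndindex(x.shape):
    orig = x[idx]
    x[idx] = orig + h
    loss_plus = loss_fn(x, t)
    x[idx] = orig - h
    loss_minus = loss_fn(x, t)
    x[idx] = orig                    # restore the original value
    numeric[idx] = (loss_plus - loss_minus) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

A very small maximum difference indicates that the analytic expression and the numerical gradient agree, which is the idea behind a gradient check.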

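5.7 Implementing Backpropagation

> A neural network can be built by assembling the layers implemented in the previous section, much like assembling LEGO bricks. In this section we build a network from the layers we already have.

5.7.1 The Overall Picture of Neural Network Learning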

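> The steps of neural network learning are as follows.
>
> **Premise**: A neural network has adjustable weights and biases. Adjusting them so that the network fits the training data is called learning. Learning proceeds in the following four steps.
>
> **Step 1** (mini-batch): Randomly select a subset of the training data.
>
> **Step 2** (compute gradients): Compute the gradient of the loss function with respect to each weight parameter.
>
> **Step 3** (update parameters): Update the weight parameters by a small step in the direction opposite to the gradient, so that the loss decreases.
>
> **Step 4** (repeat): Repeat steps 1, 2 and 3.
>
> Backpropagation appears in step 2, where it computes the gradients quickly and efficiently.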
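The four steps can be sketched on a problem much smaller than MNIST. The example below is an assumption-laden toy, not the book's code: it fits a one-variable linear model `w * x + b` to noise-free data generated from `2x + 1`, with a hand-derived gradient of the mean squared error standing in for backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: t = 2x + 1 (hypothetical, noise-free)
x_train = rng.uniform(-1.0, 1.0, size=200)
t_train = 2.0 * x_train + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1
batch_size = 20

for _ in range(500):                                  # step 4: repeat
    idx = rng.choice(x_train.size, batch_size)        # step 1: mini-batch
    x, t = x_train[idx], t_train[idx]

    err = (w * x + b) - t                             # step 2: gradients of
    grad_w = 2.0 * np.mean(err * x)                   #   mean((w*x + b - t)^2)
    grad_b = 2.0 * np.mean(err)

    w -= learning_rate * grad_w                       # step 3: update
    b -= learning_rate * grad_b

print(w, b)
```

The structure of the loop is the same as in the MNIST training script later in this chapter; only the model and the way the gradient is obtained differ.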

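5.7.2 Implementing a Neural Network That Supports Backpropagation

> Now let's implement the network. We implement the 2-layer neural network as the class TwoLayerNet. Its instance variables and methods are summarized in Table 5-1 and Table 5-2.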

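Table 5-1 Variables used in the TwoLayerNet class

| Variable | Description |
| --- | --- |
| `params` | Dictionary holding the network's parameters. `params['W1']` is the weight of layer 1 and `params['b1']` its bias. `params['W2']` is the weight of layer 2 and `params['b2']` its bias. |
| `layers` | Ordered dictionary holding the network's layers, stored in order as `layers['Affine1']`, `layers['Relu1']`, `layers['Affine2']`. |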

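Table 5-2 Methods of the TwoLayerNet class

| Method | Description |
| --- | --- |
| `__init__(self, input_size, hidden_size, output_size, weight_init_std)` | Initializes the network. The arguments are, in order, the number of neurons in the input layer, the hidden layer and the output layer, and the scale of the Gaussian distribution used to initialize the weights. |
| `predict(self, x)` | Performs recognition (inference). `x` is the image data. |
| `loss(self, x, t)` | Computes the value of the loss function. `x` is the image data and `t` the correct labels. |
| `accuracy(self, x, t)` | Computes the recognition accuracy. |
| `numerical_gradient(self, x, t)` | Computes the gradients of the weight parameters by numerical differentiation (same as in the previous chapter). |
| `gradient(self, x, t)` | Computes the gradients of the weight parameters by backpropagation. |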

This implementation has much in common with the learning algorithm of Section 4.5. The main difference is that layers are used here. With layers, both obtaining the recognition result (`predict()`) and computing the gradients (`gradient()`) reduce to passing data from one layer to the next. The code of `TwoLayerNet` is as follows.

```python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
import numpy as np
from common.layers import *
from common.gradient import numerical_gradient
from collections import OrderedDict


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Initialize the weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # Create the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'], self.params['b2'])

        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    # x: input data, t: supervised data
    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        if t.ndim != 1:
            t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # x: input data, t: supervised data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)

        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])

        return grads

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients stored in the Affine layers
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads
```

Note that `OrderedDict` is an ordered dictionary: "ordered" means that it remembers the order in which elements were added. The forward pass therefore only needs to call each layer's `forward()` method in insertion order, and the backward pass calls `backward()` in the reverse order (the Affine and ReLU layers handle forward and backward propagation correctly inside). Implementing the components of a network as layers makes it easy to build one. To build a larger network, you only add the necessary layers, as if assembling LEGO bricks.
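To make the call order concrete, here is a small standalone sketch. The `Scale` layer and the `log` list are hypothetical helpers invented for this illustration; they only mimic the `forward()`/`backward()` interface of the real layers.

```python
from collections import OrderedDict


class Scale:
    """Hypothetical toy layer: multiplies its input by a constant k."""

    def __init__(self, name, k, log):
        self.name = name
        self.k = k
        self.log = log  # shared list recording the call order

    def forward(self, x):
        self.log.append("forward:" + self.name)
        return x * self.k

    def backward(self, dout):
        self.log.append("backward:" + self.name)
        return dout * self.k  # d(k*x)/dx = k


log = []
layers = OrderedDict()
layers["Scale1"] = Scale("Scale1", 2.0, log)
layers["Scale2"] = Scale("Scale2", 5.0, log)

# Forward pass: insertion order
x = 3.0
for layer in layers.values():
    x = layer.forward(x)

# Backward pass: reverse order
dout = 1.0
for layer in reversed(list(layers.values())):
    dout = layer.backward(dout)

print(x, dout)
print(log)
```

The log shows `Scale1` then `Scale2` on the way forward and `Scale2` then `Scale1` on the way back, which is the same pattern `predict()` and `gradient()` rely on.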

5.7.3 Gradient Check for Backpropagation

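So far we have learned two ways to compute gradients: one based on numerical differentiation, the other solving the mathematical expressions analytically. The latter, through backpropagation, computes gradients efficiently even when there are many parameters. Numerical differentiation is simple to implement and therefore unlikely to contain mistakes, whereas backpropagation is complex and error-prone. For this reason, the result of numerical differentiation is often compared with the result of backpropagation to confirm that the backpropagation implementation is correct. Checking that the two results agree (strictly speaking, that they are very close) is called a **gradient check**.

```python
#common/gradient_check.py
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory
import numpy as np
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

x_batch = x_train[:3]
t_batch = t_train[:3]

grad_numerical = network.numerical_gradient(x_batch, t_batch)
grad_backprop = network.gradient(x_batch, t_batch)

# Mean absolute difference between the two gradients, per parameter
for key in grad_numerical.keys():
    diff = np.average(np.abs(grad_backprop[key] - grad_numerical[key]))
    print(key + ":" + str(diff))
```

The difference between numerical differentiation and backpropagation is rarely exactly 0, because the computer's numerical precision is finite. Normally it is a very small value close to 0. A large value suggests a mistake in the backpropagation implementation.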
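How small is "close to 0" depends on the magnitude of the gradients themselves, so a relative error is sometimes used instead of the mean absolute difference. The sketch below is one possible way to do this, not the book's method: `relative_error`, the test function `f` and the deliberately wrong gradient `buggy` are all assumptions made for illustration.

```python
import numpy as np


def relative_error(a, b, eps=1e-12):
    """Largest element-wise |a - b| / (|a| + |b|); eps guards against 0/0."""
    return np.max(np.abs(a - b) / np.maximum(np.abs(a) + np.abs(b), eps))


def f(w):
    # Simple scalar function with a known gradient: d/dw sum(w^3) = 3 w^2
    return np.sum(w ** 3)


w = np.array([0.5, -1.0, 2.0])

# Numerical gradient by central differences
h = 1e-4
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = h
    numeric[i] = (f(w + step) - f(w - step)) / (2 * h)

correct = 3 * w ** 2   # right analytic gradient
buggy = 2 * w ** 2     # deliberately wrong, to show what a failure looks like

err_ok = relative_error(correct, numeric)
err_bad = relative_error(buggy, numeric)
print(err_ok, err_bad)
```

A correct gradient gives a relative error many orders of magnitude below that of the buggy one, which makes the pass/fail decision independent of the scale of the parameters.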

5.7.4 Learning with Backpropagation

```python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # allow importing files from the parent directory

import numpy as np
from dataset.mnist import load_mnist
from two_layer_net import TwoLayerNet

# Load the data
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

train_loss_list = []
train_acc_list = []
test_acc_list = []

iter_per_epoch = max(train_size / batch_size, 1)

for i in range(iters_num):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Gradients (backpropagation; the numerical version is kept for reference)
    # grad = network.numerical_gradient(x_batch, t_batch)
    grad = network.gradient(x_batch, t_batch)

    # Update the parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Once per epoch, record accuracy on the training and test data
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print(train_acc, test_acc)
```

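Chapter 6 Techniques Related to Learning

> This chapter introduces several important ideas in neural network learning: optimization methods for finding the **optimal weight parameters**, the **initial values of the weight parameters**, and **how to set hyperparameters**. To deal with overfitting, it also introduces and implements regularization methods such as **weight decay** and **Dropout**. Finally, it gives a brief introduction to **Batch Normalization**, which is used in much recent research.

Updating Parameters

> The goal of neural network learning is to find parameters that make the value of the loss function as small as possible. This is the problem of finding optimal parameters, and the process of solving it is called **optimization**. In the previous chapters, to find the optimal parameters we used the gradient of the loss with respect to the parameters as a clue: we updated the parameters along the direction indicated by the gradient and repeated this step many times, gradually approaching the optimal parameters. This procedure is called **stochastic gradient descent**, or **SGD** for short.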

SGD

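Expressed mathematically, SGD is:

$$ \textbf{W} \leftarrow \textbf{W} - \eta\frac{\partial L}{\partial \textbf{W}} \tag{6.1} $$

Here $\textbf{W}$ is the weight parameter to be updated, $\frac{\partial L}{\partial \textbf{W}}$ is the gradient of the loss function with respect to $\textbf{W}$, and $\eta$ is the learning rate. We implement SGD as a Python class: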