diff --git a/assignment-3/submission/18307130090/README.md b/assignment-3/submission/18307130090/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..081473e6a2acf22cfd066b199b9b43c4230a1d1d
--- /dev/null
+++ b/assignment-3/submission/18307130090/README.md
@@ -0,0 +1,233 @@
+# PRML-2021 Assignment3
+
+Name: 夏海淞
+
+Student ID: 18307130090
+
+## Overview
+
+In this assignment I implemented two clustering models, `K-Means` and `GMM`, with `NumPy`, and on top of them built algorithms that use `BIC` and the `Gap Statistic` to determine the number of clusters in a dataset automatically before clustering it. To verify the correctness and effectiveness of the models and algorithms, I ran experiments on several two-dimensional datasets and compared the running time and clustering quality of the two models. I also tried a few optimizations of the algorithms, with good results.
+
+## Model implementation
+
+### `K-Means`
+
+At construction time the `K-Means` model randomly picks one point per cluster to serve as the cluster centers; every subsequent iteration consists of two steps:
+
+- compute the distance from every sample point to each cluster center, and assign each sample point to the cluster of the nearest center;
+- recompute each cluster center as the mean of the sample points assigned to that cluster.
+
+Evidently `K-Means` predicts a sample's cluster from its distance to the cluster centers, and the centers are also the model's parameters. The initial choice of centers therefore has a large influence on the final result. The figure below shows how a poorly chosen initialization makes the model converge to a local optimum instead of the global one.
+
+![](./img/myplot.png)
+
+This calls for an initialization strategy that makes such local optima less likely. The `K-Means++` algorithm[^Arthur D 2006] proposes a randomized procedure for picking the centers:
+
+- pick one sample point at random as the first center;
+- when picking the next center, make sample points farther away from the current centers more likely to be chosen;
+- repeat until all centers have been chosen.
+
+Since computing those selection probabilities gets cumbersome when there are many sample points, I did not copy the method verbatim. Following the same intuition that "centers should lie far apart", I instead draw several candidate center sets and keep the one with the largest average pairwise distance:
+
+```python
+def choose_centers(train_data, k, rand_num):  # generate the initial set of cluster centers
+    rng = np.random.default_rng()
+    centers = rng.choice(train_data, size=k, axis=0, replace=False)  # draw k sample points without replacement as centers
+    if k == 1: return centers
+    dist = dis_avg(centers)  # average pairwise distance
+    for i in range(rand_num - 1):  # keep the candidate set with the largest average pairwise distance
+        new_centers = rng.choice(train_data, size=k, axis=0, replace=False)
+        new_dist = dis_avg(new_centers)
+        if new_dist < dist: continue
+        dist = new_dist
+        centers = new_centers
+    return centers
+```
+
+### `GMM`
+
+The Gaussian mixture model `GMM` estimates its parameters with the `Expectation-Maximization` algorithm.
+
+Let $\mathbf x$ denote a sample point, $z$ the Gaussian component a sample belongs to, and $\pi,\mu,\sigma$ the mixing probability, mean and covariance of each component. Each iteration consists of:
+
+- `Expectation` step: with the parameters $\pi,\mu,\sigma$ fixed, compute the posterior $\gamma_{nk}=p(z^{(n)}=k|x^{(n)})$;
+
+- `Maximization` step: with $\gamma_{nk}$ fixed, update the parameters $\pi,\mu,\sigma$:
+  $$
+  N_k=\sum_{n=1}^N\gamma_{nk}\\\\
+  \pi_k=\frac{N_k}{N}\\\\
+  \mu_k=\frac1{N_k}\sum_{n=1}^N\gamma_{nk}x^{(n)}\\\\
+  \sigma_k=\frac1{N_k}\sum_{n=1}^N\gamma_{nk}(x^{(n)}-\mu_k)(x^{(n)}-\mu_k)^T
+  $$
+
+Comparing `GMM` with `K-Means`, the two have a lot in common. In each iteration `K-Means` first assigns every sample point to exactly one cluster; the posterior computed by `GMM` plays the same role, quantifying the probability that a point belongs to each Gaussian component and thereby "assigning" every point in a probabilistic way. `K-Means` then recomputes the cluster centers from the assignment, while `GMM` re-estimates the parameters of every Gaussian component from the posterior. The two share the same idea and differ only in how they realize it: the former partitions by distance, the latter by probability.
+
+While building `GMM` I ran into two problems:
+
+- Initially, when a sample lies far from every Gaussian component, all joint probabilities tend towards 0 and the posterior cannot be computed correctly because of floating-point precision. Since this only affects isolated points, I set their posterior to the uniform distribution (see the sketch below).
+- On the `GMM-2` test case the estimated covariance matrix was singular, so the Gaussian density could not be evaluated. Some research showed that this happens when the sample size is too small relative to the dimensionality, and it can also be viewed as a form of overfitting. The remedies are (1) more samples or (2) fewer dimensions. Since dimensionality reduction felt too heavyweight, I simply replicated the samples with added noise to make up for the shortage of data.
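+To make the first fix concrete, here is a minimal, self-contained sketch of the E-step posterior computation with the uniform fallback; it mirrors `compute_post_prob` in `source.py`, with the einsum calls replaced by plain broadcasting:
+
+```python
+import numpy as np
+
+def posterior(norm_pdf, pi):
+    # E-step: p(z = k | x) for an (N, K) matrix of per-component densities
+    joint = norm_pdf * pi                            # p(x, z = k); (N, K) * (K,) broadcasts
+    post = joint / joint.sum(axis=1, keepdims=True)  # Bayes rule: joint / marginal
+    # samples far from every component underflow to 0/0; give those rows a uniform posterior
+    post[~np.isfinite(post).all(axis=1)] = 1.0 / norm_pdf.shape[1]
+    return post
+```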
+## Basic experiments
+
+### Gaussian distributions
+
+#### Dataset parameters
+
+$$
+\mu_x=\begin{bmatrix}0&0\end{bmatrix},\mu_y=\begin{bmatrix}-20&20\end{bmatrix},\mu_z=\begin{bmatrix}30&30\end{bmatrix},\mu_w=\begin{bmatrix}-30&-30\end{bmatrix},\mu_t=\begin{bmatrix}30&-30\end{bmatrix}\\\\
+\Sigma_x=\begin{bmatrix}70&0\\\\0&70\end{bmatrix},\Sigma_y=\Sigma_z=\Sigma_w=\Sigma_t=\begin{bmatrix}10&0\\\\0&10\end{bmatrix}\\\\
+|x|=3500,|y|=|z|=|w|=|t|=750
+$$
+
+![](./img/basic.png)
+
+Clustering with `K-Means` and `GMM` and plotting the resulting clusters:
+
+#### `K-Means`
+
+![](./img/basic_kmeans.png)
+
+The yellow points are the cluster centers.
+
+#### `GMM`
+
+![](./img/basic_gmm.png)
+
+The yellow points are the means of the Gaussian components.
+
+---
+
+On this mixture of Gaussians, `K-Means`, which assigns clusters by distance, makes sizable errors near the boundaries between the Gaussian components, whereas `GMM`, which assigns clusters by the posterior probability under each Gaussian, performs well.
+
+### Other distributions
+
+Since the dataset above was generated from multivariate Gaussians, the good performance of `GMM`, which assigns clusters by Gaussian posteriors, has a whiff of "shooting the arrow first and then drawing the target". I therefore generated random datasets from other distributions as well:
+
+A rectangular dataset and a triangular dataset drawn from uniform distributions:
+
+![](./img/basic_matrix.png)
+
+![](./img/basic_triangle.png)
+
+Clustering with `K-Means` and `GMM` and plotting the resulting clusters:
+
+#### `K-Means`
+
+![](./img/basic_matrix_kmeans.png)
+
+![](./img/basic_triangle_kmeans.png)
+
+The yellow points are the cluster centers.
+
+#### `GMM`
+
+![](./img/basic_matrix_gmm.png)
+
+![](./img/basic_triangle_gmm.png)
+
+The yellow points are the means of the Gaussian components.
+
+Even in these non-Gaussian cases, `GMM` still produces the better clustering on the irregular distributions, which suggests that its clustering ability is the stronger of the two.
+
+## Experiments on choosing the number of clusters automatically
+
+### `Gap Statistic`
+
+The teaching assistants mentioned in the assignment handout that the `Elbow Method` can be combined with `K-Means` to select the number of clusters automatically, so I studied that approach first.
+
+The main goal of clustering is to put similar objects into the same cluster, i.e. to reduce the average within-cluster distance. Let $W_k$ be the sum of the average within-cluster distances when there are $k$ clusters. The curve of $W_k$ is monotonically decreasing, and past a certain inflection point its slope flattens out noticeably, so the curve looks like an elbow; the `Elbow Method` takes that inflection point, the "elbow", as the best number of clusters.
+
+Finding the elbow, however, is not easy. When the $W_k$ curve is rather flat from the start, as in the figure below, it is hard to quantify what counts as the "elbow". Searching the literature, I found that the `Gap Statistic` method[^Tibshirani R 2001] turns the `Elbow Method` into a quantitative criterion.
+
+![](./img/w.png)
+
+The idea of the `Gap Statistic` is to compare the within-cluster distance of the clustered samples against the expected within-cluster distance obtained by clustering points sampled uniformly from the same region; the difference between the two is called the $Gap$. The $Gap$ measures how far the within-cluster distance deviates from the null hypothesis, and the method takes the number of clusters that maximizes the $Gap$ to be the best one.
+
+Concretely, the `Gap Statistic` method repeatedly draws the same number of points uniformly from the bounding rectangle of the samples, clusters them, and computes the expected within-cluster distance:
+
+$$
+Gap_k=E(\log W_k^{(b)})-\log W_k\\\\
+E(\log W_k^{(b)})=\frac{1}{B}\sum_{b=1}^B\log W_k^{(b)}
+$$
+
+The method finally picks the smallest $k$ satisfying $Gap_k\ge Gap_{k+1}-s_{k+1}$, where $s_{k}=\sqrt{1+1/B}\cdot\text{std}(\log W_k^{(b)})$.
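+The selection rule itself fits in a few lines. Below is a minimal sketch, where `gap` and `s` are assumed to be lists whose index $i$ holds $Gap_{i+1}$ and $s_{i+1}$; `KMeansPlus.fit` in `source.py` applies the same test incrementally inside its search loop so that it can stop early:
+
+```python
+def select_k(gap, s):
+    # gap[i] and s[i] hold Gap_k and s_k for k = i + 1
+    for i in range(len(gap) - 1):
+        if gap[i] >= gap[i + 1] - s[i + 1]:
+            return i + 1  # the smallest k with Gap_k >= Gap_{k+1} - s_{k+1}
+    return len(gap)      # no k qualified; fall back to the largest k examined
+```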
+Plotting the curves of $E(\log W_k^{(b)})$, $\log W_k$ and $Gap$ for the figure above:
+
+![](./img/expect_w.png)
+
+![](./img/gap.png)
+
+It is easy to see that $k=3$ is the best number of clusters here.
+
+Applying the `K-Means` model with the `Gap Statistic` method to the triangle dataset yields the prediction below:
+
+![](./img/auto_kmeans.png)
+
+Although the `Gap Statistic` reports a wrong number of clusters here, the clustering still looks reasonable from a human point of view, and the reported number is close to the true one.
+
+### Optimizing the within-cluster distance computation
+
+Although the `Gap Statistic` succeeds in quantifying the "elbow", it has to sample repeatedly at run time and compute within-cluster distances after every clustering. When the number of clusters is small, computing the within-cluster distances costs close to $O(n^2)$ time, where $n$ is the sample size, so it is slow when there are many points. This is worth optimizing.
+
+First, I learned that the within-cluster distance computation can be rewritten with `numpy.einsum` for speed[^kmeans算法的性能改进,优化,距离,计算 (pythonf.cn)]. `numpy.einsum` evaluates tensor expressions written in the Einstein summation convention and is both very flexible and very fast. I had previously computed within-cluster distances with `numpy.linalg.norm`; after rewriting the computation with `numpy.einsum`, the running time of this part dropped to about 10% of the original.
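+As an illustration, the pairwise squared distances reduce to two `einsum` calls and one matrix product. This is a condensed sketch of the `dis` function in `source.py`:
+
+```python
+import numpy as np
+
+def pairwise_sq_dist(x, y):
+    # ||x_i - y_j||^2 = ||x_i||^2 + ||y_j||^2 - 2 <x_i, y_j>
+    xx = np.einsum('ij,ij->i', x, x)[:, None]  # squared row norms of x, as a column
+    yy = np.einsum('ij,ij->i', y, y)[None, :]  # squared row norms of y, as a row
+    return np.abs(xx + yy - 2.0 * (x @ y.T))   # abs() guards against tiny negative rounding errors
+```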
+The speed-up from `numpy.einsum` is still limited, however, so the next step is to reduce the amount of computation itself. When a cluster contains many samples it is fairly dense, so the distance can be computed from a random subset of its points, and after a correction the result comes close to the value computed from all points. In practice, clusters with at least 500 samples are subsampled at a rate of 50% before the within-cluster distance is computed. In theory this optimization further cuts the computation time down to 25% of the original.
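+A minimal sketch of this subsampling, reusing `dis_avg` (and hence `dis`) from `source.py`; the function and parameter names here are mine. Since `dis_avg` sums the pairwise distances and divides by twice the number of points, its value grows roughly linearly with the cluster size, so dividing the subsampled estimate by the sampling rate rescales it back to the full cluster:
+
+```python
+import numpy as np
+
+def cluster_w(cluster, sample_rate=0.5, threshold=500):
+    # estimate the within-cluster distance of a large cluster from a random subset
+    if cluster.shape[0] >= threshold:
+        rng = np.random.default_rng()
+        m = int(cluster.shape[0] * sample_rate)
+        sample = rng.choice(cluster, size=m, axis=0, replace=False)
+        return dis_avg(sample) / sample_rate  # rescale to approximate the full value
+    return dis_avg(cluster)
+```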
+### Combining the `Gap Statistic` with `BIC`
+
+`GMM` has an explicit optimization objective, the log marginal likelihood $\sum_{n=1}^N\log p(x^{(n)})$. The literature suggests that, when `GMM` is used to determine the number of clusters automatically, `BIC` (the Bayesian Information Criterion) can serve as the selection criterion. `BIC` brings the complexity of the model into play in order to penalize overly complex models; the formula is $BIC=k\log n-2\log L$, where $k,n,L$ are the number of model parameters, the number of samples and the maximized likelihood.
+
+When I used `BIC` to score the number of clusters, it turned out that `GMM` has too few parameters here for the penalty to bite: instead of the expected "U" shape, the `BIC` curve again takes the shape of an elbow, as shown below. I therefore combined `BIC` with the `Gap Statistic`, using the `Gap Statistic` machinery to locate the inflection point of the `BIC` curve and thus the best number of clusters.
+
+![](./img/BIC.png)
+
+Applying the `GMM` with the `Gap Statistic` method to the triangle dataset yields the prediction below:
+
+![](./img/auto_gmm.png)
+
+This time the `Gap Statistic` method helps `GMM` find the correct number of clusters.
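+For reference, the `BIC` computed at the end of `GaussianMixture.fit` boils down to the sketch below. Note that the implementation uses the component count directly as the complexity term $k$; `gmm_param_count` (a helper name of my own, not in `source.py`) shows the full parameter count of a $k$-component, $d$-dimensional GMM, which would penalize complexity more strongly:
+
+```python
+import numpy as np
+
+def bic(log_like, k, n):
+    # BIC = k * log(n) - 2 * log(L)
+    return k * np.log(n) - 2 * log_like
+
+def gmm_param_count(k, d):
+    # (k - 1) mixing weights + k*d means + k*d*(d+1)/2 covariance entries
+    return (k - 1) + k * d + k * d * (d + 1) // 2
+```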
+### Performance optimization of `GMM`
+
+I tried to optimize `get_norm_pdf`, the Gaussian density function that `GMM` calls most frequently:
+
+```python
+def get_norm_pdf(x, mean, cov):
+    n = cov.shape[0]
+    denominator = (np.sqrt(2 * np.pi) ** n) * np.sqrt(np.linalg.det(cov))
+    tmp = x - mean
+    numerator = tmp @ np.linalg.inv(cov) @ tmp.T
+    return np.exp(numerator / (-2)) / denominator
+```
+
+Profiling it with PyCharm showed that most of the time went into computing the determinant and the inverse of the covariance matrix. Since the covariance matrices do not change within a single iteration, their determinants and inverses can be precomputed whenever the covariances are updated, avoiding the repeated work:
+
+```python
+def get_norm_pdf(x, mean, cov_inv, denominator):  # Gaussian density with precomputed terms for better performance
+    tmp = x - mean
+    numerator = tmp @ cov_inv @ tmp.T
+    return np.exp(numerator / (-2)) / denominator
+
+# precompute the inverse covariances to speed up the density evaluation
+self.cov_inv = [np.linalg.inv(single_cov) for single_cov in self.cov]
+
+# precompute the covariance determinants to speed up the density evaluation
+cov_det = [np.abs(np.linalg.det(single_cov)) for single_cov in self.cov]
+
+# precomputed denominator of the Gaussian density
+self.denominator = (np.sqrt(2 * np.pi) ** self.dim) * np.sqrt(cov_det)
+```
+
+## References
+
+[^Arthur D 2006]: [Arthur D, Vassilvitskii S. K-Means++: The advantages of careful seeding[R]. Stanford, 2006.](http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf)
+
+[^Tibshirani R 2001]: [Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001, 63(2): 411-423.](http://web.stanford.edu/~hastie/Papers/gap.pdf)

+[^kmeans算法的性能改进,优化,距离,计算 (pythonf.cn)]: https://www.pythonf.cn/read/109964
diff --git a/assignment-3/submission/18307130090/img/BIC.png b/assignment-3/submission/18307130090/img/BIC.png
new file mode 100644
index 0000000000000000000000000000000000000000..0cd557907ae7642d18149817f570d638bfd8c9f9
Binary files /dev/null and b/assignment-3/submission/18307130090/img/BIC.png differ
diff --git a/assignment-3/submission/18307130090/img/auto_gmm.png b/assignment-3/submission/18307130090/img/auto_gmm.png
new file mode 100644
index 0000000000000000000000000000000000000000..94bb8cfbfafed2e8bf484c4266ff1e303f93f0d2
Binary files /dev/null and b/assignment-3/submission/18307130090/img/auto_gmm.png differ
diff --git a/assignment-3/submission/18307130090/img/auto_kmeans.png b/assignment-3/submission/18307130090/img/auto_kmeans.png
new file mode 100644
index 0000000000000000000000000000000000000000..05034a150a09475f801459cf717d3686478861d9
Binary files /dev/null and b/assignment-3/submission/18307130090/img/auto_kmeans.png differ
diff --git a/assignment-3/submission/18307130090/img/basic.png b/assignment-3/submission/18307130090/img/basic.png
new file mode 100644
index 0000000000000000000000000000000000000000..8d4f4aa1c967277d5e5c8189ab9eaf301846ee54
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_gmm.png b/assignment-3/submission/18307130090/img/basic_gmm.png
new file mode 100644
index 0000000000000000000000000000000000000000..8cc25517e78447a477d149a02ad57ce163e4d0ad
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_gmm.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_kmeans.png b/assignment-3/submission/18307130090/img/basic_kmeans.png
new file mode 100644
index 0000000000000000000000000000000000000000..4836daa5fbea4905b04537929902280b87d53b09
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_kmeans.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_matrix.png b/assignment-3/submission/18307130090/img/basic_matrix.png
new file mode 100644
index 0000000000000000000000000000000000000000..3b48e69f417ec5a8277603500ab75b3a98e581a0
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_matrix.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_matrix_gmm.png b/assignment-3/submission/18307130090/img/basic_matrix_gmm.png
new file mode 100644
index 0000000000000000000000000000000000000000..e1f8301fcd213ca4cb4de0b062d4a9f7075e3e93
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_matrix_gmm.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_matrix_kmeans.png b/assignment-3/submission/18307130090/img/basic_matrix_kmeans.png
new file mode 100644
index 0000000000000000000000000000000000000000..319f94835d3a36ef150a1fab479c29317359e12f
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_matrix_kmeans.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_triangle.png b/assignment-3/submission/18307130090/img/basic_triangle.png
new file mode 100644
index 0000000000000000000000000000000000000000..30337a41f763feeeea8353a8446e4ff9c54e3ac5
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_triangle.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_triangle_gmm.png b/assignment-3/submission/18307130090/img/basic_triangle_gmm.png
new file mode 100644
index 0000000000000000000000000000000000000000..1aaa5cee2bac21b32089198e6eb7f284394bb34d
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_triangle_gmm.png differ
diff --git a/assignment-3/submission/18307130090/img/basic_triangle_kmeans.png b/assignment-3/submission/18307130090/img/basic_triangle_kmeans.png
new file mode 100644
index 0000000000000000000000000000000000000000..d55094b504ceb03a5490b4fa4c8afad51a178487
Binary files /dev/null and b/assignment-3/submission/18307130090/img/basic_triangle_kmeans.png differ
diff --git a/assignment-3/submission/18307130090/img/expect_w.png b/assignment-3/submission/18307130090/img/expect_w.png
new file mode 100644
index 0000000000000000000000000000000000000000..119c3a27efd2ca6d7ee017dac88e7804f9569678
Binary files /dev/null and b/assignment-3/submission/18307130090/img/expect_w.png differ
diff --git a/assignment-3/submission/18307130090/img/false_gap.png b/assignment-3/submission/18307130090/img/false_gap.png
new file mode 100644
index 0000000000000000000000000000000000000000..569f1d156a7b6a8147ef9444047e06efe2e9a1a2
Binary files /dev/null and b/assignment-3/submission/18307130090/img/false_gap.png differ
diff --git a/assignment-3/submission/18307130090/img/gap.png b/assignment-3/submission/18307130090/img/gap.png
new file mode 100644
index 0000000000000000000000000000000000000000..2fbe513de9583e08bb4a72701f7c7f1ca6fd76ad
Binary files /dev/null and b/assignment-3/submission/18307130090/img/gap.png differ
diff --git a/assignment-3/submission/18307130090/img/myplot-1.png b/assignment-3/submission/18307130090/img/myplot-1.png
new file mode 100644
index 0000000000000000000000000000000000000000..7ee762921d219f1ec668809b5477c6a4019d1234
Binary files /dev/null and b/assignment-3/submission/18307130090/img/myplot-1.png differ
diff --git a/assignment-3/submission/18307130090/img/myplot.png b/assignment-3/submission/18307130090/img/myplot.png
new file mode 100644
index 0000000000000000000000000000000000000000..8136b211c4abcbb4aa7b0d944af38a9608ce2517
Binary files /dev/null and b/assignment-3/submission/18307130090/img/myplot.png differ
diff --git a/assignment-3/submission/18307130090/img/scatter.png b/assignment-3/submission/18307130090/img/scatter.png
new file mode 100644
index 0000000000000000000000000000000000000000..121b56c5b1fef2ba06ab03c42980fe271bbdf5a6
Binary files /dev/null and b/assignment-3/submission/18307130090/img/scatter.png differ
diff --git a/assignment-3/submission/18307130090/img/w.png b/assignment-3/submission/18307130090/img/w.png
new file mode 100644
index 0000000000000000000000000000000000000000..72c196ee13b0149a14e5a15bd69e500c46336e47
Binary files /dev/null and b/assignment-3/submission/18307130090/img/w.png differ
diff --git a/assignment-3/submission/18307130090/source.py b/assignment-3/submission/18307130090/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..7a75f52342d8cec7f156024e06b933375d3d4cc6
--- /dev/null
+++ b/assignment-3/submission/18307130090/source.py
@@ -0,0 +1,419 @@
+import matplotlib.pyplot as plt
+import numpy as np
+
+colors = ['red', 'sienna', 'tan', 'olivedrab', 'chartreuse', 'lightseagreen', 'skyblue', 'royalblue', 'lightcoral',
+          'darkorange', 'darkviolet', 'fuchsia', 'black', 'grey']  # palette for the cluster scatter plots
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int) * i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def get_norm_pdf(x, mean, cov_inv, denominator):  # multivariate Gaussian density; cov_inv and denominator are precomputed to speed things up
+    tmp = x - mean
+    numerator = tmp @ cov_inv @ tmp.T
+    return np.exp(numerator / (-2)) / denominator
+
+
+def choose_centers(train_data, k, rand_num):  # generate the initial set of cluster centers
+    rng = np.random.default_rng()
+    centers = rng.choice(train_data, size=k, axis=0, replace=False)  # draw k sample points without replacement as centers
+    if k == 1: return centers
+    dist = dis_avg(centers)  # average pairwise distance
+    for i in range(rand_num - 1):  # keep the candidate set with the largest average pairwise distance
+        new_centers = rng.choice(train_data, size=k, axis=0, replace=False)
+        new_dist = dis_avg(new_centers)
+        if new_dist < dist: continue
+        dist = new_dist
+        centers = new_centers
+    return centers
+
+
+def dis_avg(dots):  # average within-cluster distance
+    n = dots.shape[0]
+    ans = np.einsum('ij->', dis(dots, dots, False))  # equivalent to summing the full distance matrix
+    # ans = sum(np.linalg.norm(dots[i] - dots[j], ord=2) for i in range(n) for j in range(i + 1, n))
+    return ans / (2 * n)
+
+
+def dis(x, y, square=True):  # pairwise distances between point sets x and y via numpy.einsum, adapted from https://www.pythonf.cn/read/109964
+    xx = np.einsum('ij,ij->i', x, x)[:, np.newaxis]  # squared row norms of x, as a column
+    yy = np.einsum('ij,ij->i', y, y)[np.newaxis, :]  # squared row norms of y, as a row
+    dot = np.dot(x, y.T)  # inner products between the rows of x and y
+    res = np.abs(xx + yy - 2 * dot)  # (x-y)^2 = x^2 + y^2 - 2xy; abs() guards against negatives caused by rounding
+    return res if square else np.sqrt(res + 1e-12)
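+
+
+# Example (hypothetical shapes): for x of shape (n, d) and y of shape (m, d),
+# dis(x, y) returns an (n, m) matrix whose (i, j) entry is ||x_i - y_j||^2;
+# pass square=False to get Euclidean distances (regularized by 1e-12) instead.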
+
+
+def my_scatter(np_data, color):  # scatter plot of one cluster
+    if np_data.shape[0] == 0: return
+    data = np.transpose(np_data).tolist()
+    plt.scatter(data[0], data[1], c=colors[color] if color is not None else 'yellow')
+
+
+def draw_clusters(clusters, centers):  # scatter plot of all clusters
+    k = len(clusters)
+    for i in range(k):
+        my_scatter(clusters[i], i)
+    my_scatter(centers, None)
+    plt.show()
+
+
+def draw_list(l):  # plot a metric curve (e.g. BIC or gap) against the number of clusters
+    n = len(l)
+    x = list(range(1, n + 1))
+    plt.plot(x, l)
+    plt.show()
+
+
+def my_inner(x, mu):  # outer product (x - mu)(x - mu)^T used in the GMM covariance update
+    tmp = x - mu
+    ans = np.einsum('i,j->ij', tmp, tmp)
+    return ans
+
+
+def get_clusters(dot2center, data, k):  # build the point set of every cluster from the sample-to-cluster assignment
+    n = data.shape[0]
+    clusters = [[] for _ in range(k)]
+    for i in range(n):
+        clusters[dot2center[i]].append(data[i])
+    for i in range(k):
+        clusters[i] = np.array(clusters[i])
+    return clusters
+
+
+class KMeansCore:  # shared core of KMeans and KMeansPlus (the K-Means variant that chooses the number of clusters automatically)
+    def __init__(self, n_clusters, gap=False):  # gap=False: skip the Gap metric (used by KMeans); gap=True: compute it (used by KMeansPlus)
+        self.k = n_clusters  # number of clusters
+        self.is_gap = gap
+        self.rand_num = 100  # number of random draws when choosing the initial centers
+        self.n = None  # number of samples
+        self.centers = None  # cluster centers
+
+    def kmeans_iter(self, train_data):  # a single K-Means iteration
+        dot2center = np.argmin(dis(train_data, self.centers), axis=1)  # assign each sample to the nearest center
+        clusters = get_clusters(dot2center, train_data, self.k)  # build the point set of every cluster
+        return clusters, np.array(dot2center, dtype=int)
+
+    def fit(self, train_data, sample_rate=None):
+        self.n = train_data.shape[0]
+        self.centers = choose_centers(train_data, self.k, self.rand_num)  # randomly choose the initial centers
+        clusters, dot2center = self.kmeans_iter(train_data)
+        while True:
+            self.centers = np.array([np.sum(cluster, axis=0) / cluster.shape[0] for cluster in clusters])  # recompute the centers
+            clusters, new_dot2center = self.kmeans_iter(train_data)  # reassign the samples to clusters
+            if (dot2center == new_dot2center).all(): break  # stop once the assignment no longer changes between iterations
+            dot2center = new_dot2center
+        if not self.is_gap: return None
+        if sample_rate is None:  # without a sampling rate, return the sum of within-cluster distances directly
+            return sum([dis_avg(cluster) for cluster in clusters])
+        rng = np.random.default_rng()  # sampler
+        ans = 0
+        for i in range(self.k):
+            cluster_size = int(clusters[i].shape[0] * sample_rate)
+            if cluster_size >= 250:  # subsample only clusters whose original size is at least 500
+                clusters[i] = rng.choice(clusters[i], size=cluster_size, axis=0, replace=False)
+                ans += dis_avg(clusters[i]) / sample_rate  # rescale to correct for the subsampling
+            else:
+                ans += dis_avg(clusters[i])
+        return ans
+
+    def predict(self, test_data):
+        assert self.centers is not None
+        return np.argmin(dis(test_data, self.centers), axis=1)  # assign each test point to the nearest center
+
+
+class KMeans:  # thin wrapper around KMeansCore
+
+    def __init__(self, n_clusters):
+        self.core = KMeansCore(n_clusters)
+
+    def fit(self, train_data):
+        self.core.fit(train_data)
+
+    def predict(self, test_data):
+        dot2center = self.core.predict(test_data)
+        # clusters = get_clusters(dot2center, test_data, self.core.k)
+        # draw_clusters(clusters, self.core.centers)
+        return dot2center
+
+
+class KMeansPlus:  # K-Means variant that determines the number of clusters automatically
+
+    def __init__(self):
+        self.core = None  # underlying KMeansCore object
+        self.k = None  # number of clusters
+        self.n = None  # number of samples
+        self.sample_rate = None  # sampling rate
+        self.sample_num = 15  # number of Monte Carlo reference datasets
+
+    def monte_carlo(self, data_min, data_max):  # expected within-cluster distance under the null hypothesis
+        d = data_min.shape[0]
+        W_list = []
+        for _ in range(self.sample_num):
+            print(f'Monte Carlo {_}')
+            sample_data = np.concatenate([np.random.uniform(data_min[i], data_max[i], (self.n, 1)) for i in range(d)],
+                                         axis=1)  # draw a uniform reference sample under the null hypothesis
+            W_list.append(self.core.fit(sample_data, self.sample_rate))
+        W = np.log(W_list)
+        return W.mean(), W.std() * np.sqrt(1 + 1 / self.sample_num)  # mean and corrected standard deviation of log W
+
+    def fit(self, train_data):
+        self.n = train_data.shape[0]
+        self.sample_rate = None if self.n <= 5000 else 0.5  # subsample when there are more than 5000 samples
+        data_min = train_data.min(axis=0, initial=np.inf)  # make the reference rectangle cover the sample space exactly
+        data_max = train_data.max(axis=0, initial=-np.inf)
+        max_k = min(max(10, int(np.sqrt(self.n / 2.0))), self.n)  # upper bound of the candidate cluster counts
+        # max_k = 8
+        print(f'max_k={max_k}')
+        pre_gap = None
+        gap_list = []
+        for self.k in range(2, max_k + 1):
+            print(f'k={self.k}')
+            self.core = KMeansCore(self.k, gap=True)
+            W = np.log(self.core.fit(train_data, self.sample_rate))  # within-cluster distance of the data
+            expec_W, s = self.monte_carlo(data_min, data_max)  # null expectation and corrected standard deviation
+            if pre_gap is not None and expec_W - W - s < pre_gap:  # Gap Statistic stopping criterion
+                self.k -= 1
+                break
+            pre_gap = expec_W - W
+            gap_list.append(pre_gap)
+        print(f'final k={self.k}')
+        self.core = KMeansCore(self.k)  # refit with the chosen number of clusters
+        self.core.fit(train_data)
+        # draw_list(gap_list)
+
+    def predict(self, test_data):
+        dot2center = self.core.predict(test_data)
+        # clusters = get_clusters(dot2center, test_data, self.k)
+        # draw_clusters(clusters, self.core.centers)
+        return dot2center
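+
+
+# Usage sketch: unlike KMeans, KMeansPlus takes no cluster count up front
+# (same interface as GaussianMixturePlus in the __main__ block below);
+# fit() scans k = 2..max_k and keeps the smallest k passing the gap criterion:
+#     model = KMeansPlus(); model.fit(data); labels = model.predict(data)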
+
+
+class GaussianMixture:  # basic GMM, also used by GaussianMixturePlus
+
+    def __init__(self, n_clusters):
+        self.k = n_clusters  # number of components
+        self.pi = np.full(self.k, 1.0 / self.k)  # mixing probabilities, uniform initially
+        self.rand_num = 100  # number of random draws when choosing the initial means
+        self.mean = None  # means of the Gaussian components
+        self.cov = None  # covariance matrices of the components
+        self.cov_inv = None  # inverse covariances, precomputed to speed up the density evaluation
+        self.cov_det = None  # covariance determinants, precomputed to speed up the density evaluation
+        self.denominator = None  # denominator of the Gaussian density, precomputed to speed up the density evaluation
+        self.n = None  # number of samples
+        self.dim = None  # sample dimensionality
+
+    def compute_log_like(self, norm_p):  # sum of the log marginal likelihoods
+        return np.einsum('i->', np.log(np.einsum('j,ij->i', self.pi, norm_p)))
+
+    def compute_norm_pdf(self, train_data):  # density of every sample under every Gaussian component
+        return np.array(
+            [[get_norm_pdf(x, self.mean[i], self.cov_inv[i], self.denominator[i]) for i in range(self.k)]
+             for x in train_data])
+
+    def compute_post_prob(self, norm_pdf):  # posterior distribution from the component densities
+        uni_prob = np.einsum('ij,j->ij', norm_pdf, self.pi)  # joint probability of sample and component
+        uni_prob_sum = np.reciprocal(np.einsum('ij->i', uni_prob))  # reciprocal of the per-sample marginal
+        post_prob = np.einsum('ij,i->ij', uni_prob, uni_prob_sum)  # Bayes rule: posterior = joint / marginal
+        post_prob[np.isnan(post_prob) | np.isinf(post_prob)] = 1.0 / self.k  # replace NaN/Inf caused by underflow with a uniform posterior
+        return post_prob
+
+    def GMM_iter(self, train_data):  # one EM iteration
+        norm_pdf = self.compute_norm_pdf(train_data)  # component densities
+        pre_log_like = self.compute_log_like(norm_pdf)  # log marginal likelihood before the update
+        post_prob = self.compute_post_prob(norm_pdf)  # E-step: posterior distribution
+        N = np.einsum('ij->j', post_prob)
+        N_recip = np.reciprocal(N)
+        self.pi = N / self.n  # M-step: update the mixing probabilities
+        x_minus_mean = np.array([[my_inner(x, mu) for mu in self.mean] for x in train_data])
+        self.cov = np.einsum('ijk,i->ijk', np.einsum('ij,ijkl->jkl', post_prob, x_minus_mean),
+                             N_recip)  # update the covariances from the posterior (using the current means)
+        self.cov_inv = [np.linalg.inv(single_cov) for single_cov in self.cov]  # precompute the inverse covariances
+        cov_det = [np.abs(np.linalg.det(single_cov)) for single_cov in self.cov]  # precompute the covariance determinants
+        self.denominator = (np.sqrt(2 * np.pi) ** self.dim) * np.sqrt(cov_det)  # precomputed density denominator
+        self.mean = np.einsum('ij,i->ij', np.einsum('ij,ik->jk', post_prob, train_data), N_recip)  # update the means from the posterior
+        norm_pdf = self.compute_norm_pdf(train_data)
+        log_like = self.compute_log_like(norm_pdf)  # log marginal likelihood after the update
+        return log_like - pre_log_like
+
+    def fit(self, train_data):
+        self.n = train_data.shape[0]
+        self.dim = train_data.shape[1]
+        if self.n < 8 * self.dim:  # oversample small datasets to avoid singular covariance matrices
+            train_data = np.concatenate([
+                train_data,
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim),
+                train_data + np.random.rand(self.n, self.dim)], axis=0)
+            self.n = train_data.shape[0]
+        self.mean = choose_centers(train_data, self.k, self.rand_num)  # random sample points as the initial means
+        self.cov = np.concatenate([np.eye(self.dim)[np.newaxis, :] for _ in range(self.k)], axis=0)  # identity matrices as the initial covariances
+        self.cov_inv = [np.linalg.inv(single_cov) for single_cov in self.cov]  # precompute the inverse covariances
+        cov_det = [np.abs(np.linalg.det(single_cov)) for single_cov in self.cov]  # precompute the covariance determinants
+        self.denominator = (np.sqrt(2 * np.pi) ** self.dim) * np.sqrt(cov_det)  # precomputed density denominator
+        gap = np.inf
+        cnt = 0
+        while gap > 0.1:  # convergence test on the log-likelihood gain
+            gap = self.GMM_iter(train_data)
+            cnt += 1
+            # print(f'\tcnt={cnt} gap={gap}')
+            if cnt == 100: break  # cap on the number of iterations
+        log_like = self.compute_log_like(self.compute_norm_pdf(train_data))
+        BIC = self.k * np.log(self.n) - 2 * log_like  # compute the BIC
+        return BIC
+
+    def predict(self, test_data):
+        post_prob = self.compute_post_prob(self.compute_norm_pdf(test_data))  # assign each test point to the component with the largest posterior
+        dot2center = np.argmax(post_prob, axis=1)
+        # clusters = get_clusters(dot2center, test_data, self.k)
+        # draw_clusters(clusters, self.mean)
+        return dot2center
+
+
+class GaussianMixturePlus:  # GMM variant that determines the number of clusters automatically
+    def __init__(self):
+        self.k = None  # number of components
+        self.n = None  # number of samples
+        self.core = None  # underlying GaussianMixture object
+        self.sample_num = 10  # number of Monte Carlo reference datasets
+
+    def monte_carlo(self, data_min, data_max):  # expected BIC under the null hypothesis
+        d = data_min.shape[0]
+        BIC_list = []
+        for _ in range(self.sample_num):
+            print(f'Monte Carlo {_}')
+            sample_data = np.concatenate([np.random.uniform(data_min[i], data_max[i], (self.n, 1)) for i in range(d)],
+                                         axis=1)  # draw a uniform reference sample under the null hypothesis
+            BIC_list.append(self.core.fit(sample_data))
+        BIC = np.array(BIC_list)
+        return BIC.mean(), BIC.std() * np.sqrt(1 + 1 / self.sample_num)  # mean and corrected standard deviation of the BIC
+
+    def fit(self, train_data):
+        self.n = train_data.shape[0]
+        data_min = train_data.min(axis=0, initial=np.inf)  # make the reference rectangle cover the sample space exactly
+        data_max = train_data.max(axis=0, initial=-np.inf)
+        max_k = min(max(10, int(np.sqrt(self.n / 2.0))), self.n)
+        # max_k = 8
+        print(f'max_k={max_k}')
+        BIC_list, gap_list = [], []
+        pre_gap = None
+        for self.k in range(2, max_k + 1):
+            print(f'k={self.k}')
+            self.core = GaussianMixture(self.k)
+            BIC = self.core.fit(train_data)  # compute the BIC
+            BIC_list.append(BIC)
+            expec_BIC, s = self.monte_carlo(data_min, data_max)  # null expectation and corrected standard deviation of the BIC
+            if pre_gap is not None and expec_BIC - BIC - s < pre_gap:  # Gap Statistic stopping criterion
+                self.k -= 1
+                break
+            pre_gap = expec_BIC - BIC
+            gap_list.append(pre_gap)
+            # if len(BIC_list) < 3: continue
+            # x, y, z = BIC_list[-3:]
+            # if abs(y - z) < 0.2 * abs(x - y):
+            #     self.k -= 1
+            #     self.core = GaussianMixture(self.k)
+            #     self.core.fit(train_data)
+            #     break
+        print(f'final k={self.k}')
+        self.core = GaussianMixture(self.k)  # refit with the chosen number of clusters
+        self.core.fit(train_data)
+        # draw_list(gap_list)
+
+    def predict(self, test_data):
+        dot2center = self.core.predict(test_data)
+        # clusters = get_clusters(dot2center, test_data, self.k)
+        # draw_clusters(clusters, self.core.mean)
+        return dot2center
+
+
+def generate_triangle(dot1, dot2, dot3, num):  # generate a triangular dataset
+    dot1 = np.array(dot1)
+    dot2 = np.array(dot2)
+    dot3 = np.array(dot3)
+    l = np.array([min(dot1[0], dot2[0], dot3[0]), min(dot1[1], dot2[1], dot3[1])])  # bounding box of the triangle
+    r = np.array([max(dot1[0], dot2[0], dot3[0]), max(dot1[1], dot2[1], dot3[1])])
+    cnt = 0
+    ans = []
+    while True:
+        dot = np.random.uniform(l, r, size=2)  # draw a candidate point uniformly
+
+        # use cross products to test whether the candidate lies inside the triangle
+        x, y, z = np.cross(dot - dot1, dot - dot2), np.cross(dot - dot2, dot - dot3), np.cross(dot - dot3, dot - dot1)
+        if (x < 0 and y < 0 and z < 0) or (x > 0 and y > 0 and z > 0):
+            cnt += 1
+            ans.append(dot + np.random.rand(2) * 0.75)
+            if cnt == num: break
+    return np.array(ans)
+
+
+def generate_data1():  # dataset 1
+    mean = (0, 0)
+    cov = np.array([[70, 0], [0, 70]])
+    x = np.random.multivariate_normal(mean, cov, (3500,))
+
+    mean = (-20, 20)
+    cov = np.array([[10, 0], [0, 10]])
+    y = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (30, 30)
+    cov = np.array([[10, 0], [0, 10]])
+    z = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (-30, -30)
+    cov = np.array([[10, 0], [0, 10]])
+    w = np.random.multivariate_normal(mean, cov, (750,))
+
+    mean = (30, -30)
+    cov = np.array([[10, 0], [0, 10]])
+    t = np.random.multivariate_normal(mean, cov, (750,))
+    data, _ = shuffle(x, y, z, w, t)
+    return data
+
+
+def generate_data2():  # dataset 2
+    mean = (-5, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+    data, _ = shuffle(x, y, z)
+    return data
+
+
+def generate_data3():  # dataset 3
+    x = generate_triangle([0, -1], [9, 2], [6, 3.5], 500)
+    y = generate_triangle([1, 8], [4, 7], [5, 12], 500)
+    z = generate_triangle([5, 5], [8, 12], [12, 6], 500)
+    w = generate_triangle([8.5, 7], [12, 1], [14, 7], 500)
+
+    data, _ = shuffle(x, y, z, w)
+    return data
+
+
+if __name__ == '__main__':
+    data = generate_data3()
+
+    model = GaussianMixturePlus()
+    model.fit(data)
+    model.predict(data)
diff --git a/assignment-3/submission/18307130090/tester_demo.py b/assignment-3/submission/18307130090/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fe62089209bbee87f0850904678b438ead6cb4c
--- /dev/null
+++ b/assignment-3/submission/18307130090/tester_demo.py
@@ -0,0 +1,108 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def data_1():
+    mean = (1, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+
+    data, _ = shuffle(x, y, z)
+    return (data, data), 3
+
+
+def data_2():
+    train_data = np.array([
+        [23, 12, 173, 2134],
+        [99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(
+        res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_2, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        try:
+            res = case[2] if case[1]() else 0
+        except Exception as e:
+            if err_report:
+                print("Error [{}] occurs in {}".format(str(e), case[0]))
+            res = 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()