diff --git a/assignment-3/submission/18307130341/README.md b/assignment-3/submission/18307130341/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a55f5a564d88ede737aa00c8ba0e95ab450eebd9
--- /dev/null
+++ b/assignment-3/submission/18307130341/README.md
@@ -0,0 +1,243 @@
+# Lab Report: ASS3-KMeans & GMM
+
+18307130341 黄韵澄
+
+[toc]
+
+### 1. Overview
+
+This assignment implements both a KMeans model and a GMM model which, given the number of clusters `n_clusters`, automatically cluster an input point set. In addition, the Elbow Method is combined with the KMeans algorithm to estimate the number of clusters in a dataset automatically before clustering it.
+
+### 2. The K-Means Method
+
+#### 2.1 Method outline
+
+Randomly initialize `K` `centers`; in each iteration, assign every data point to the **nearest** center (Euclidean distance), then recompute each cluster's **centroid** as its new center. Iterate until the algorithm converges.
+
+For robustness, every cluster should be non-empty. Initializing the centers with `K` mutually distinct data points guarantees this for the first assignment step (each chosen point is at distance zero from its own center, so that cluster contains at least that point); if a cluster later becomes empty, its old center is simply kept. Concretely, `train_data` is shuffled and the first `K` points are taken as the initial `centers`.
+
+#### 2.2 Evaluating KMeans and detecting convergence
+
+Three evaluation criteria were considered:
+
+- Within-cluster sum of squared errors (SSE): the total squared distance between samples and their cluster centers. It only analyzes the data within each single cluster; relations between clusters are not taken into account.
+
+$$
+SSE = \sum_{i=1}^{K} \sum_{p\in C_i}\|p -m_i\|^2
+$$
+
+where $C_i$ denotes the $i$-th cluster and $m_i$ its center.
+
+- Silhouette coefficient: combines the two factors of cohesion and separation.
+
+$$
+C(x): \text{the cluster containing } x \\\\
+a(x) = \frac{1}{|C(x)|-1}\cdot \sum_{p\in C(x),\, p\neq x}dis(x,p) \\\\
+b(x) = \min_{C \neq C(x)}\frac{1}{|C|}\sum_{p\in C}dis(x,p) \\\\
+S(x) = \frac{b(x)-a(x)}{\max\{a(x),b(x)\}} \\\\
+\text{silhouette score} = \text{mean of } S(x) \text{ over all samples}
+$$
+
+The silhouette score lies in $[-1,1]$; the closer it is to 1, the better both cohesion and separation are.
+
+- Adjusted Rand Index (ARI): a supervised criterion that compares the clustering result K against the ground-truth labeling C.
+
+$$
+RI = \frac{a+b}{C_2^{n_{samples}}} \\\\
+ARI = \frac{RI-E[RI]}{\max(RI)-E[RI]}
+$$
+
+Here $a$ is the number of pairs of elements that are in the same class in C and in the same cluster in K, and $b$ is the number of pairs that are in different classes in C and in different clusters in K; $C_2^{n_{samples}}$ is the total number of pairs. To compute the ARI, first build the contingency table whose entry $n_{ij}$ counts the data points that belong simultaneously to cluster $i$ of $K(Y)$ and class $j$ of $C(X)$; the ARI is then computed from that table.
+
+The three criteria were briefly tested on a two-dimensional three-class dataset:
+
+Fig1: SSE (mean) versus number of iterations
+
+Fig2: silhouette score versus number of iterations
+
+Fig3: ARI versus number of iterations
+
+All three criteria converge as the number of iterations grows. We can therefore fix a threshold `threshold` and declare the algorithm converged once the fluctuation of a criterion falls below it. Since SSE and ARI cost $O(N)$ per evaluation while the silhouette score costs $O(N^2)$, and since SSE and the silhouette score are unsupervised whereas ARI requires the true labels, the program uses SSE as the **convergence criterion** and ARI as the **performance metric**.
+
+#### 2.3 Experiments and an algorithmic improvement
+
+The program uses `numpy`'s Gaussian sampler to generate a three-class two-dimensional dataset for testing the KMeans model, with the following parameters:
+$$
+\mu_0=[1,2], \Sigma_0=\begin{bmatrix}73&0 \\\\ 0&22\end{bmatrix}, N_0 = 800\\\\
+\mu_1=[16,-5], \Sigma_1=\begin{bmatrix}21&0 \\\\ 0&32\end{bmatrix}, N_1 = 200\\\\
+\mu_2=[10,22], \Sigma_2=\begin{bmatrix}10&5 \\\\ 5&10\end{bmatrix}, N_2 = 1000\\\\
+$$
+The scatter plot of the data:
+
+Fig4: distribution of the generated data
+
+With the SSE fluctuation threshold set to $10^{-3}$, the results are:
+
+Fig5: KMeans clustering results over the iterations
+
+The algorithm converges quickly (it stops at the 4th iteration), with a final $SSE=19.24$. The plots confirm that the clusterings at epoch 3 and epoch 4 are already very close. However, since KMeans ignores the Gaussian structure of the data, the final clustering still deviates from the ground truth; this is an inherent limitation of the KMeans algorithm.
+
+Repeating the experiment several times occasionally produces results like the following:
+
+Fig6: a failure mode of the KMeans model
+
+Here the converged clustering is far from the expected one, with $SSE=34.71$. The main cause is that the **randomly initialized centers** happened to lie inside a single cluster, so KMeans splits that cluster into several pieces and the clustering goes wrong. My fix is to train the KMeans model several times and keep the model with the best (lowest) SSE as the final predictor.
+
+Finally, the KMeans algorithm was run 8 times:
+
+| Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+| -------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+| SSE | 19.66 | 19.44 | 20.18 | 19.91 | 20.10 | 19.29 | 20.05 | 20.19 |
+| ARI | 0.7788 | 0.7937 | 0.7736 | 0.7896 | 0.7913 | 0.7914 | 0.7908 | 0.7614 |
+
+Average $SSE=19.8525$, average $ARI=0.7838$; these serve as the baseline for the GMM method below.
+
+### 3. The GaussianMixture Method
+
+#### 3.1 Method outline
+
+(The formulas below are for the multivariate Gaussian; if the input data has `dim=1`, simply replace it with the univariate Gaussian.)
+
+The GMM algorithm represents each cluster with one of `K` Gaussian components. In the fit stage, the GMM learns the parameters $\pi_k,\mu_k,\Sigma_k$ from the `N` data points. In the predict stage, the label of a point $x$ is $\arg\max_k \pi_k N(x;\mu_k,\Sigma_k)$, i.e. the component with the largest posterior probability $p(z=k|x)$. ($\Sigma_k$ is the covariance matrix.)
+
+- Parameter initialization (two options):
+  - Random initialization: $\mu$ as in KMeans, i.e. shuffle the data and take the first `K` points as the initial means; $\Sigma$ initialized to identity matrices; $\pi$ initialized to the uniform distribution.
+  - KMeans initialization: preprocess the parameters $\pi, \mu, \Sigma$ with the result of a KMeans model, then run the GMM algorithm.
+
+- Fit stage, E step: with the parameters $\mu,\Sigma$ fixed, compute the posterior $p(z^{(n)}|x^{(n)})$:
+
+$$
+\gamma_{nk} =p(z^{(n)}=k|x^{(n)}) = \frac{\pi_kN(x^{(n)};\mu_k,\Sigma_k)}{\sum_{k=1}^K\pi_kN(x^{(n)};\mu_k,\Sigma_k)}
+$$
+
+- Fit stage, M step: maximize the evidence lower bound $ELBO(\gamma,D;\pi,\mu,\Sigma)$ subject to the constraint $\sum_{k=1}^K\pi_k = 1$:
+
+$$
+N_k = \sum_{n=1}^N\gamma_{nk}, \\\\
+\pi_k = \frac{N_k}{N},\\\\
+\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}x^{(n)},\\\\
+\Sigma_k =\frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}(x^{(n)}-\mu_k)(x^{(n)}-\mu_k)^T\\\\
+$$
+
+- 
Convergence test: the log marginal likelihood $\sum_{n=1}^{N}\log p(x^{(n)})$ is used as the convergence criterion, with a fluctuation threshold `threshold`; the algorithm stops once the fluctuation of the log marginal likelihood falls below that threshold.
+
+#### 3.2 Problems of the GMM model and their fixes
+
+Testing on data generated with the same parameters as in 2.3, using **random initialization**, gives:
+
+Fig7: GMM clustering results across six runs
+
+As with KMeans in 2.3, random initialization has a large impact on training quality. Possible fixes: train the model several times and keep the one with the largest log marginal likelihood, or initialize with a KMeans model. The program uses KMeans initialization, which gives the GMM model a consistently high clustering accuracy.
+
+#### 3.3 Experiments and evaluation
+
+Using data generated with the same parameters as in 2.3 and a log-marginal-likelihood fluctuation threshold of $10^{-5}$, the algorithm was run 8 times:
+
+| Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
+| -------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+| SSE | 19.78 | 20.05 | 19.88 | 19.65 | 20.05 | 19.29 | 19.77 | 19.89 |
+| ARI | 0.7778 | 0.7966 | 0.7812 | 0.7908 | 0.7910 | 0.7921 | 0.7908 | 0.7721 |
+
+Average $SSE=19.795$, average $ARI=0.78655$: slightly, but not markedly, better than KMeans in 2.3.
+
+To make GMM's advantage over KMeans clearly visible, a new dataset was generated with the following parameters:
+$$
+\mu_0=[12,18], \Sigma_0=\begin{bmatrix}1&0 \\\\ 0&1\end{bmatrix}, N_0 = 200\\\\
+\mu_1=[15,10], \Sigma_1=\begin{bmatrix}2&0 \\\\ 0&10\end{bmatrix}, N_1 = 800\\\\
+\mu_2=[18,18], \Sigma_2=\begin{bmatrix}1&0 \\\\ 0&1\end{bmatrix}, N_2 = 200\\\\
+$$
+The scatter plot of the data:
+
+Fig8: distribution of the generated data
+
+Clustering results of the two models:
+
+Fig9: KMeans vs. GMM clustering results
+
+KMeans reaches $SSE=3.1894, ARI=0.7269$; GMM reaches $SSE=3.8150, ARI=0.9696$. The GMM clearly exploits the **Gaussian structure** of the clusters: although its SSE is larger, its ARI is much closer to 1, i.e. the GMM clustering is closer to the true distribution and therefore more accurate than KMeans.
+
+### 4. 
Automatically Choosing the Number of Clusters
+
+#### 4.1 The Elbow-KMeans method
+
+The Elbow-KMeans method uses SSE as its criterion. For each $K$ in a given range (say 0 to 10), compute the corresponding SSE. SSE decreases as $K$ grows, but for data with reasonably separated clusters it improves sharply up to some critical point and only declines slowly afterwards; that critical point is a good candidate for the number of clusters. The curve looks like an elbow, as shown below:
+
+![](img/elbow.png)
+
+Fig10: an elbow curve
+
+Finding the **elbow** automatically: build a KMeans model for every $K$ in the range and compute its SSE. Let $GradSSE[k] = SSE[k-1]-SSE[k]$ and fix a threshold $threshold=0.1$; the first $k$ with $GradSSE[k]/SSE[0] < threshold$ marks the elbow, and $K=k-1$ is chosen as the number of clusters.
+
+Fig11: ClusteringAlgorithm model, SSE versus k
+
+#### 4.2 A possible alternative: the Calinski-Harabasz criterion
+
+The silhouette computation from 2.2 is rather slow; a simplified criterion can replace it. The Calinski-Harabasz (CH) index also combines within-cluster and between-cluster distances. It is computed as:
+$$
+s(k) = \frac{tr(B_k)}{tr(W_k)}\cdot\frac{N-K}{K-1},\\\\
+B_k = \sum_{k=1}^{K}n_k(c_k-c_E)(c_k-c_E)^T,\\\\
+W_k = \sum_{k=1}^{K}\sum_{x\in C_k}(x-c_k)(x-c_k)^T
+$$
+where $B_k$ is the between-cluster scatter matrix, $W_k$ the within-cluster scatter matrix, $tr$ the matrix trace, $c_k$ the center of cluster $k$, and $c_E$ the center of the whole dataset.
+
+In practice the CH score tends to grow as the number of clusters shrinks, peaking at $k=2$. When that happens, one has to scan the candidate $K$ values and look for a local maximum of the score; the $K$ at that local maximum is the best clustering. The CH index is left as a candidate method; it is not implemented in this assignment.
+
+### 5. References
+
+[1] [Calinski-Harabasz Score explained](https://blog.csdn.net/chloe_ou/article/details/105277431)
+
+[2] [Computing the Adjusted Rand Index (ARI)](https://blog.csdn.net/qq_42887760/article/details/105728101)
+
+[3] [Choosing K for the K-Means algorithm](https://www.codercto.com/a/26692.html)
+
+[4] [Evaluating and improving K-means performance](https://www.cnblogs.com/uestc-mm/p/10682633.html)
+
diff --git a/assignment-3/submission/18307130341/img/GMM_data_raw.png b/assignment-3/submission/18307130341/img/GMM_data_raw.png
new file mode 100644
index 0000000000000000000000000000000000000000..a89f9181c98479c71ce780db6d2a22519167c003
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_data_raw.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test0.png b/assignment-3/submission/18307130341/img/GMM_test0.png
new file mode 100644
index 
0000000000000000000000000000000000000000..b6b5e935a7ea9276edc210d74421eeb977e33c43 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test0.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test1.png b/assignment-3/submission/18307130341/img/GMM_test1.png new file mode 100644 index 0000000000000000000000000000000000000000..da71843e734b4e86476ec6401b7d725109fa4bdf Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test1.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test2.png b/assignment-3/submission/18307130341/img/GMM_test2.png new file mode 100644 index 0000000000000000000000000000000000000000..adad0d32c4babd08a3de53287d2f9714dde46e61 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test2.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test3.png b/assignment-3/submission/18307130341/img/GMM_test3.png new file mode 100644 index 0000000000000000000000000000000000000000..e4a5f2a6cf97da52c44f4e94e0f85df4bbedaa72 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test3.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test4.png b/assignment-3/submission/18307130341/img/GMM_test4.png new file mode 100644 index 0000000000000000000000000000000000000000..819d7d78edbbdfef3bbb5ac013a297319c4aa420 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test4.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test5.png b/assignment-3/submission/18307130341/img/GMM_test5.png new file mode 100644 index 0000000000000000000000000000000000000000..5befa6ae90d9d90946da3193bf69ff32e195cb7a Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test5.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test6.png b/assignment-3/submission/18307130341/img/GMM_test6.png new file mode 100644 index 
0000000000000000000000000000000000000000..93bdf18d49cf12803b7bd52ee78fd41a97e005a0 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test6.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_test7.png b/assignment-3/submission/18307130341/img/GMM_test7.png new file mode 100644 index 0000000000000000000000000000000000000000..dddc949e4b1a5951898dcfe1286438292183e248 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test7.png differ diff --git a/assignment-3/submission/18307130341/img/GMM_testx.png b/assignment-3/submission/18307130341/img/GMM_testx.png new file mode 100644 index 0000000000000000000000000000000000000000..556f2a9ddf9e4c777b054934dde8a4a4c94a4929 Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_testx.png differ diff --git a/assignment-3/submission/18307130341/img/cal_ari.png b/assignment-3/submission/18307130341/img/cal_ari.png new file mode 100644 index 0000000000000000000000000000000000000000..f18abc1fab2f5a31f460e12930d07579e587f06d Binary files /dev/null and b/assignment-3/submission/18307130341/img/cal_ari.png differ diff --git a/assignment-3/submission/18307130341/img/data_raw.png b/assignment-3/submission/18307130341/img/data_raw.png new file mode 100644 index 0000000000000000000000000000000000000000..64114d065fd29b65e9c4c4df422829f79dd5bb84 Binary files /dev/null and b/assignment-3/submission/18307130341/img/data_raw.png differ diff --git a/assignment-3/submission/18307130341/img/elbow.png b/assignment-3/submission/18307130341/img/elbow.png new file mode 100644 index 0000000000000000000000000000000000000000..3e21341af41d645cbd7e002af840aa92d22fe479 Binary files /dev/null and b/assignment-3/submission/18307130341/img/elbow.png differ diff --git a/assignment-3/submission/18307130341/img/elbow_test.png b/assignment-3/submission/18307130341/img/elbow_test.png new file mode 100644 index 
0000000000000000000000000000000000000000..346b60be9b9e675c0eaa195d722f7f1d9f23f7cf Binary files /dev/null and b/assignment-3/submission/18307130341/img/elbow_test.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_Silhouette.png b/assignment-3/submission/18307130341/img/kmeans_Silhouette.png new file mode 100644 index 0000000000000000000000000000000000000000..c1d615591572bc334802fa6cfcc84294aa1e7dc2 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_Silhouette.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_ari.png b/assignment-3/submission/18307130341/img/kmeans_ari.png new file mode 100644 index 0000000000000000000000000000000000000000..ae7ec1d2a57eb21d6110fd5b08d950668f0f1b02 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_ari.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch1.png b/assignment-3/submission/18307130341/img/kmeans_epoch1.png new file mode 100644 index 0000000000000000000000000000000000000000..3b42edbd812d548e8a0bae28e972f066c19d88d5 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch1.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch2.png b/assignment-3/submission/18307130341/img/kmeans_epoch2.png new file mode 100644 index 0000000000000000000000000000000000000000..71c4475554af3e253b14e32670c5de79f0c2d247 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch2.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch3.png b/assignment-3/submission/18307130341/img/kmeans_epoch3.png new file mode 100644 index 0000000000000000000000000000000000000000..516c55ebb0207e758b0d2b8324ed4005fc77e0cc Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch3.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch4.png b/assignment-3/submission/18307130341/img/kmeans_epoch4.png new file mode 
100644 index 0000000000000000000000000000000000000000..2479c13c3402972a3f3509e5521b77675199d486 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch4.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_sse.png b/assignment-3/submission/18307130341/img/kmeans_sse.png new file mode 100644 index 0000000000000000000000000000000000000000..67e74d3ec602c3372d2c53b2ce66ec0a5d843fbf Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_sse.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_test1.png b/assignment-3/submission/18307130341/img/kmeans_test1.png new file mode 100644 index 0000000000000000000000000000000000000000..67ce51967f2128059c6cf29f3f679ff7011538d0 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test1.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_test2.png b/assignment-3/submission/18307130341/img/kmeans_test2.png new file mode 100644 index 0000000000000000000000000000000000000000..7f2879e461593de78fb5a1148d5892f609831d4d Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test2.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_test3.png b/assignment-3/submission/18307130341/img/kmeans_test3.png new file mode 100644 index 0000000000000000000000000000000000000000..7d4259799d563ed415c6ba92ebff785ae6fd606d Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test3.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_test4.png b/assignment-3/submission/18307130341/img/kmeans_test4.png new file mode 100644 index 0000000000000000000000000000000000000000..c0b55dcc83ce302fa51d8a1c013e6e18f8293bf4 Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test4.png differ diff --git a/assignment-3/submission/18307130341/img/kmeans_testx.png b/assignment-3/submission/18307130341/img/kmeans_testx.png new file mode 100644 index 
0000000000000000000000000000000000000000..4936fe18a8977c17323ecd43a67928b62fd103b0
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_testx.png differ
diff --git a/assignment-3/submission/18307130341/img/testx.png b/assignment-3/submission/18307130341/img/testx.png
new file mode 100644
index 0000000000000000000000000000000000000000..206f670ee8b94ff5fcc1acd9661bfb936292e087
Binary files /dev/null and b/assignment-3/submission/18307130341/img/testx.png differ
diff --git a/assignment-3/submission/18307130341/source.py b/assignment-3/submission/18307130341/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..3b26d3c472638524a4e647ff1dcfacee81f3ec8e
--- /dev/null
+++ b/assignment-3/submission/18307130341/source.py
@@ -0,0 +1,296 @@
+import numpy as np
+
+class KMeans:
+
+    def __init__(self, n_clusters):
+        self.K = n_clusters
+        self.centers = []
+
+    def fit(self, train_data):
+        N = train_data.shape[0]
+        min_sse = 1e6
+        sel_centers = []
+        shuffle_index = np.arange(N)
+        for _ in range(5):
+            # randomly pick K distinct data points as the initial centers
+            np.random.shuffle(shuffle_index)
+            self.centers = []
+            for k in range(self.K):
+                self.centers.append(train_data[shuffle_index[k]])
+            last_sse = 1e6
+            # iterate for at most 10 rounds
+            for _ in range(10):
+                cluster = [[] for _ in range(self.K)]
+                # assign each point x to the center with the smallest distance
+                for x in train_data:
+                    dis = [np.sqrt(np.sum((x-self.centers[k])*(x-self.centers[k]))) for k in range(self.K)]
+                    cluster[int(np.argmin(dis))].append(x)
+                # recompute each cluster's centroid
+                for k in range(self.K):
+                    cluster[k] = np.array(cluster[k])
+                    if len(cluster[k]) != 0:
+                        self.centers[k] = np.mean(cluster[k], axis = 0)
+                predict = self.predict(train_data)
+                # compute the SSE and test for convergence
+                sse = cal_sse(predict, train_data, self.K)
+                if abs(last_sse - sse) / last_sse < 1e-3:
+                    break
+                last_sse = sse
+            # keep the model with the smallest SSE
+            if sse < min_sse:
+                sel_centers = self.centers
+                min_sse = sse
+        self.centers = sel_centers
+
+    def predict(self, test_data):
+        predict 
= []
+        for x in test_data:
+            dis = [np.sqrt(np.sum((x-self.centers[k])*(x-self.centers[k]))) for k in range(self.K)]
+            predict.append(np.argmin(dis))
+        return np.array(predict)
+
+class GaussianMixture:
+
+    def __init__(self, n_clusters):
+        self.K = n_clusters    # number of clusters K
+        self.means = []
+        self.covars = []
+        self.weights = []
+        self.eps = 1e-40
+
+    # Gaussian probability density
+    def Gaussian(self, x, means, covars):
+        dim = len(x)
+        if dim > 1:
+            det = np.linalg.det(covars)
+            inv = np.linalg.inv(covars)
+            x_mu = (x - means).reshape(dim, 1)
+            const = 1 / ((2 * np.pi) ** (dim / 2)) / (det ** (1 / 2))
+            exp = -0.5 * (x_mu.T @ inv @ x_mu)
+        else:
+            covars = covars + self.eps
+            const = 1 / ((2 * np.pi) ** (1 / 2)) / (covars ** (1 / 2))
+            exp = -0.5 * ((x - means) ** 2) / covars
+        return float(const * np.exp(exp))
+
+    # E step: compute the posterior probabilities
+    def e_step(self, train_data):
+        N = train_data.shape[0]
+
+        self.gamma = np.zeros((N, self.K))
+        for x in range(N):
+            for k in range(self.K):
+                self.gamma[x][k] = self.weights[k] * self.Gaussian(train_data[x], self.means[k], self.covars[k])
+        self.gamma = self.gamma / (np.sum(self.gamma, axis = 1).reshape(N, 1) + self.eps)
+
+    # M step: maximize the evidence lower bound
+    def m_step(self, train_data):
+        N = train_data.shape[0]
+        dim = train_data.shape[1]
+
+        Nk = np.sum(self.gamma, axis = 0)
+        self.weights = np.mean(self.gamma, axis = 0)
+
+        for k in range(self.K):
+            self.means[k] = (1.0/Nk[k]) * np.sum(np.array([self.gamma[n][k] * train_data[n] for n in range(N)]), axis=0)
+            xdiffs = train_data - self.means[k]
+            self.covars[k] = (1.0/Nk[k]) * np.sum([self.gamma[n][k]*(xdiffs[n]*xdiffs[n].reshape(dim,1)) for n in range(N)], axis=0)
+
+    def fit(self, train_data):
+        N = train_data.shape[0]
+        dim = train_data.shape[1]
+
+        # initialize the weights, means and covariance matrices
+        self.weights = np.full((self.K), 1.0 / self.K)
+        # initialize the means with a KMeans model
+        model1 = KMeans(self.K)
+        model1.fit(train_data)
+        self.means = model1.centers
+        # self.means = np.random.uniform(0, 1, (self.K, dim)) * np.max(train_data, axis=0)
+        if dim > 1:
+            self.covars = 
np.array([np.identity(dim) for _ in range(self.K)])
+        else:
+            self.covars = np.array([np.random.random()*np.var(train_data) for _ in range(self.K)])
+
+        # iterate until convergence
+        last_loglikelyhood = -1e8
+        for _ in range(10):
+            self.e_step(train_data)
+            self.m_step(train_data)
+            # compute the (mean) log marginal likelihood: log p(x) = log sum_k pi_k N(x; mu_k, Sigma_k)
+            loglikelyhood = []
+            for n in range(N):
+                p_x = np.sum([self.weights[k] * self.Gaussian(train_data[n], self.means[k], self.covars[k]) for k in range(self.K)])
+                loglikelyhood.append(np.log(p_x + self.eps))
+            loglikelyhood = np.mean(loglikelyhood)
+            # test for convergence
+            if abs(loglikelyhood - last_loglikelyhood) / abs(last_loglikelyhood) < 1e-5:
+                break
+            last_loglikelyhood = loglikelyhood
+
+    def predict(self, test_data):
+        N = test_data.shape[0]
+        predict = []
+        for n in range(N):
+            tmp = [self.weights[k] * self.Gaussian(test_data[n], self.means[k], self.covars[k]) for k in range(self.K)]
+            predict.append(np.argmax(np.array(tmp)))
+        predict = np.array(predict)
+        return predict
+
+class ClusteringAlgorithm:
+
+    def __init__(self):
+        self.K = -1
+
+    def fit(self, train_data):
+        N = train_data.shape[0]
+        dim = train_data.shape[1]
+
+        first = cal_sse([0]*N, train_data, 1)
+        last_sse = first
+        for k in range(2,10):
+            model = KMeans(k)
+            model.fit(train_data)
+            predict = model.predict(train_data)
+            sse = cal_sse(predict, train_data, k)
+            delta_sse = last_sse - sse
+            if delta_sse / first < 0.1:
+                self.K = k - 1
+                break
+            last_sse = sse
+        else:
+            # no elbow found in the range: fall back to the largest k tried
+            self.K = 9
+        self.model = KMeans(self.K)
+        self.model.fit(train_data)
+        print(self.K)
+
+    def predict(self, test_data):
+        predict = self.model.predict(test_data)
+        return predict
+
+# within-cluster sum of squared errors (SSE, averaged)
+def cal_sse(predict, test_data, K):
+    N = len(test_data)
+    sse = []
+    centers = []
+    for k in range(K):
+        tmp = np.array([test_data[x] for x in range(N) if predict[x] == k])
+        centers.append(np.mean(tmp, axis=0))
+    for x in range(N):
+        diff = test_data[x] - centers[predict[x]]
+        sse.append(diff*diff)
+    return np.mean(sse)
+
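The README's section 4.2 describes the Calinski-Harabasz index but notes that it was not implemented. A minimal sketch of that criterion, alongside the other metric helpers, could look like the following; `cal_CH` is a hypothetical name and the trace-based form follows the README's formula rather than any particular library:

```python
import numpy as np

# Hypothetical helper (not part of the submission): Calinski-Harabasz index,
# s(k) = tr(B) / tr(W) * (N - K) / (K - 1), as in README section 4.2.
def cal_CH(predict, data, K):
    N = len(data)
    labels = np.asarray(predict)
    c_all = np.mean(data, axis=0)   # center of the whole dataset (c_E)
    tr_B = 0.0                      # trace of the between-cluster scatter
    tr_W = 0.0                      # trace of the within-cluster scatter
    for k in range(K):
        members = data[labels == k]
        if len(members) == 0:
            continue
        c_k = np.mean(members, axis=0)
        tr_B += len(members) * np.sum((c_k - c_all) ** 2)
        tr_W += np.sum((members - c_k) ** 2)
    return (tr_B / tr_W) * (N - K) / (K - 1)
```

Like `cal_Silhouette`, it takes the predicted labels and the data; larger scores indicate tighter, better-separated clusters.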
+# silhouette coefficient
+def cal_Silhouette(predict, test_data):
+    N = len(test_data)
+    Silhouette = []
+    for x in range(N):
+        a = []
+        b = []
+        for y in range(N):
+            if x != y:
+                dis = np.sqrt(np.sum((test_data[x]-test_data[y])*(test_data[x]-test_data[y])))
+                if predict[x] == predict[y]:
+                    a.append(dis)
+                else:
+                    # b is approximated by the mean distance to all points
+                    # outside x's cluster (not only the nearest other cluster)
+                    b.append(dis)
+        a = np.mean(a)
+        b = np.mean(b)
+        Silhouette.append((b-a)/max(a,b))
+    return np.mean(Silhouette)
+
+# Adjusted Rand Index (ARI)
+def cal_ARI(K, label, predict, test_data):
+    N = len(test_data)
+    sum_label_predict = np.zeros((K,K), dtype = int)
+    sum_label = np.zeros(K, dtype = int)
+    sum_predict = np.zeros(K, dtype = int)
+    for x in range(N):
+        sum_label_predict[label[x]][predict[x]] += 1
+        sum_label[label[x]] += 1
+        sum_predict[predict[x]] += 1
+    Index = 0
+    Index_label = 0
+    Index_predict = 0
+    for i in range(K):
+        for j in range(K):
+            Index += sum_label_predict[i][j] * (sum_label_predict[i][j] - 1) / 2
+    for i in range(K):
+        Index_label += sum_label[i] * (sum_label[i] - 1) / 2
+        Index_predict += sum_predict[i] * (sum_predict[i] - 1) / 2
+    Cn2 = N * (N - 1) / 2
+    ExpectedIndex = Index_label * Index_predict / Cn2
+    MaxIndex = 0.5 * (Index_label + Index_predict)
+    ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex)
+    return ARI
+
+def draw_plot(data):
+    import matplotlib.pyplot as plt
+    plt.plot(data)
+    plt.show()
+
+# scatter plot of up to three classes (red / blue / green)
+def drawfig(data, label):
+    if data.shape[1] <= 1:
+        return
+
+    import matplotlib.pyplot as plt
+    N = len(data)
+    ansx = [data[x] for x in range(N) if label[x] == 0]
+    ansy = [data[x] for x in range(N) if label[x] == 1]
+    ansz = [data[x] for x in range(N) if label[x] == 2]
+    ansx = np.array(ansx)
+    ansy = np.array(ansy)
+    ansz = np.array(ansz)
+    if len(ansx) > 0:
+        plt.scatter(ansx[:,0],ansx[:,1],color='r')
+    if len(ansy) > 0:
+        plt.scatter(ansy[:,0],ansy[:,1],color='b')
+    if len(ansz) > 0:
+        plt.scatter(ansz[:,0],ansz[:,1],color='g')
+    plt.show()
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        
np.ones((d.shape[0],), dtype=int)*i + for (i, d) in enumerate(datas) + ]) + N = data.shape[0] + idx = np.arange(N) + np.random.shuffle(idx) + data = data[idx] + label = label[idx] + return data, label + +def mytest(): + import matplotlib.pyplot as plt + mean = (1, 2) + cov = np.array([[73, 0], [0, 22]]) + x = np.random.multivariate_normal(mean, cov, (800,)) + + mean = (16, -5) + cov = np.array([[21.2, 0], [0, 32.1]]) + y = np.random.multivariate_normal(mean, cov, (200,)) + + mean = (10, 22) + cov = np.array([[10, 5], [5, 10]]) + z = np.random.multivariate_normal(mean, cov, (1000,)) + + drawfig(np.concatenate((x,y,z)),[0]*800+[1]*200+[2]*1000) + + data, label = shuffle(x, y, z) + + model1 = KMeans(3) + model1.fit(data) + predict = model1.predict(data) + drawfig(data, predict) + + +if __name__ == "__main__": + mytest() + + diff --git a/assignment-3/submission/18307130341/tester_demo.py b/assignment-3/submission/18307130341/tester_demo.py new file mode 100644 index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89 --- /dev/null +++ b/assignment-3/submission/18307130341/tester_demo.py @@ -0,0 +1,117 @@ +import numpy as np +import sys + +from source import KMeans, GaussianMixture + + +def shuffle(*datas): + data = np.concatenate(datas) + label = np.concatenate([ + np.ones((d.shape[0],), dtype=int)*i + for (i, d) in enumerate(datas) + ]) + N = data.shape[0] + idx = np.arange(N) + np.random.shuffle(idx) + data = data[idx] + label = label[idx] + return data, label + + +def data_1(): + mean = (1, 2) + cov = np.array([[73, 0], [0, 22]]) + x = np.random.multivariate_normal(mean, cov, (800,)) + + mean = (16, -5) + cov = np.array([[21.2, 0], [0, 32.1]]) + y = np.random.multivariate_normal(mean, cov, (200,)) + + mean = (10, 22) + cov = np.array([[10, 5], [5, 10]]) + z = np.random.multivariate_normal(mean, cov, (1000,)) + + data, _ = shuffle(x, y, z) + return (data, data), 3 + + +def data_2(): + train_data = np.array([ + [23, 12, 173, 2134], + 
[99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def data_3():
+    train_data = np.array([
+        [23],
+        [-2999],
+        [-2955],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(
+        res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_3, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        try:
+            res = case[2] if case[1]() else 0
+        except Exception as e:
+            if err_report:
+                print("Error [{}] occurs in {}".format(str(e), case[0]))
+            res = 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()