diff --git a/assignment-3/submission/18307130341/README.md b/assignment-3/submission/18307130341/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a55f5a564d88ede737aa00c8ba0e95ab450eebd9
--- /dev/null
+++ b/assignment-3/submission/18307130341/README.md
@@ -0,0 +1,243 @@
+# Lab Report: ASS3 - KMeans & GMM
+
+18307130341 黄韵澄
+
+[toc]
+
+### 1. Overview
+
+ This assignment implements both the KMeans and GMM models, which cluster an input point set automatically given the number of clusters `n_clusters`. The Elbow Method is also combined with the KMeans algorithm to determine the number of clusters in a dataset automatically before clustering.
+
+### 2. K-Means
+
+#### 2.1 Overview of K-Means
+
+ Randomly initialize `K` `centers`. On each pass over the data, assign every point to the **nearest** of the `centers`; once all points are assigned, the **centroid** of each cluster becomes its new center. Iterate until the algorithm converges, then stop.
+
+ For robustness, every cluster should stay non-empty in every iteration. It suffices to make the randomly initialized `centers` `K` mutually distinct data points (each such point is then guaranteed to fall in its own cluster). Concretely, `train_data` is shuffled and the first `K` points are taken as the `centers`.
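The shuffle-based initialization and the nearest-center assignment can be sketched as follows (an illustrative NumPy sketch, not the submitted implementation; `init_centers` and `assign` are hypothetical helper names):

```python
import numpy as np

def init_centers(train_data, K, seed=None):
    """Pick K distinct data points as initial centers by shuffling the indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(train_data))
    return train_data[idx[:K]].copy()

def assign(train_data, centers):
    """Label each point with the index of its nearest center."""
    # (N, K) matrix of Euclidean distances from every point to every center
    d = np.linalg.norm(train_data[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)
```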
+
+#### 2.2 Evaluating K-Means and Detecting Convergence
+
+ Three metrics were considered for evaluating the K-Means results:
+
+- Within-cluster sum of squared errors (SSE): the total squared distance from each sample to its cluster center. It only looks at the data inside each cluster; relations between clusters are not considered.
+
+$$
+SSE = \sum_{i=1}^k \sum_{p\in C_i}\|p -m_i\|^2
+$$
+
+ where $C_i$ denotes the $i$-th cluster and $m_i$ the center of the $i$-th cluster.
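A minimal sketch of this metric (a hypothetical `sse` helper; like the submitted `cal_sse`, it averages rather than sums, which only rescales the curve):

```python
import numpy as np

def sse(data, labels, K):
    """Mean squared distance from each point to its cluster centroid."""
    per_point = []
    for k in range(K):
        pts = data[labels == k]
        if len(pts) == 0:
            continue  # empty clusters contribute nothing
        per_point.append(((pts - pts.mean(axis=0)) ** 2).sum(axis=1))
    return float(np.concatenate(per_point).mean())
```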
+
+- Silhouette coefficient: combines the two factors of cohesion and separation.
+
+$$
+C(x):\ \text{the cluster that contains } x \\\\
+a(x) = \frac{1}{|C(x)|-1}\cdot \sum_{p\in C(x),\ p\neq x}dis(x,p) \\\\
+b(x) = \min_{C\neq C(x)}\ \frac{1}{|C|}\sum_{p\in C}dis(x,p) \\\\
+S(x) = \frac{b(x)-a(x)}{\max\{a(x),b(x)\}} \\\\
+\text{silhouette score} = \text{the mean of } S(x) \text{ over all samples}
+$$
+
+ The silhouette score lies in $[-1,1]$; the closer to 1, the better both cohesion and separation are.
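A direct $O(N^2)$ sketch of the silhouette score (a hypothetical `silhouette` helper, using the min-over-other-clusters mean for $b(x)$):

```python
import numpy as np

def silhouette(data, labels):
    """Mean silhouette coefficient over all samples (O(N^2) pairwise distances)."""
    n = len(data)
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = d[i, same].mean()
        b = min(d[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```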
+
+- Adjusted Rand Index (ARI): a supervised metric; it requires the ground-truth class assignment C, with K being the clustering result.
+
+$$
+RI = \frac{a+b}{C_2^{n_{samples}}} \\\\
+ARI = \frac{RI-E[RI]}{\max(RI)-E[RI]}
+$$
+
+ Here $a$ is the number of element pairs assigned to the same class in both C and K, and $b$ is the number of pairs assigned to different classes in both. $C_2^{n_{samples}}$ is the total number of pairs.
+
+ To compute it, first build the contingency table, whose entry $n_{ij}$ counts the points that belong simultaneously to cluster $i$ of $K(Y)$ and class $j$ of $C(X)$; the ARI is then obtained from this table as
+
+$$
+ARI = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]/\binom{n}{2}}
+$$
+
+ where $a_i$ and $b_j$ are the row and column sums of the table.
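The table-based computation can be sketched as follows (a hypothetical `ari` helper; the submitted `cal_ARI` follows the same scheme):

```python
import numpy as np

def ari(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table."""
    t, p = np.unique(labels_true), np.unique(labels_pred)
    n = np.zeros((len(t), len(p)), dtype=np.int64)
    for a, b in zip(labels_true, labels_pred):
        n[np.searchsorted(t, a), np.searchsorted(p, b)] += 1
    comb2 = lambda m: m * (m - 1) // 2  # "m choose 2", elementwise
    sum_ij = comb2(n).sum()
    sum_a = comb2(n.sum(axis=1)).sum()  # from row sums a_i
    sum_b = comb2(n.sum(axis=0)).sum()  # from column sums b_j
    expected = sum_a * sum_b / comb2(n.sum())
    max_index = (sum_a + sum_b) / 2
    return float((sum_ij - expected) / (max_index - expected))
```
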
+A quick test of the three metrics on a two-dimensional three-class dataset:
+
+
+
+  Fig1: mean SSE vs. number of iterations
+
+
+
+  Fig2: silhouette coefficient vs. number of iterations
+
+
+
+  Fig3: adjusted Rand index vs. number of iterations
+
+
+ As the plots show, all three metrics converge overall as the number of iterations grows. One can therefore set a threshold `threshold` and declare convergence once a metric's fluctuation falls below it. Since $SSE$ and $ARI$ cost $O(N)$ to compute while the silhouette coefficient costs $O(N^2)$, and since $SSE$ and the silhouette coefficient are unsupervised whereas $ARI$ requires the true labels, the program finally uses $SSE$ as the **convergence criterion** and $ARI$ as the **performance metric**.
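The resulting stopping rule is a two-liner (a sketch; `prev` is the metric's value from the previous iteration and is assumed positive, as SSE is):

```python
def converged(prev, curr, threshold=1e-3):
    """True once the relative fluctuation of the metric falls below threshold."""
    return abs(prev - curr) / prev < threshold
```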
+
+#### 2.3 Experiments and Algorithm Improvements
+
+ The program uses `numpy`'s Gaussian sampler to generate a three-class two-dimensional dataset for testing the KMeans model. The generation parameters are:
+$$
+\mu_0=[1,2] ,\Sigma_0=\begin{bmatrix}73&0 \\\\ 0&22\end{bmatrix},N_0 = 800\\\\
+\mu_1=[16,-5],\Sigma_1=\begin{bmatrix}21&0 \\\\ 0&32\end{bmatrix},N_1 = 200\\\\
+\mu_2=[10,22],\Sigma_2=\begin{bmatrix}10&5 \\\\ 5&10\end{bmatrix},N_2 = 1000\\\\
+$$
+ The resulting scatter plot:
+
+
+
+  Fig4: distribution of the generated data
+
+
+
+
+ With the $SSE$ fluctuation threshold set to $10^{-3}$, the results are:
+
+
+
+  Fig5: KMeans clustering results over the iterations
+
+
+ The algorithm converges quickly (it stops at the 4th iteration), with $SSE=19.24$ at convergence. The plots also show that the clusterings at epoch 3 and epoch 4 are already very close. However, because KMeans ignores the Gaussian nature of the data, the final clustering still deviates from the ground truth; this is an inherent drawback of the KMeans algorithm.
+
+ Repeating the experiment several times also produces failures like the one below:
+
+
+
+  Fig6: a failure mode of KMeans
+
+
+
+
+ Here the converged clustering is far from the expected result, with $SSE=34.71$. The main cause is that the **randomly initialized centers** all land inside a single cluster, so KMeans splits that cluster into several and mis-clusters the data. My fix is to train the KMeans model several times and keep the model with the best $SSE$ as the predictor.
+
+ Finally, KMeans was run 8 times, with the following results:
+
+| Run      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
+| -------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+| SSE | 19.66 | 19.44 | 20.18 | 19.91 | 20.10 | 19.29 | 20.05 | 20.19 |
+| ARI | 0.7788 | 0.7937 | 0.7736 | 0.7896 | 0.7913 | 0.7914 | 0.7908 | 0.7614 |
+
+ Mean $SSE=19.8525$ and mean $ARI=0.7838$; these serve as the baseline for the GMM results below.
+
+### 3. GaussianMixture
+
+#### 3.1 Overview of GaussianMixture
+
+ (The formulas below are for the multivariate Gaussian; if the input has `dim=1`, replace it with the one-dimensional Gaussian.)
+
+ The GMM algorithm represents each of the `K` clusters with a Gaussian distribution. In the fit stage, GMM learns the parameters $\pi_k,\mu_k,\Sigma_k$ from the `N` data points. In the predict stage, the label of a point $x$ is $\arg\max_k p(z=k|x)=\arg\max_k \pi_k N(x;\mu_k,\Sigma_k)$, where $\Sigma_k$ is the covariance matrix.
+
+- Parameter initialization (two options):
+  - Random: $\mu$ is initialized as in KMeans, by shuffling the data and taking the first `K` points as the initial means; $\Sigma$ is initialized to the identity matrix and $\pi$ to the uniform distribution.
+  - KMeans: run a KMeans model first and use its result to initialize the parameters $\pi, \mu, \Sigma$, then run the GMM algorithm.
+
+- Fit stage, E-step: with the parameters $\mu,\Sigma$ fixed, compute the posterior $p(z^{(n)}|x^{(n)})$:
+
+$$
+\gamma_{nk} =p(z^{(n)}=k|x^{(n)}) = \frac{\pi_kN(x^{(n)};\mu_k,\Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x^{(n)};\mu_{k'},\Sigma_{k'})}
+$$
+
+- Fit stage, M-step: maximize the evidence lower bound $ELBO(\gamma,D;\pi,\mu,\Sigma)$ subject to the constraint $\sum_{k=1}^K\pi_k = 1$:
+
+$$
+N_k = \sum_{n=1}^N\gamma_{nk}, \\\\
+\pi_k = \frac{N_k}{N},\\\\
+\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}x^{(n)},\\\\
+\Sigma_k =\frac{1}{N_k}\sum_{n=1}^{N}\gamma_{nk}(x^{(n)}-\mu_k)(x^{(n)}-\mu_k)^T
+$$
+
+- Convergence: the log marginal likelihood $\sum_{n=1}^{N}\log p(x^{(n)})$ serves as the convergence criterion, with a fluctuation threshold `threshold`; the algorithm stops once the change falls below it.
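One full EM iteration can be sketched in vectorized NumPy (an illustrative sketch with a hypothetical `em_step` helper and full covariances; the submitted implementation loops over points and handles the 1-D case separately):

```python
import numpy as np

def em_step(X, pi, mu, cov, eps=1e-12):
    """One EM iteration for a Gaussian mixture with full covariance matrices."""
    N, D = X.shape
    K = len(pi)
    # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n; mu_k, cov_k)
    gamma = np.empty((N, K))
    for k in range(K):
        diff = X - mu[k]
        inv = np.linalg.inv(cov[k])
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(cov[k]))
        gamma[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
    gamma /= gamma.sum(axis=1, keepdims=True) + eps
    # M-step: re-estimate pi, mu, cov from the responsibilities
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    cov = np.empty((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, cov
```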
+
+#### 3.2 Problems with GMM and Their Fixes
+
+ Testing on data generated with the same parameters as in 2.3, with **random initialization**, gives:
+
+
+
+  Fig7: six GMM clustering runs
+
+
+
+
+ As with KMeans in 2.3, random initialization has a large effect on training quality. Possible fixes: train the model several times and keep the run with the largest log marginal likelihood, or initialize with a KMeans model. The program adopts KMeans initialization, which gives the GMM consistently high clustering accuracy.
+
+#### 3.3 GMM Experiments and Evaluation
+
+ Data are generated with the same parameters as in 2.3, and the fluctuation threshold of the log marginal likelihood is set to $10^{-5}$. The algorithm was run 8 times, with the following results:
+
+| Run      | 1      | 2      | 3      | 4      | 5      | 6      | 7      | 8      |
+| -------- | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
+| SSE | 19.78 | 20.05 | 19.88 | 19.65 | 20.05 | 19.29 | 19.77 | 19.89 |
+| ARI | 0.7778 | 0.7966 | 0.7812 | 0.7908 | 0.7910 | 0.7921 | 0.7908 | 0.7721 |
+
+ Mean $SSE=19.795$ and mean $ARI=0.78655$: slightly better than KMeans in 2.3, but not markedly so.
+
+ To make GMM's advantage over KMeans clearly visible, I generated new data with the following parameters:
+$$
+\mu_0=[12,18] ,\Sigma_0=\begin{bmatrix}1&0 \\\\ 0&1\end{bmatrix},N_0 = 200\\\\
+\mu_1=[15,10],\Sigma_1=\begin{bmatrix}2&0 \\\\ 0&10\end{bmatrix},N_1 = 800\\\\
+\mu_2=[18,18],\Sigma_2=\begin{bmatrix}1&0 \\\\ 0&1\end{bmatrix},N_2 = 200\\\\
+$$
+ The resulting scatter plot:
+
+
+
+  Fig8: distribution of the generated data
+
+
+
+
+ The clustering results of the two models:
+
+
+
+  Fig9: KMeans vs. GMM clustering results
+
+
+
+
+ KMeans attains $SSE=3.1894$, $ARI=0.7269$; GMM attains $SSE=3.8150$, $ARI=0.9696$. Clearly, GMM exploits the **Gaussian nature** of the clusters: although its $SSE$ is larger, its $ARI$ is much closer to 1, meaning GMM's clustering is closer to the true distribution and more accurate than KMeans.
+
+### 4. Choosing the Number of Clusters Automatically
+
+#### 4.1 Elbow-KMeans
+
+ Elbow-KMeans uses $SSE$ as its criterion. Given a range for $K$ (say 1 to 10), compute the $SSE$ for each value. $SSE$ decreases as $K$ grows, but for reasonably separable data it improves sharply up to some critical point and only slowly afterwards; that critical point is a good choice of $K$. The curve looks like an elbow, as below:
+
+
+
+  Fig10: the elbow curve
+
+
+
+
+ Finding the **elbow** automatically: build a KMeans model for every $K$ in the range and compute its $SSE$. Let $GradSSE[k] = SSE[k-1]-SSE[k]$ and fix a threshold $threshold=0.1$; the first $k$ with $GradSSE[k]/SSE[0] < threshold$ marks the elbow, and $K=k-1$ is selected as the number of clusters (this is what `ClusteringAlgorithm` implements).
+
+  Fig11: SSE vs. k for the ClusteringAlgorithm model
+
+
+
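The selection rule above can be sketched as follows (a hypothetical `pick_k_by_elbow` helper; `sse[i]` is assumed to hold the SSE obtained with $K=i+1$ clusters):

```python
def pick_k_by_elbow(sse, threshold=0.1):
    """Return the elbow K, given sse[i] = SSE obtained with K = i + 1 clusters."""
    for i in range(1, len(sse)):
        # marginal improvement from K = i to K = i + 1, relative to SSE[0]
        if (sse[i - 1] - sse[i]) / sse[0] < threshold:
            return i  # the previous K, i.e. K = i, is the elbow
    return len(sse)
```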
+
+#### 4.2 A Possible Alternative: the Optimal Calinski-Harabasz Score
+
+ The silhouette computation mentioned in 2.2 is rather slow; a simpler metric can replace it. The Calinski-Harabasz (CH) score likewise combines within-cluster and between-cluster distances. It is computed as:
+$$
+s(K) = \frac{tr(B_K)}{tr(W_K)}\cdot\frac{N-K}{K-1},\\\\
+B_K = \sum_{k=1}^{K}n_k(c_k-c_E)(c_k-c_E)^T,\\\\
+W_K = \sum_{k=1}^{K}\sum_{x\in C_k}(x-c_k)(x-c_k)^T
+$$
+ Here $B_K$ is the between-cluster dispersion matrix, $W_K$ the within-cluster dispersion matrix, $tr$ the trace, $n_k$ the size of cluster $k$, $c_k$ the center of cluster $k$, and $c_E$ the center of the whole dataset.
+
+ In practice, the CH score tends to rise as the number of clusters falls, peaking at $k=2$. When that happens, the $K$ values must be tested one by one to find a local maximum of the score; the $K$ attaining it is the current best choice. The CH score is thus a candidate method, but it was not implemented in this assignment.
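A compact sketch of the score (a hypothetical `ch_score` helper; only the traces of $B_K$ and $W_K$ are needed, so the full matrices are never formed):

```python
import numpy as np

def ch_score(X, labels):
    """Calinski-Harabasz score: tr(B_K)/tr(W_K) * (N-K)/(K-1), for K >= 2."""
    N = len(X)
    ks = np.unique(labels)
    K = len(ks)
    c_all = X.mean(axis=0)
    tr_b = tr_w = 0.0
    for k in ks:
        pts = X[labels == k]
        c_k = pts.mean(axis=0)
        tr_b += len(pts) * np.sum((c_k - c_all) ** 2)  # between-cluster dispersion
        tr_w += np.sum((pts - c_k) ** 2)               # within-cluster dispersion
    return float(tr_b / tr_w * (N - K) / (K - 1))
```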
+
+### 5. References
+
+[1] [Calinski-Harabasz Score explained](https://blog.csdn.net/chloe_ou/article/details/105277431)
+
+[2] [Computing the Adjusted Rand Index (ARI)](https://blog.csdn.net/qq_42887760/article/details/105728101)
+
+[3] [Choosing K in the K-Means algorithm](https://www.codercto.com/a/26692.html)
+
+[4] [Evaluating and improving K-means performance](https://www.cnblogs.com/uestc-mm/p/10682633.html)
+
diff --git a/assignment-3/submission/18307130341/img/GMM_data_raw.png b/assignment-3/submission/18307130341/img/GMM_data_raw.png
new file mode 100644
index 0000000000000000000000000000000000000000..a89f9181c98479c71ce780db6d2a22519167c003
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_data_raw.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test0.png b/assignment-3/submission/18307130341/img/GMM_test0.png
new file mode 100644
index 0000000000000000000000000000000000000000..b6b5e935a7ea9276edc210d74421eeb977e33c43
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test0.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test1.png b/assignment-3/submission/18307130341/img/GMM_test1.png
new file mode 100644
index 0000000000000000000000000000000000000000..da71843e734b4e86476ec6401b7d725109fa4bdf
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test1.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test2.png b/assignment-3/submission/18307130341/img/GMM_test2.png
new file mode 100644
index 0000000000000000000000000000000000000000..adad0d32c4babd08a3de53287d2f9714dde46e61
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test2.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test3.png b/assignment-3/submission/18307130341/img/GMM_test3.png
new file mode 100644
index 0000000000000000000000000000000000000000..e4a5f2a6cf97da52c44f4e94e0f85df4bbedaa72
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test3.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test4.png b/assignment-3/submission/18307130341/img/GMM_test4.png
new file mode 100644
index 0000000000000000000000000000000000000000..819d7d78edbbdfef3bbb5ac013a297319c4aa420
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test4.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test5.png b/assignment-3/submission/18307130341/img/GMM_test5.png
new file mode 100644
index 0000000000000000000000000000000000000000..5befa6ae90d9d90946da3193bf69ff32e195cb7a
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test5.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test6.png b/assignment-3/submission/18307130341/img/GMM_test6.png
new file mode 100644
index 0000000000000000000000000000000000000000..93bdf18d49cf12803b7bd52ee78fd41a97e005a0
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test6.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_test7.png b/assignment-3/submission/18307130341/img/GMM_test7.png
new file mode 100644
index 0000000000000000000000000000000000000000..dddc949e4b1a5951898dcfe1286438292183e248
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_test7.png differ
diff --git a/assignment-3/submission/18307130341/img/GMM_testx.png b/assignment-3/submission/18307130341/img/GMM_testx.png
new file mode 100644
index 0000000000000000000000000000000000000000..556f2a9ddf9e4c777b054934dde8a4a4c94a4929
Binary files /dev/null and b/assignment-3/submission/18307130341/img/GMM_testx.png differ
diff --git a/assignment-3/submission/18307130341/img/cal_ari.png b/assignment-3/submission/18307130341/img/cal_ari.png
new file mode 100644
index 0000000000000000000000000000000000000000..f18abc1fab2f5a31f460e12930d07579e587f06d
Binary files /dev/null and b/assignment-3/submission/18307130341/img/cal_ari.png differ
diff --git a/assignment-3/submission/18307130341/img/data_raw.png b/assignment-3/submission/18307130341/img/data_raw.png
new file mode 100644
index 0000000000000000000000000000000000000000..64114d065fd29b65e9c4c4df422829f79dd5bb84
Binary files /dev/null and b/assignment-3/submission/18307130341/img/data_raw.png differ
diff --git a/assignment-3/submission/18307130341/img/elbow.png b/assignment-3/submission/18307130341/img/elbow.png
new file mode 100644
index 0000000000000000000000000000000000000000..3e21341af41d645cbd7e002af840aa92d22fe479
Binary files /dev/null and b/assignment-3/submission/18307130341/img/elbow.png differ
diff --git a/assignment-3/submission/18307130341/img/elbow_test.png b/assignment-3/submission/18307130341/img/elbow_test.png
new file mode 100644
index 0000000000000000000000000000000000000000..346b60be9b9e675c0eaa195d722f7f1d9f23f7cf
Binary files /dev/null and b/assignment-3/submission/18307130341/img/elbow_test.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_Silhouette.png b/assignment-3/submission/18307130341/img/kmeans_Silhouette.png
new file mode 100644
index 0000000000000000000000000000000000000000..c1d615591572bc334802fa6cfcc84294aa1e7dc2
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_Silhouette.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_ari.png b/assignment-3/submission/18307130341/img/kmeans_ari.png
new file mode 100644
index 0000000000000000000000000000000000000000..ae7ec1d2a57eb21d6110fd5b08d950668f0f1b02
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_ari.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch1.png b/assignment-3/submission/18307130341/img/kmeans_epoch1.png
new file mode 100644
index 0000000000000000000000000000000000000000..3b42edbd812d548e8a0bae28e972f066c19d88d5
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch1.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch2.png b/assignment-3/submission/18307130341/img/kmeans_epoch2.png
new file mode 100644
index 0000000000000000000000000000000000000000..71c4475554af3e253b14e32670c5de79f0c2d247
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch2.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch3.png b/assignment-3/submission/18307130341/img/kmeans_epoch3.png
new file mode 100644
index 0000000000000000000000000000000000000000..516c55ebb0207e758b0d2b8324ed4005fc77e0cc
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch3.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_epoch4.png b/assignment-3/submission/18307130341/img/kmeans_epoch4.png
new file mode 100644
index 0000000000000000000000000000000000000000..2479c13c3402972a3f3509e5521b77675199d486
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_epoch4.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_sse.png b/assignment-3/submission/18307130341/img/kmeans_sse.png
new file mode 100644
index 0000000000000000000000000000000000000000..67e74d3ec602c3372d2c53b2ce66ec0a5d843fbf
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_sse.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_test1.png b/assignment-3/submission/18307130341/img/kmeans_test1.png
new file mode 100644
index 0000000000000000000000000000000000000000..67ce51967f2128059c6cf29f3f679ff7011538d0
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test1.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_test2.png b/assignment-3/submission/18307130341/img/kmeans_test2.png
new file mode 100644
index 0000000000000000000000000000000000000000..7f2879e461593de78fb5a1148d5892f609831d4d
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test2.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_test3.png b/assignment-3/submission/18307130341/img/kmeans_test3.png
new file mode 100644
index 0000000000000000000000000000000000000000..7d4259799d563ed415c6ba92ebff785ae6fd606d
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test3.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_test4.png b/assignment-3/submission/18307130341/img/kmeans_test4.png
new file mode 100644
index 0000000000000000000000000000000000000000..c0b55dcc83ce302fa51d8a1c013e6e18f8293bf4
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_test4.png differ
diff --git a/assignment-3/submission/18307130341/img/kmeans_testx.png b/assignment-3/submission/18307130341/img/kmeans_testx.png
new file mode 100644
index 0000000000000000000000000000000000000000..4936fe18a8977c17323ecd43a67928b62fd103b0
Binary files /dev/null and b/assignment-3/submission/18307130341/img/kmeans_testx.png differ
diff --git a/assignment-3/submission/18307130341/img/testx.png b/assignment-3/submission/18307130341/img/testx.png
new file mode 100644
index 0000000000000000000000000000000000000000..206f670ee8b94ff5fcc1acd9661bfb936292e087
Binary files /dev/null and b/assignment-3/submission/18307130341/img/testx.png differ
diff --git a/assignment-3/submission/18307130341/source.py b/assignment-3/submission/18307130341/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..3b26d3c472638524a4e647ff1dcfacee81f3ec8e
--- /dev/null
+++ b/assignment-3/submission/18307130341/source.py
@@ -0,0 +1,296 @@
+import numpy as np
+
+class KMeans:
+
+ def __init__(self, n_clusters):
+ self.K = n_clusters
+ self.centers = []
+
+ def fit(self, train_data):
+ N = train_data.shape[0]
+ # dim = train_data.shape[1]
+        min_sse = float("inf")
+ sel_centers = []
+ shuffle_index = np.arange(N)
+ for _ in range(5):
+            # randomly pick the initial centers
+ np.random.shuffle(shuffle_index)
+ self.centers = []
+ for k in range(self.K):
+ self.centers.append(train_data[shuffle_index[k]])
+ last_sse = 1e6
+            # iterate for several rounds
+ for _ in range(10):
+ cluster = [[] for _ in range(self.K)]
+                # assign each point x to its nearest center
+ for x in train_data:
+ dis = [np.sqrt(np.sum((x-self.centers[k])*(x-self.centers[k]))) for k in range(self.K)]
+ cluster[int(np.argmin(dis))].append(x)
+                # recompute the centroid of each cluster
+ for k in range(self.K):
+ cluster[k] = np.array(cluster[k])
+ if len(cluster[k]) != 0:
+ self.centers[k] = np.mean(cluster[k], axis = 0)
+ predict = self.predict(train_data)
+                # compute the SSE and check for convergence
+ sse = cal_sse(predict, train_data, self.K)
+ if abs(last_sse - sse) / last_sse < 1e-3:
+ break
+ last_sse = sse
+            # keep the model with the smallest SSE
+ if sse < min_sse:
+ sel_centers = self.centers
+ min_sse = sse
+ self.centers = sel_centers
+
+ def predict(self, test_data):
+ predict = []
+ for x in test_data:
+ dis = [np.sqrt(np.sum((x-self.centers[k])*(x-self.centers[k]))) for k in range(self.K)]
+ predict.append(np.argmin(dis))
+ return np.array(predict)
+
+class GaussianMixture:
+
+ def __init__(self, n_clusters):
+        self.K = n_clusters  # number of clusters K
+ self.means = []
+ self.covars = []
+ self.weights = []
+ self.eps = 1e-40
+
+    # Gaussian probability density
+ def Gaussian(self, x, means, covars):
+ dim = len(x)
+ if dim > 1:
+ det = np.linalg.det(covars)
+ inv = np.matrix(np.linalg.inv(covars))
+ x_mu = np.matrix(x - means).T
+ const = 1 / ((2 * np.pi) ** (dim / 2)) / (det ** (1 / 2))
+ exp = -0.5 * x_mu.T * inv * x_mu
+ else:
+ covars += self.eps
+ const = 1 / ((2 * np.pi) ** (1 / 2)) / (covars ** (1 / 2))
+ exp = -0.5 * ((x - means) ** 2) / covars
+ return float(const * np.exp(exp))
+
+    # E-step: compute the posterior probabilities
+ def e_step(self, train_data):
+ N = train_data.shape[0]
+
+ self.gamma = np.zeros((N, self.K))
+ for x in range(N):
+ for k in range(self.K):
+ self.gamma[x][k] = self.weights[k] * self.Gaussian(train_data[x], self.means[k], self.covars[k])
+ self.gamma = self.gamma / (np.sum(self.gamma, axis = 1).reshape(N, 1) + self.eps)
+
+    # M-step: maximize the evidence lower bound
+ def m_step(self, train_data):
+ N = train_data.shape[0]
+ dim = train_data.shape[1]
+
+ Nk = np.sum(self.gamma, axis = 0)
+ self.weights = np.mean(self.gamma, axis = 0)
+
+ for k in range(self.K):
+ self.means[k] = (1.0/Nk[k]) * np.sum(np.array([self.gamma[n][k] * train_data[n] for n in range(N)]), axis=0)
+ xdiffs = train_data - self.means[k]
+ self.covars[k] = (1.0/Nk[k]) * np.sum([self.gamma[n][k]*(xdiffs[n]*xdiffs[n].reshape(dim,1)) for n in range(N)], axis=0)
+
+ def fit(self, train_data):
+ N = train_data.shape[0]
+ dim = train_data.shape[1]
+
+        # initialize the weights, means, and covariance matrices
+ self.weights = np.full((self.K), 1.0 / self.K)
+        # initialize the means with KMeans
+ model1 = KMeans(self.K)
+ model1.fit(train_data)
+ self.means = model1.centers
+ # self.means = np.random.uniform(0, 1, (self.K, dim)) * np.max(train_data, axis=0)
+ if dim == 2:
+ self.covars = np.array([np.identity(dim) for _ in range(self.K)])
+ else:
+ self.covars = np.array([np.random.random()*np.var(train_data) for _ in range(self.K)])
+
+        # iterate until convergence
+ # loglikelyhood = 0
+ last_loglikelyhood = -1e8
+ for _ in range(10):
+ self.e_step(train_data)
+ self.m_step(train_data)
+            # compute the log marginal likelihood
+ loglikelyhood = []
+ for n in range(N):
+ tmp = [np.sum(self.weights[k] * self.Gaussian(train_data[n], self.means[k], self.covars[k])) for k in range(self.K)]
+ tmp = np.log(np.array(tmp)+self.eps)
+ loglikelyhood.append(tmp)
+ loglikelyhood = np.mean(loglikelyhood)
+            # check for convergence
+ # print(self.means)
+            if abs(loglikelyhood - last_loglikelyhood) / abs(last_loglikelyhood) < 1e-5:
+ break
+ last_loglikelyhood = loglikelyhood
+
+ def predict(self, test_data):
+ N = test_data.shape[0]
+ predict = []
+ for n in range(N):
+ tmp = [np.sum(self.weights[k] * self.Gaussian(test_data[n], self.means[k], self.covars[k])) for k in range(self.K)]
+ predict.append(np.argmax(np.array(tmp)))
+ predict = np.array(predict)
+ return predict
+
+class ClusteringAlgorithm:
+
+ def __init__(self):
+ self.K = -1
+
+ def fit(self, train_data):
+ N = train_data.shape[0]
+ dim = train_data.shape[1]
+
+ first = cal_sse([0]*N, train_data, 1)
+ last_sse = first
+ for k in range(2,10):
+ model = KMeans(k)
+ model.fit(train_data)
+ predict = model.predict(train_data)
+ sse = cal_sse(predict, train_data, k)
+ delta_sse = last_sse - sse
+ if delta_sse / first < 0.1:
+ self.K = k - 1
+ break
+ last_sse = sse
+ self.model = KMeans(self.K)
+ self.model.fit(train_data)
+ print(self.K)
+
+ def predict(self, test_data):
+ predict = self.model.predict(test_data)
+ return predict
+
+# sum of squared errors (SSE)
+def cal_sse(predict, test_data, K):
+ N = len(test_data)
+ sse = []
+ centers = []
+ for k in range(K):
+ tmp = np.array([test_data[x] for x in range(N) if predict[x] == k])
+ centers.append(np.mean(tmp, axis=0))
+ for x in range(N):
+ diff = test_data[x] - centers[predict[x]]
+ sse.append(diff*diff)
+ return np.mean(sse)
+
+# silhouette coefficient
+def cal_Silhouette(predict, test_data):
+ N = len(test_data)
+ Silhouette = []
+ for x in range(N):
+ a = []
+ b = []
+ for y in range(N):
+ if x != y:
+ dis = np.sqrt(np.sum((test_data[x]-test_data[y])*(test_data[x]-test_data[y])))
+ if predict[x] == predict[y]:
+ a.append(dis)
+ else:
+ b.append(dis)
+ a = np.mean(a)
+ b = np.mean(b)
+ Silhouette.append((b-a)/max(a,b))
+ return np.mean(Silhouette)
+
+# adjusted Rand index (ARI)
+def cal_ARI(K, label, predict, test_data):
+ N = len(test_data)
+    sum_label_predict = np.zeros((K,K), dtype=int)
+    sum_label = np.zeros(K, dtype=int)
+    sum_predict = np.zeros(K, dtype=int)
+ for x in range(N):
+ sum_label_predict[label[x]][predict[x]] += 1
+ sum_label[label[x]] += 1
+ sum_predict[predict[x]] += 1
+ Index = 0
+ Index_label = 0
+ Index_predict = 0
+ for i in range(K):
+ for j in range(K):
+ Index += sum_label_predict[i][j] * (sum_label_predict[i][j] - 1) / 2
+ for i in range(K):
+ Index_label += sum_label[i] * (sum_label[i] - 1) / 2
+ Index_predict += sum_predict[i] * (sum_predict[i] - 1) / 2
+ Cn2 = N * (N - 1) / 2
+ ExpectedIndex = Index_label * Index_predict / Cn2
+ MaxIndex = 0.5 * (Index_label + Index_predict)
+ ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex)
+ return ARI
+
+def draw_plot(data):
+ import matplotlib.pyplot as plt
+ plt.plot(data)
+ plt.show()
+
+def drawfig(data, label):
+ if data.shape[1] <= 1:
+ return
+
+ import matplotlib.pyplot as plt
+ N = len(data)
+ ansx = [data[x] for x in range(N) if label[x] == 0]
+ ansy = [data[x] for x in range(N) if label[x] == 1]
+ ansz = [data[x] for x in range(N) if label[x] == 2]
+ ansx = np.array(ansx)
+ ansy = np.array(ansy)
+ ansz = np.array(ansz)
+ if len(ansx) > 0:
+ plt.scatter(ansx[:,0],ansx[:,1],color='r')
+ if len(ansy) > 0:
+ plt.scatter(ansy[:,0],ansy[:,1],color='b')
+ if len(ansz) > 0:
+ plt.scatter(ansz[:,0],ansz[:,1],color='g')
+ plt.show()
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int)*i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+def mytest():
+ import matplotlib.pyplot as plt
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ drawfig(np.concatenate((x,y,z)),[0]*800+[1]*200+[2]*1000)
+
+ data, label = shuffle(x, y, z)
+
+ model1 = KMeans(3)
+ model1.fit(data)
+ predict = model1.predict(data)
+ drawfig(data, predict)
+
+
+if __name__ == "__main__":
+ mytest()
+
+
diff --git a/assignment-3/submission/18307130341/tester_demo.py b/assignment-3/submission/18307130341/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89
--- /dev/null
+++ b/assignment-3/submission/18307130341/tester_demo.py
@@ -0,0 +1,117 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int)*i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+
+def data_1():
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+
+def data_2():
+ train_data = np.array([
+ [23, 12, 173, 2134],
+ [99, -12, -126, -31],
+ [55, -145, -123, -342],
+ ])
+ return (train_data, train_data), 2
+
+
+def data_3():
+ train_data = np.array([
+ [23],
+ [-2999],
+ [-2955],
+ ])
+ return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+ model = algorithm_class(n_clusters)
+ model.fit(train_data)
+ res = model.predict(test_data)
+ assert len(
+ res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+ return res
+
+
+def testcase_1_1():
+ test_with_n_clusters(data_1, KMeans)
+ return True
+
+
+def testcase_1_2():
+ res = test_with_n_clusters(data_2, KMeans)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+ test_with_n_clusters(data_1, GaussianMixture)
+ return True
+
+
+def testcase_2_2():
+ res = test_with_n_clusters(data_3, GaussianMixture)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+ testcases = [
+ ["KMeans-1", testcase_1_1, 4],
+ ["KMeans-2", testcase_1_2, 4],
+ # ["KMeans-3", testcase_1_3, 4],
+ # ["KMeans-4", testcase_1_4, 4],
+ # ["KMeans-5", testcase_1_5, 4],
+ ["GMM-1", testcase_2_1, 4],
+ ["GMM-2", testcase_2_2, 4],
+ # ["GMM-3", testcase_2_3, 4],
+ # ["GMM-4", testcase_2_4, 4],
+ # ["GMM-5", testcase_2_5, 4],
+ ]
+ sum_score = sum([case[2] for case in testcases])
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except Exception as e:
+ if err_report:
+ print("Error [{}] occurs in {}".format(str(e), case[0]))
+ res = 0
+ score += res
+ print("+ {:14} {}/{}".format(case[0], res, case[2]))
+ print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1 and sys.argv[1] == "--report":
+ test_all(True)
+ else:
+ test_all()