diff --git a/assignment-3/submission/18307130116/README.md b/assignment-3/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1d5392c675488da60bf167158cd3e6975362c4d4
--- /dev/null
+++ b/assignment-3/submission/18307130116/README.md
@@ -0,0 +1,244 @@
+# Assignment 3 Lab Report
+
+[toc]
+
+## Notes
+
+Every function in this assignment carries a detailed Google-style docstring, so this report omits per-API documentation except for the key functions.
+
+## Model Implementation
+
+### Data Preprocessing
+
+To avoid extreme values, each dimension of the raw data is projected onto the interval [-10, 10]. A uniform rescaling does not change cluster assignments under Euclidean distance, and it prevents overflow in the exponentials used to compute probability densities.
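A minimal sketch of this rescaling (the helper name `rescale` is illustrative; the report's implementation is `data_preprocess` in `source.py`):

```python
import numpy as np

def rescale(data, limit=10.0):
    """Project every dimension of `data` into [-limit, limit] with one
    uniform scale factor, so Euclidean cluster structure is preserved."""
    edge = max(abs(data.max()), abs(data.min()))
    return data * limit / edge

points = np.array([[250.0, -40.0], [-500.0, 120.0]])
scaled = rescale(points)  # largest magnitude becomes exactly 10
```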
+
+### Kmeans
+
+#### Principle and Convergence Criterion
+
+The principle of Kmeans is straightforward: after randomly selecting some initial cluster centers, compute each point's distance to every center, assign it to the nearest cluster, update the centers, and repeat until convergence.
+
+This assignment uses center stability as the convergence criterion: if an update leaves all cluster centers unchanged, the algorithm is considered converged. Since Kmeans is sensitive to initialization and may oscillate among a few points under some random starts, an upper bound on the number of iterations is also imposed.
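The loop described above can be sketched as follows (a simplified NumPy version with illustrative names, not the report's `KMeans` class):

```python
import numpy as np

def kmeans_step(data, centers):
    """One assignment + update pass; returns the recomputed centers."""
    # assign each point to the nearest center (squared Euclidean distance)
    labels = np.argmin(((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # new center = mean of the points assigned to it
    return np.array([data[labels == k].mean(axis=0) for k in range(len(centers))])

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
centers = data[rng.choice(len(data), 2, replace=False)]
for epoch in range(100):                   # hard cap guards against oscillation
    new_centers = kmeans_step(data, centers)
    if np.allclose(new_centers, centers):  # centers unchanged -> converged
        break
    centers = new_centers
```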
+
+#### Model Structure
+
+Besides the required `init`, `fit`, and `predict` APIs, three helpers were added for convenience: `get_class` returns the cluster of a given point under the current centers, `get_distance` returns the distance between two points, and `update_center` recomputes the cluster centers from the current assignment.
+
+### GMM
+
+#### Principle and Pitfalls
+
+The Gaussian mixture model assumes the data distribution is a superposition of Gaussians, so the whole procedure fits the sample points with several Gaussian components.
+
+GMM is likewise **very sensitive to initialization**, and it also demands a reasonable amount of data. With an extremely small sample, a fitted Gaussian degenerates into a tall, narrow spike resembling an impulse function; small perturbations then cause large swings in the probability values and hence in the convergence behavior of the log-likelihood. Initialization therefore matters a great deal.
+
+GMM is trained with the EM algorithm. The E step fixes the means and covariances, evaluates the multivariate Gaussian densities, and normalizes their sum; variationally, this fits the posterior so that the likelihood equals the evidence lower bound. Given that posterior estimate, the M step optimizes the Gaussian parameters, which in turn increases the likelihood.
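A minimal 1-D sketch of one EM pass for a two-component mixture (illustrative only; the report's `GaussianMixture` works on multivariate data):

```python
import numpy as np

def em_step(x, pi, mu, var):
    """One EM pass for a 1-D two-component Gaussian mixture."""
    # E step: responsibilities gamma[i, k] proportional to pi_k * N(x_i | mu_k, var_k)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: re-estimate weights, means, and variances from the responsibilities
    nk = gamma.sum(axis=0)
    pi, mu = nk / len(x), (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 1, 300)])
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(30):
    pi, mu, var = em_step(x, pi, mu, var)
```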
+
+#### Initialization
+
+As noted above, GMM is very sensitive to its initial values. Especially with few samples, a small jitter in the variables triggers a large change in the likelihood; in practice, with a likelihood-change threshold as the stopping rule, training sometimes failed to terminate at all.
+
+After reading the GMM source code in sklearn and consulting related material, two initialization strategies were identified:
+
+* sample from the training data and use the sample mean as the initial value;
+* pre-train with Kmeans and use its cluster centers as the initial Gaussian means.
+
+This assignment adopts the second strategy, which in practical tests noticeably reduced the oscillation observed before.
+
+#### Model Structure
+
+Beyond the required interface, two extra APIs were added: `point_probability` computes a point's joint probability under every component, adding $e^{-200}$ during the computation as a smoothing term to avoid exact zeros; `calculate` evaluates the probability of a point under a single component.
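The smoothing idea can be illustrated as follows (the helper name is hypothetical; the constant $e^{-200}$ matches the report):

```python
import numpy as np

def smoothed_density(raw):
    """Add exp(-200) to every component density so normalization never
    divides by zero, even when all raw densities underflow to 0."""
    return np.asarray(raw, dtype=float) + np.exp(-200)

gamma = smoothed_density([0.0, 0.0])   # degenerate point: all densities zero
weights = gamma / gamma.sum()          # still normalizes to a valid distribution
```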
+
+### ClusteringAlgorithm
+
+#### Principle
+
+Built on top of Kmeans clustering, `ClusteringAlgorithm`'s main task is to select a suitable number of clusters K; two approaches are described below.
+
+#### K Selection Methods
+
+##### Rule of Thumb
+
+A common empirical formula for K is $\sqrt{n/2}$; in this assignment it serves as the upper bound for the Elbow method.
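As a quick sketch (the function name is illustrative):

```python
import math

def empirical_k(n):
    """Rule-of-thumb upper bound for the cluster count: sqrt(n / 2)."""
    return max(1, round(math.sqrt(n / 2)))
```

For instance, the 3 × 800-point datasets used in the experiments below give `empirical_k(2400) == 35` as the Elbow search bound.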
+
+##### Elbow
+
+The Elbow method looks for the knee of the distortion curve. For the plot below, the knee is taken as the most reasonable point; there it falls at K = 4 while the true value is 3, so it lands near the correct answer.
+
+
+
+At the knee the clustering is not split too finely and the error is already small, so it is generally considered a good K. To find the knee automatically, the array `dis` records the total distance for each k, and the knee is declared once the previous drop is smaller than twice the following drop. Concrete selections appear in the experiments section under *Automated Clustering Results*.
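The knee rule can be sketched as follows, assuming `dis[i]` holds the total distance for k = i + 1 clusters (names are illustrative, not the report's exact code):

```python
def find_knee(dis):
    """Return the first k whose preceding drop is smaller than twice the
    following drop (the report's heuristic); fall back to the upper bound."""
    for i in range(1, len(dis) - 1):
        if dis[i - 1] - dis[i] < 2 * (dis[i] - dis[i + 1]):
            return i + 1          # dis[i] corresponds to k = i + 1
    return len(dis)               # no knee found within the search range
```

On a curve like `[100, 40, 15, 13, 11.5, 10.5]` this picks k = 4, matching the figure above where the detector chose 4 against a true value of 3.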
+
+##### Canopy
+
+###### Principle
+
+Canopy is a coarse-grained clustering method. Given two thresholds t1 > t2 and a randomly chosen initial point, under some distance metric the points closer than t2 to that point are unlikely to become cluster centers and are deleted from the dataset, while points closer than t1 are likely to belong to the same cluster. Since this run only needs the number of clusters Canopy produces, cluster membership itself is ignored. The process repeats until the dataset is empty.
+
+###### Implementation and Thresholds
+
+Canopy itself is simple to implement; the main question is how to set t1 and t2. To obtain K automatically, both thresholds are derived directly from the data dimensionality: since the preprocessed data lies in [-10, 10], t2 is set to $2\sqrt{d}$ (with $d$ the number of dimensions) and t1 to $2t_2$. Clustering until every single point is consumed makes the result vulnerable to outliers, and the Canopy pre-pass only needs the final K, not each point's assignment. The stopping rule is therefore changed to: **terminate Canopy and return once 80% of the points have been clustered**. The 80% threshold follows the 80/20 rule, assuming that 80% of the points reflect the overall shape of the data faithfully. Results appear in the experiments section under *Automated Clustering Results*.
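A compact sketch of this Canopy pre-pass under the stated thresholds (illustrative and vectorized differently from the report's `Canopy` class):

```python
import numpy as np

def canopy_k(data, frac=0.8):
    """Estimate K with a Canopy pre-pass. For data scaled into [-10, 10],
    t2 = 2*sqrt(d) (d = number of dimensions); t1 = 2*t2 is not needed here
    because only the number of canopies is wanted, not their membership.
    Stop once `frac` of the points have been consumed (the 80/20 rule)."""
    rng = np.random.default_rng(0)
    t2 = 2 * np.sqrt(data.shape[1])
    rest, n, k = data.copy(), len(data), 0
    while len(rest) >= (1 - frac) * n:
        center = rest[rng.integers(len(rest))]
        dist = np.sqrt(((rest - center) ** 2).sum(axis=1))
        rest = rest[dist >= t2]   # points within t2 cannot become new centers
        k += 1
    return k
```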
+
+## Experiments
+
+### Data Generation and Visualization
+
+Data generation reuses the API from [Assignment1](https://gitee.com/fnlp/prml-21-spring/blob/master/assignment-1/submission/18307130116/source.py#L94), now with added docstrings and a `method` argument that supports log-normal as well as high-dimensional Gaussian data. Visualization builds on [Assignment1](https://gitee.com/fnlp/prml-21-spring/blob/master/assignment-1/submission/18307130116/source.py#L129) with added docstrings plus a new `color` parameter: since a real clustering task starts without knowing the class distribution, plotting without colors is sometimes the reasonable choice.
+
+The corresponding APIs are `data_generate_and_save` for generating and saving data, `data_load` for loading it, and `visualize` for plotting.
+
+### Evaluating Clustering Quality
+
+Besides direct visual inspection, the Silhouette Coefficient (SC) is used to measure clustering quality. For each point, SC compares the mean distance to the other points in its own cluster against the smallest mean distance to the points of any other cluster; the resulting score lies in [-1, 1], and the closer to 1, the better.
+
+The `compute_SC` API needs only the points and their labels to compute the score.
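A minimal sketch of the SC computation (an O(n²) toy version; the report's implementation is `compute_SC` in `source.py`):

```python
import numpy as np

def silhouette(data, labels):
    """Mean silhouette score: per point, (b - a) / max(a, b), where a is the
    mean distance to its own cluster and b the smallest mean distance to
    any other cluster."""
    dist = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    scores = []
    for i, lab in enumerate(labels):
        same = labels == lab
        a = dist[i, same].sum() / max(same.sum() - 1, 1)
        b = min(dist[i, labels == other].mean() for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```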
+
+### Basic Experiments
+
+#### Regular Case
+
+This experiment generates a simple dataset, visualizes the Kmeans and GMM clustering results, and compares them with the ground truth.
+
+The parameters are listed in the table below:
+
+| Class | Mean | Covariance | Points |
+| ---- | ------- | ---------------- | ---- |
+| 1 | (1, -7) | [[1, 0], [0, 1]] | 800 |
+| 2 | (1, -4) | [[1, 0], [0, 1]] | 800 |
+| 3 | (1, 0) | [[1, 0], [0, 1]] | 800 |
+
+The uncolored plot and the ground-truth plot are shown below
+
+
+
+Kmeans result (left) and GMM result (right) are shown below
+
+
+
+As the figures show, both models recover the clusters correctly on the whole, differing only in fine detail.
+
+Computing the SC gives Kmeans = 0.497 and GMM = 0.497; the two are nearly identical, with Kmeans marginally ahead.
+
+#### Extreme Values in the Data
+
+This experiment introduces extreme values by making one dimension much larger, with the following parameters:
+
+| Class | Mean | Covariance | Points |
+| ---- | --------- | ------------------ | ---- |
+| 1 | (1, -700) | [[1, 0], [0, 100]] | 800 |
+| 2 | (1, -400) | [[1, 0], [0, 100]] | 800 |
+| 3 | (1, 0) | [[1, 0], [0, 1]] | 800 |
+
+This test checks whether extreme values disrupt the computation; in practice it amounts to the case where the classes are very far apart.
+
+The uncolored plot and the ground-truth plot are shown below
+
+
+
+Kmeans result (left) and GMM result (right) are shown below
+
+
+
+Computing the SC gives Kmeans = 0.973 and GMM = 0.973, exactly equal
+
+#### Heavily Overlapping Clusters
+
+The tests above covered mildly overlapping and non-overlapping data; for further comparison, this one uses heavily overlapping clusters.
+
+The parameters are:
+
+| Class | Mean | Covariance | Points |
+| ---- | ------ | ---------------- | ---- |
+| 1 | (1, 1) | [[2, 0], [0, 2]] | 800 |
+| 2 | (1, 2) | [[2, 0], [0, 2]] | 800 |
+| 3 | (1, 0) | [[2, 0], [0, 2]] | 800 |
+
+The results are shown below.
+
+The uncolored plot and the ground-truth plot:
+
+
+
+Kmeans result (left) and GMM result (right) are shown below
+
+
+
+Computing the SC gives Kmeans = 0.323 and GMM = 0.310, with Kmeans slightly ahead. The SC of the original labels under this heavy overlap is -0.003, so the clusterings actually score better than the true labels. In practice this can be deceptive: one should understand the data distribution before reaching for a clusterer rather than applying it blindly, or a "well-scored" yet wrong result is easy to obtain.
+
+### GMM Supplementary Experiment
+
+Since the GMM here takes its initial values from Kmeans, and the value of a GMM goes beyond clustering to estimating the parameters of a superposition of Gaussians, it is worth checking how non-Gaussian data is classified under GMM. Purely as a supplementary observation, a log-normal dataset is clustered below.
+
+
+
+
+
+Kmeans result (left) and GMM result (right) are shown below
+
+
+
+Computing the SC gives Kmeans = 0.677 and GMM = 0.620, while the original labels score -0.035. GMM interprets the log-normal data as heavily overlapping Gaussians, without much damage to the practical result.
+
+### Automated Clustering Results
+
+The automated `ClusteringAlgorithm` is built on Kmeans, so the experiments below focus solely on K selection and do not repeat the Kmeans clustering-quality tests.
+
+For one parameter set (true K = 3; parameters in the table below), the ELBOW- and Canopy-based methods are run for several rounds. One round is visualized below: plus signs mark test data, and the colored version is on the right.
+
+
+
+| Class | Mean | Covariance | Points |
+| ---- | -------- | ---------------------- | ---- |
+| 1 | (1, 2) | [[73, 0], [0, 22]] | 800 |
+| 2 | (16, -5) | [[21.2, 0], [0, 32.1]] | 200 |
+| 3 | (10, 22) | [[10, 5], [5, 10]] | 1000 |
+
+| Round | K from ELBOW | K from Canopy |
+| ---- | ---------------- | ----------------- |
+| 1 | 5 | 4 |
+| 2 | 4 | 4 |
+| 3 | 4 | 5 |
+| 4 | 3 | 5 |
+| 5 | 4 | 4 |
+
+Both ELBOW and Canopy produce K values close to the truth and automate the choice well. ELBOW has to run $\sqrt{n/2}$ rounds of Kmeans while Canopy needs only a single pass, so Canopy is considerably faster. ELBOW fluctuates more, and whether it reaches the optimum depends on the random initialization inside Kmeans, but on the whole both perform well.
+
+### Canopy Robustness Tests
+
+As shown above, ELBOW and Canopy produce similar K values and Canopy is far faster, but Canopy depends on the thresholds t1 and t2, which in this assignment are identical for all data of the same dimensionality; robustness tests are therefore added. ELBOW brute-forces over K to obtain the best result, so its robustness is predictably good and the Kmeans-based test is omitted.
+
+To keep the thresholds identical, several two-dimensional datasets were tested; the parameters and results are tabulated below. The third group is almost fully separated and the fourth heavily overlapping, to probe the practical behavior.
+
+| Round | Means | Covariances | Points per class | True K | Result |
+| ---- | ------------------------------ | ----------------------------------------------------------- | --------------- | ---------- | ---- |
+| 1 | [(1, -7), (1, -4), (1, 0)] | identity | [800, 800, 800] | 3 | 3 |
+| 2 | [(1, -7), (1, -4), (1, 0)] | [[10, 0],[0, 1]] [[1, 0], [0, 10]] [[2, 0], [6, 5]] | [800, 800, 800] | 3 | 3 |
+| 3 | [(10, 10), (-10, -10), (5, 0)] | identity | [800, 800, 800] | 3 | 3 |
+| 4 | [(1, 1), (1, 2), (1, 0)] | [[2, 0],[0, 2]] [[2, 0], [0, 2]] [[2, 0], [0, 2]] | [800, 800, 800] | 3 | 4 |
+
+The heavily overlapping and the fully separated datasets are visualized below:
+
+
+
+Even on heavily overlapping data, the Canopy method still performs well: the model is robust to the distribution.
+
+Next, robustness to per-class point counts is tested, keeping the first parameter set from the table above and changing only the counts:
+
+| Round | Points | Result |
+| ---- | ---------------- | ---- |
+| 1 | [80, 1000, 80] | 1 |
+| 2 | [200, 1000, 10] | 2 |
+| 3 | [200, 1000, 100] | 3 |
+
+Point counts clearly have a large effect on Canopy. Since the implementation uses the 80/20 rule to ignore a small number of outliers, when one class vastly outnumbers the others, the smaller classes are simply ignored, and the closer they are the more likely they are to be dropped; this also matches the behavior one expects when clustering in practice.
+
+Going further, K selection with more classes is tested; [a, b] means that over multiple runs the chosen K always fell within the interval [a, b]:
+
+| Round | Means | Covariances | Points per class | True K | Result |
+| ---- | ----------------------------------- | ------------------------------------------------------------ | --------------- | ---------- | ----- |
+| 1 | [(1, -7), (1, -4), (1, 0), (-2, 0)] | identity | [800, 800, 800] | 4 | [3,4] |
+| 2 | [(1, -7), (1, -4), (1, 0), (-2, 0)] | [[10, 0],[0, 1]] [[1, 0], [0, 10]] [[2, 0], [6, 5]] [[3, 0], [1, 5]] | [800, 800, 800] | 4 | [3,4] |
+
+#### Summary
+
+In this part, Canopy proved robust to the data distribution, performing well under both heavy and light overlap, but less robust to per-class point counts: classes with few points are easily ignored. This follows from the implementation and matches expectations.
\ No newline at end of file
diff --git "a/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" "b/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png"
new file mode 100644
index 0000000000000000000000000000000000000000..c369d0d437a06ee47a31c8aa28fc5b2360e17ea0
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png" "b/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png"
new file mode 100644
index 0000000000000000000000000000000000000000..5577014148822a67b7cff48ba5e73e5e5ddcff73
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..40088f3cb594956cd7f3e9ccc6705ccf45827db3
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..7294753f3d2219aece852691320d7f64611eb510
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" "b/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png"
new file mode 100644
index 0000000000000000000000000000000000000000..d1aa314ef755065b8965ab3b811c8af5325ee35e
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..ecff0a29f27fcd79d94e5f8195b6752aef9fba0b
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..dab91ad6746eecb423c695a38749c1bb5fc0de42
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..55b8377dc1f11b74d71c0eadd77dd40a5c4df74d
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ
diff --git a/assignment-3/submission/18307130116/img/elbow.png b/assignment-3/submission/18307130116/img/elbow.png
new file mode 100644
index 0000000000000000000000000000000000000000..f5a2e5e471a9f91630b179c9d9fa2fc478967176
Binary files /dev/null and b/assignment-3/submission/18307130116/img/elbow.png differ
diff --git "a/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..79e2ea68d5a4d211a87599c3267bd2e48cf0daa4
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..90c9da70c56b6adb825c55a7372aa37693a66dec
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png" "b/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..572bf49a5ceaa1b1ce3e2493d81be2553c621e6d
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..d5c27c4b1854debe84105d77a636ea0119368383
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..1fb0fc17af5f4239734c8572df31d2f4ae68537f
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png" "b/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png"
new file mode 100644
index 0000000000000000000000000000000000000000..144f73c866afc1c56cc6d50b5e9c01b342d8da2f
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png" "b/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png"
new file mode 100644
index 0000000000000000000000000000000000000000..6d97fae31b24e72ecb5f4b19c52faa03636fbdf8
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..dab91ad6746eecb423c695a38749c1bb5fc0de42
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..35a9a8754cd5d6d4e8ca3b3d8799f0110dc46f68
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..74fafa0211602f3c8c6682127e3196cfa97bde11
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..2fe7b8b7aaae54eebe4001233150cba346870577
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..a9ff7f42e30d93a44bff57b9facb19111a2b75bc
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png" differ
diff --git a/assignment-3/submission/18307130116/source.py b/assignment-3/submission/18307130116/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..5487cd0c73af70ba038bd590a54d5cfa3c8c080a
--- /dev/null
+++ b/assignment-3/submission/18307130116/source.py
@@ -0,0 +1,649 @@
+import numpy as np
+import random
+import matplotlib.pyplot as plt
+import math
+import matplotlib.cm as cm
+from numpy.core.fromnumeric import argmin
+
+def data_preprocess(data):
+ """preprocess the data
+
+ use the range of data to transform the data
+
+ Args:
+ data(numpy.ndarray):raw data
+
+ Return:
+ numpy.ndarray: data after process
+ """
+
+ edge = max(abs(data.max()), abs(data.min()))
+ result_data = (data*10)/edge
+ return result_data
+
+
+def compute_SC(data, label, class_num):
+ """compute the Silhouette Coefficient
+
+    Args:
+        data(numpy.ndarray): data for compute
+        label(list): label for every point
+        class_num(int): the number of clusters
+
+    Return:
+        float: the value of the Silhouette Coefficient
+ """
+ point_dict = {}
+ data = data_preprocess(data)
+ if len(data.shape) == 1:
+ dimention = 1
+ else:
+ dimention = data.shape[1]
+ for iter in range(class_num):
+ point_dict[iter] = []
+ for iter in range(len(data)):
+ point_dict[label[iter]].append(data[iter])
+ result = 0
+ for iter in range(len(data)):
+ now_point = data[iter]
+ now_point = now_point.reshape(-1, 1)
+ inner_dis = 0
+ now_label = label[iter]
+ for other in point_dict[now_label]:
+ other = other.reshape(-1, 1)
+ temp = 0
+ for i in range(dimention):
+ temp = temp + (now_point[i]-other[i]) ** 2
+ inner_dis = inner_dis + temp**0.5
+ inner_dis = inner_dis / (len(point_dict[now_label]) - 1)
+ out_dis_min = math.inf
+ for label_iter in range(class_num):
+ if label_iter == now_label:
+ continue
+ out_dis = 0
+ for other in point_dict[label_iter]:
+ other = other.reshape(-1, 1)
+ temp = 0
+ for i in range(dimention):
+ temp = temp + (now_point[i]-other[i]) ** 2
+ out_dis = out_dis + temp**0.5
+ out_dis = out_dis / len(point_dict[label_iter])
+ if out_dis < out_dis_min:
+ out_dis_min = out_dis
+ result = result + (out_dis_min - inner_dis)/max(out_dis_min, inner_dis)
+ result = result / len(data)
+ return result
+
+
+def data_generate_and_save(class_num, mean_list, cov_list, num_list, save_path = "", method = "Gaussian"):
+    """generate data that obeys a Gaussian or log-normal distribution
+
+ label will be saved in the meantime.
+
+ Args:
+ class_num(int): the number of class
+ mean_list(list): mean_list[i] stand for the mean of class[i]
+ cov_list(list): similar to mean_list, stand for the covariance
+ num_list(list): similar to mean_list, stand for the number of points in class[i]
+ save_path(str): the data storage path, end with slash.
+        method(str): the distribution the data will follow; supports "Gaussian" and "lognormal"
+ """
+ if method == "lognormal":
+ data = np.random.lognormal(mean_list[0], cov_list[0], num_list[0])
+ elif method == "Gaussian":
+ data = np.random.multivariate_normal(mean_list[0], cov_list[0], (num_list[0],))
+ label = np.zeros((num_list[0],),dtype=int)
+ total = num_list[0]
+
+    for iter in range(1, class_num):
+        if method == "lognormal":
+            temp = np.random.lognormal(mean_list[iter], cov_list[iter], num_list[iter])
+        elif method == "Gaussian":
+            temp = np.random.multivariate_normal(mean_list[iter], cov_list[iter], (num_list[iter],))
+ label_temp = np.ones((num_list[iter],),dtype=int)*iter
+ data = np.concatenate([data, temp])
+ label = np.concatenate([label, label_temp])
+ total += num_list[iter]
+
+ idx = np.arange(total)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ train_num = int(total * 0.8)
+ train_data = data[:train_num, ]
+ test_data = data[train_num:, ]
+ train_label = label[:train_num, ]
+ test_label = label[train_num:, ]
+ np.save(save_path+"data.npy", ((train_data, train_label), (test_data, test_label)))
+
+
+def data_load(path = ""):
+ """load data from path given
+
+ data should follow the format(data, label)
+
+ Args:
+ path(str): the path data stored in
+
+ Return:
+ tuple: stand for the data and label
+ """
+
+ (train_data, train_label), (test_data, test_label) = np.load(path+"data.npy",allow_pickle=True)
+ return (train_data, train_label), (test_data, test_label)
+
+
+def visualize(data, label=None, dimention = 2, class_num = 1, test_data=np.array([None]), color=False):
+ """draw a scatter
+
+ if you want to distinguish class with color, parameter color should be True.
+ It will distribute color for each class automatically.
+ The test data will be marked with the label of plus
+
+ Args:
+ data(numpy.ndarray):train dataset
+ label(numpy.ndarray):label for train data
+ class_num(int):the number of clusters, only used when color = True
+ test_data(numpy.ndarray): test dataset
+        color(boolean): True for a different color per class, otherwise False
+        dimention(int): the data dimension, should be 1 or 2
+ """
+
+ if color == True:
+ data_x = {}
+ data_y = {}
+ for iter in range(class_num):
+ data_x[iter] = []
+ data_y[iter] = []
+ if dimention == 2:
+ for iter in range(len(label)):
+ data_x[label[iter]].append(data[iter, 0])
+ data_y[label[iter]].append(data[iter, 1])
+ elif dimention == 1:
+ for iter in range(len(label)):
+ data_x[label[iter]].append(data[iter])
+ data_y[label[iter]].append(0)
+ colors = cm.rainbow(np.linspace(0, 1, class_num))
+ for class_idx, c in zip(range(class_num), colors):
+ plt.scatter(data_x[class_idx], data_y[class_idx], color=c)
+ if(test_data.any() != None):
+ if dimention == 2:
+ plt.scatter(test_data[:, 0], test_data[:, 1], marker='+')
+ elif dimention == 1:
+ plt.scatter(test_data, np.zeros(len(test_data)), marker='+')
+ else:
+ if dimention == 2:
+ plt.scatter(data[:, 0], data[:, 1], marker='o')
+ elif dimention == 1:
+ plt.scatter(data, np.zeros(len(data)), marker='o')
+ if(test_data.any() != None):
+ if dimention == 2:
+ plt.scatter(test_data[:, 0], test_data[:, 1], marker='+')
+ elif dimention == 1:
+ plt.scatter(test_data, np.zeros(len(test_data)), marker='+')
+ plt.show()
+
+
+class Canopy:
+ """Canopy clustering method
+
+    The model will init the thresholds automatically according to
+    the dimension of the dataset.
+    A low-accuracy method to get the cluster number K for the whole dataset
+
+    Attribute:
+        t1(int): stand for the first threshold
+        t2(int): stand for the second threshold
+        dimention: dimension of the data
+ """
+ def __init__(self):
+ """init the whole model
+ """
+ self.t1 = 0
+ self.t2 = 0
+ self.dimention = 0
+
+ def get_distance(self, point1, point2, method="Euclidean"):
+ """Compute the distance between two points
+
+        Only the Euclidean metric is currently supported; both points are
+        reshaped to column vectors and compared over the model's stored
+        dimension. More distance methods may be supported in the future.
+
+ Args:
+ point1(numpy.ndarray):One point for compute
+ point2(numpy.ndarray):The other point for compute
+ method(str):The way to compute distance
+
+ Return:
+ float: distance between two points
+ """
+
+ dis = 0
+ point1 = point1.reshape(-1, 1)
+ point2 = point2.reshape(-1, 1)
+ if method == "Euclidean":
+ for iter in range(self.dimention):
+ dis += (point1[iter]-point2[iter]) ** 2
+ return dis ** 0.5
+
+ def fit(self, train_data):
+ """train the model
+
+ Args:
+ train_data(numpy.ndarray): dataset for training
+
+ Return:
+            list: tuples of the form (center_point, [nearby_points])
+ """
+
+ train_data = data_preprocess(train_data)
+ train_num = train_data.shape[0]
+ if len(train_data.shape) == 1:
+ self.dimention = 1
+ else:
+ self.dimention = train_data.shape[1]
+ self.t2 = 2 * self.dimention**0.5
+ self.t1 = 2 * self.t2
+ result = []
+ while len(train_data) >= 0.2 * train_num:
+ idx = random.randint(0, len(train_data) - 1)
+ center = train_data[idx]
+ point_list = []
+ point_need_delete = []
+ train_data = np.delete(train_data, idx, 0)
+ for iter in range(len(train_data)):
+ dis = self.get_distance(train_data[iter], center)
+ if dis < self.t2:
+ point_need_delete.append(iter)
+ elif dis < self.t1:
+ point_list.append(train_data[iter])
+ result.append((center, point_list))
+ train_data = np.delete(train_data, point_need_delete, 0)
+ return result
+
+
+class KMeans:
+ """Kmeans Clustering Algorithm
+
+ Attributes:
+ n_clusters(int): number of clusters
+ cluster_center(list): center of each cluster
+ class_point_dict(dict): point dict of each cluster
+        dimention(int): dimension of the data
+ """
+
+ def __init__(self, n_clusters):
+ """Inits the clusterer
+
+        Init the point dict and cluster centers as empty;
+        set the cluster count from the argument
+
+ Args:
+ n_clusters(int): number of clusters
+ """
+
+ self.n_clusters = n_clusters
+ self.cluster_center = []
+ self.class_point_dict = {}
+ self.dimention = 0
+
+ def fit(self, train_data):
+ """Train the clusterer
+
+        Get the dimension of the data and randomly select the initial cluster centers.
+        Label each point with the label of the cluster center nearest to it.
+        Loop until the cluster centers stop changing (capped at 100 epochs)
+
+ Args:
+ train_data(numpy.ndarray): training data for this task
+ """
+
+ train_data = data_preprocess(train_data)
+ train_data_num = train_data.shape[0]
+ if len(train_data.shape) == 1:
+ self.dimention = 1
+ else:
+ self.dimention = train_data.shape[1]
+ for iter in range(self.n_clusters):
+ self.class_point_dict[iter] = []
+ idx = random.sample(range(train_data_num), self.n_clusters)
+ for iter in range(self.n_clusters):
+ self.cluster_center.append(train_data[idx[iter]])
+ for iter in train_data:
+ label = self.get_class(iter)
+ self.class_point_dict[label].append(iter)
+ epoch = 0
+ while not self.update_center() and epoch < 100:
+ for label in range(self.n_clusters):
+ self.class_point_dict[label] = []
+ for iter in train_data:
+ label = self.get_class(iter)
+ self.class_point_dict[label].append(iter)
+ epoch = epoch + 1
+
+ def predict(self, test_data):
+ """Predict label
+
+ Args:
+ test_data(numpy.ndarray): Data for test
+
+ Return:
+ numpy.ndarray: each label of test_data
+ """
+
+ test_data = data_preprocess(test_data)
+ result = np.array([])
+ for iter in test_data:
+ label = self.get_class(iter)
+ result = np.r_[result, np.array([label])]
+ return result
+
+ def get_class(self, point):
+ """Get the point class according to center points
+
+ Args:
+            point(numpy.ndarray): the point to classify
+
+ Return:
+            int: index of the nearest cluster center, in range(n_clusters)
+ """
+
+ min_class = 0
+ for iter in range(self.n_clusters):
+ temp = self.get_distance(self.cluster_center[iter], point)
+ if iter == 0:
+ min_dis = temp
+ else:
+ if min_dis > temp:
+ min_dis = temp
+ min_class = iter
+ return min_class
+
+ def get_distance(self, point1, point2, method="Euclidean"):
+ """Compute the distance between two points
+
+        Only the Euclidean metric is currently supported; both points are
+        reshaped to column vectors and compared over the model's stored
+        dimension. More distance methods may be supported in the future.
+
+ Args:
+ point1(numpy.ndarray):One point for compute
+ point2(numpy.ndarray):The other point for compute
+ method(str):The way to compute distance
+
+ Return:
+ float: distance between two points
+ """
+
+ dis = 0
+ point1 = point1.reshape(-1, 1)
+ point2 = point2.reshape(-1, 1)
+ if method == "Euclidean":
+ for iter in range(self.dimention):
+ dis += (point1[iter]-point2[iter]) ** 2
+ return dis ** 0.5
+
+ def update_center(self):
+ """use the class_point_dict to update the cluster_center
+
+ Return:
+ boolean:stand for whether center update or not
+ """
+
+ result = True
+ for iter in range(self.n_clusters):
+ temp = np.zeros(self.dimention)
+ for point in self.class_point_dict[iter]:
+ temp = temp + point
+ temp = temp / len(self.class_point_dict[iter])
+ result = result and (temp == self.cluster_center[iter]).all()
+ self.cluster_center[iter] = temp
+ return result
+
+
+class GaussianMixture:
+ """Gaussian mixture model for clustering
+
+ Attributes:
+ n_clusters(int): number of clusters
+ pi(numpy.ndarray): probability for all the clusters
+ cov(dict): covariance matrix for each cluster
+ mean(dict): mean matrix for each cluster
+ gamma(numpy.ndarray): gamma in EM algorithm
+        epsilon(float): lower bound for the stopping threshold
+        dimention(int): dimension of the data
+ """
+
+ def __init__(self, n_clusters):
+ """init the parameter for this model
+
+        Init the mixing weights uniformly.
+ Init cluster number with parameter
+
+ Args:
+ n_clusters(int): number of clusters
+ """
+
+ self.pi = np.ones(n_clusters)/n_clusters
+ self.n_clusters = n_clusters
+ self.cov = {}
+ self.mean = {}
+ self.gamma = None
+ self.epsilon = 1e-20
+ self.dimention = 0
+
+ def fit(self, train_data):
+ """Train the model, using EM
+
+        Init the mean and covariance of each class first: the initial covariance
+        matrix of every cluster is the identity matrix, and Kmeans initializes the means.
+ In E step, calculate the probability for every point in every cluster.
+ In M step, update the parameter for every distribution
+
+ Args:
+ train_data(numpy.ndarray): data for train
+ """
+
+ k_model = KMeans(self.n_clusters)
+ k_model.fit(train_data)
+ train_data = data_preprocess(train_data)
+ train_num = train_data.shape[0]
+ if len(train_data.shape) == 1:
+ self.dimention = 1
+ else:
+ self.dimention = train_data.shape[1]
+ for iter in range(self.n_clusters):
+ self.mean[iter] = k_model.cluster_center[iter]
+ self.cov[iter] = np.ones(self.dimention)
+ self.cov[iter] = np.diag(self.cov[iter])
+ self.gamma = np.empty([train_num, self.n_clusters])
+ for i in range(20):
+ #E step
+ for iter in range(train_num):
+ temp = np.array(self.point_probability(train_data[iter]))
+ self.gamma[iter, :] = temp
+ log_h_w = self.gamma
+ self.gamma = self.gamma/self.gamma.sum(axis=1).reshape(-1, 1)
+ #termination condition
+
+ #M step
+ self.pi = np.sum(self.gamma, axis=0)/train_num
+ for label in range(self.n_clusters):
+ mean = np.zeros(self.dimention)
+ cov = np.zeros([self.dimention, self.dimention])
+ for iter in range(train_num):
+ mean += self.gamma[iter, label] * train_data[iter]
+ point = train_data[iter].reshape(-1, 1)
+ label_mean = self.mean[label].reshape(-1, 1)
+ rest = point - label_mean
+ cov += self.gamma[iter, label] * np.matmul(rest, rest.T)
+ self.mean[label] = mean/np.sum(self.gamma, axis=0)[label]
+ self.cov[label] = cov/np.sum(self.gamma, axis=0)[label]
+
+ def predict(self, test_data):
+ """Predict label
+
+ Args:
+ test_data(numpy.ndarray): Data for test
+
+ Return:
+ numpy.ndarray: each label of test_data
+ """
+
+ test_data = data_preprocess(test_data)
+ edge = max(abs(test_data.max()), abs(test_data.min()))
+ test_data = (test_data*10)/edge
+ result = []
+ for iter in test_data:
+ temp = self.point_probability(iter)
+ label = temp.index(max(temp))
+ result.append(label)
+ return np.array(result)
+
+ def point_probability(self, point):
+ """Calculate the weighted probability of a point under every Gaussian
+
+ Args:
+ point(numpy.ndarray): the point to be evaluated
+
+ Return:
+ list: weighted probability for every distribution
+ """
+
+ result = []
+ for iter in range(self.n_clusters):
+ result.append(self.calculate(point, iter) * self.pi[iter])
+ return result
+
+ def calculate(self, point, iter):
+ """Calculate the probability of the point under the iter-th distribution
+
+ Args:
+ point(numpy.ndarray): the point to be evaluated
+ iter(int): the index of the distribution
+
+ Return:
+ float: the probability of the point
+ """
+
+ point = point.reshape(-1, 1)
+ mean = self.mean[iter].reshape(-1, 1)
+ cov = self.cov[iter]
+ D = self.dimention
+ coef = 1/((2*math.pi) ** (D/2) * np.linalg.det(cov)**0.5)
+ exponent = -0.5 * np.matmul(np.matmul((point - mean).T, np.linalg.inv(cov)), (point - mean))
+ # add exp(-200) so the probability never becomes exactly zero
+ result = coef * np.exp(exponent) + np.exp(-200)
+ return float(result)
+
+
+class ClusteringAlgorithm:
+ """Auto clusterer
+
+ Automatically choose the number of clusters and cluster the data with KMeans
+
+ Attributes:
+ K(int): the number of clusters
+ clusterer(KMeans): the underlying clusterer
+ """
+
+ def __init__(self):
+ """init the clusterer
+ """
+
+ self.K = 3
+ self.clusterer = None
+
+ def fit(self, train_data, method="Elbow"):
+ """train the cluster
+
+ Automatically choos the number of clusters with given method
+ For Elbow, we think if the difference between K-1, K, K+1 satisfies
+ [K-1] -[K] <= 2*([k]- [k+1]), then the graph is smooth enough.
+ will support Canopy in the future
+
+ Args:
+ train_data(numpy.ndarray): the dataset for training
+ method(str): will support Elbow and Canopy
+ """
+
+ if method == "Elbow":
+ train_num = train_data.shape[0]
+ upbound = int((train_num/2)**0.5)+2
+ dis = np.zeros(upbound)
+ for i in range(1, upbound):
+ self.clusterer = KMeans(i)
+ self.clusterer.fit(train_data)
+ label_dict = self.clusterer.class_point_dict
+ center = self.clusterer.cluster_center
+ for iter in range(i):
+ for point in label_dict[iter]:
+ dis[i] = dis[i]+self.clusterer.get_distance(point,center[iter])
+ dis[0] = 6 * dis[1] # sentinel so the elbow rule can pick k = 1
+ if upbound <= 5:
+ best_k = 1
+ for i in range(1, upbound):
+ if dis[i] <= dis[best_k]:
+ best_k = i
+ else:
+ for best_k in range(1, upbound-1):
+ if dis[best_k-1]-dis[best_k] <= 2 * (dis[best_k]-dis[best_k+1]):
+ break
+ self.clusterer = KMeans(best_k)
+ self.clusterer.fit(train_data)
+ print("choose {}".format(best_k))
+ elif method == "Canopy":
+ canopy = Canopy()
+ K = len(canopy.fit(train_data))
+ print("choose {}".format(K))
+ self.clusterer = KMeans(K)
+ self.clusterer.fit(train_data)
+
+ def predict(self, test_data):
+ """predict the test_data
+
+ Args:
+ test_data(numpy.ndarray): data for test
+
+ Return:
+ list: label for each point
+ """
+
+ return self.clusterer.predict(test_data)
+
+
+if __name__ == "__main__":
+ mean_list = [(1, 2), (16, -5), (10, 22)]
+ cov_list = [np.array([[73, 0], [0, 22]]), np.array([[21.2, 0], [0, 32.1]]), np.array([[10, 5], [5, 10]])]
+ num_list = [80, 80, 80]
+ save_path = ""
+ data_generate_and_save(3, mean_list, cov_list, num_list, save_path)
+ (train_data, train_label), (test_data, test_label) = data_load()
+ visualize(train_data, dimention=2, class_num=3)
+ visualize(train_data, dimention=2, label=train_label, class_num=3, color = True)
+ # print(train_data)
+ # print(type(train_data))
+ # print(train_data.shape)
+ k = KMeans(3)
+ k.fit(train_data)
+ label1 = k.predict(train_data)
+ visualize(train_data, dimention=2, label=label1, class_num=3, color=True)
+ print(compute_SC(train_data, label1, 3))
+
+ g = GaussianMixture(3)
+ g.fit(train_data)
+ label2 = g.predict(train_data)
+ visualize(train_data, label=label2, dimention=2, class_num=3, color=True)
+ print(compute_SC(train_data, label2, 3))
+
+ # print(compute_SC(train_data, train_label, 3))
+ # k = ClusteringAlgorithm()
+ # k.fit(train_data, method="Elbow")
+ # k.predict(train_data)
+ # e = ClusteringAlgorithm()
+ # e.fit(train_data, method="Canopy")
+
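The density formula used by `GaussianMixture.calculate` can be checked in isolation. The function below is an illustrative standalone sketch (the name `gaussian_density` is hypothetical, not part of the submission); it applies the same multivariate normal formula, without the `exp(-200)` smoothing term:

```python
import math
import numpy as np

def gaussian_density(x, mean, cov):
    """Multivariate normal density at x, mirroring the formula in
    GaussianMixture.calculate (without the exp(-200) smoothing)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    mean = np.asarray(mean, dtype=float).reshape(-1, 1)
    D = x.shape[0]
    coef = 1.0 / ((2 * math.pi) ** (D / 2) * np.linalg.det(cov) ** 0.5)
    exponent = -0.5 * float((x - mean).T @ np.linalg.inv(cov) @ (x - mean))
    return coef * math.exp(exponent)

# at the mean of a standard 2-D normal the density is 1/(2*pi) ~ 0.15915
print(gaussian_density([0, 0], [0, 0], np.eye(2)))
```

Evaluating at the mean gives an easy sanity check, since the exponent vanishes there and only the normalizing coefficient remains.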
diff --git a/assignment-3/submission/18307130116/tester_demo.py b/assignment-3/submission/18307130116/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89
--- /dev/null
+++ b/assignment-3/submission/18307130116/tester_demo.py
@@ -0,0 +1,117 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int)*i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+
+def data_1():
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+
+def data_2():
+ train_data = np.array([
+ [23, 12, 173, 2134],
+ [99, -12, -126, -31],
+ [55, -145, -123, -342],
+ ])
+ return (train_data, train_data), 2
+
+
+def data_3():
+ train_data = np.array([
+ [23],
+ [-2999],
+ [-2955],
+ ])
+ return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+ (train_data, test_data), n_clusters = data_function()
+ model = algorithm_class(n_clusters)
+ model.fit(train_data)
+ res = model.predict(test_data)
+ assert len(res.shape) == 1 and res.shape[0] == test_data.shape[0], \
+ "shape of result is wrong"
+ return res
+
+
+def testcase_1_1():
+ test_with_n_clusters(data_1, KMeans)
+ return True
+
+
+def testcase_1_2():
+ res = test_with_n_clusters(data_2, KMeans)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+ test_with_n_clusters(data_1, GaussianMixture)
+ return True
+
+
+def testcase_2_2():
+ res = test_with_n_clusters(data_3, GaussianMixture)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+ testcases = [
+ ["KMeans-1", testcase_1_1, 4],
+ ["KMeans-2", testcase_1_2, 4],
+ # ["KMeans-3", testcase_1_3, 4],
+ # ["KMeans-4", testcase_1_4, 4],
+ # ["KMeans-5", testcase_1_5, 4],
+ ["GMM-1", testcase_2_1, 4],
+ ["GMM-2", testcase_2_2, 4],
+ # ["GMM-3", testcase_2_3, 4],
+ # ["GMM-4", testcase_2_4, 4],
+ # ["GMM-5", testcase_2_5, 4],
+ ]
+ sum_score = sum([case[2] for case in testcases])
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except Exception as e:
+ if err_report:
+ print("Error [{}] occurs in {}".format(str(e), case[0]))
+ res = 0
+ score += res
+ print("+ {:14} {}/{}".format(case[0], res, case[2]))
+ print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1 and sys.argv[1] == "--report":
+ test_all(True)
+ else:
+ test_all()
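The elbow rule in `ClusteringAlgorithm.fit` can also be illustrated on its own. The sketch below (the helper name `choose_k_elbow` and the sample distortion values are illustrative, not from the submission) applies the same stopping condition to a precomputed distortion array:

```python
def choose_k_elbow(dis):
    """Pick the first k where the distortion curve flattens, i.e.
    dis[k-1] - dis[k] <= 2 * (dis[k] - dis[k+1])."""
    for k in range(1, len(dis) - 1):
        if dis[k - 1] - dis[k] <= 2 * (dis[k] - dis[k + 1]):
            return k
    return len(dis) - 2  # fall back to the largest candidate

# a synthetic distortion curve: drops of 500, 200, 80, 50 flatten at k = 3
dis = [900.0, 400.0, 200.0, 120.0, 70.0]
print(choose_k_elbow(dis))  # prints 3
```

The drop from k=2 to k=3 (80) is at most twice the next drop (100), so k=3 is the first index where the curve is considered smooth enough.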