diff --git a/assignment-3/submission/18307130116/README.md b/assignment-3/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1d5392c675488da60bf167158cd3e6975362c4d4
--- /dev/null
+++ b/assignment-3/submission/18307130116/README.md
@@ -0,0 +1,244 @@
+# Assignment 3 Lab Report
+
+[toc]
+
+## Code Comments
+
+Every function in this assignment carries a detailed Google-style docstring, so apart from the key functions, this report skips a detailed walkthrough of the APIs.
+
+## Model Implementation
+
+### Data Preprocessing
+
+To guard against extreme values, the raw data are rescaled so that every dimension falls into [-10, 10]. Since the scaling factor is the same for all dimensions, the clustering result under Euclidean distance is unaffected, while the exponentials in the probability computations can no longer overflow.
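+The rescaling itself is tiny; the sketch below mirrors the `data_preprocess` helper in `source.py`:
+
+```python
+import numpy as np
+
+def rescale(data: np.ndarray) -> np.ndarray:
+    # One global factor for every dimension, so Euclidean distances
+    # are scaled uniformly and cluster assignments are unchanged.
+    edge = max(abs(data.max()), abs(data.min()))
+    return data * 10 / edge
+```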
+### KMeans
+
+#### Principle and Convergence Criterion
+
+KMeans is conceptually simple: after randomly picking some points as the initial cluster centers, compute every point's distance to each center, assign the point to the nearest cluster, update the centers, and repeat until convergence.
+
+Convergence is declared when an update round leaves the cluster centers unchanged. Since KMeans is sensitive to initialization and some random starts make a few points oscillate between clusters, an upper bound on the number of iterations is imposed as well.
+
+#### Model Structure
+
+Besides the required `init`, `fit`, and `predict` APIs, three convenience APIs were added: `get_class` returns the cluster a given point belongs to under the current centers, `get_distance` returns the distance between two points, and `update_center` recomputes the cluster centers from the current assignment.
+
+### GMM
+
+#### Principle and Issues
+
+The Gaussian mixture model assumes the points are drawn from a superposition of Gaussian distributions, so the whole procedure amounts to fitting the sample with several Gaussians.
+
+GMM is likewise **very sensitive to initialization**, and it demands more data. On an extremely small dataset the fitted Gaussian degenerates into a tall, narrow spike resembling an impulse function, so a small perturbation causes a large change in the probability values, which destabilizes the convergence of the log-likelihood. The initialization therefore has to be chosen with care.
+
+GMM is trained with the EM algorithm. The E step, with the means and covariances fixed, evaluates the high-dimensional Gaussian densities and normalizes their sum; variationally, this fits the posterior so that the likelihood coincides with the evidence lower bound. The M step, given that posterior estimate, updates the parameters of the Gaussians and thereby raises the likelihood.
+
+#### Initialization
+
+As noted above, GMM is very sensitive to its initial values; with few samples in particular, a small move in the variables triggers a large change in the likelihood. In the actual implementation, using a likelihood-change threshold as the stopping criterion sometimes failed to produce a result at all.
+
+Reading the GMM source in scikit-learn and related material suggested two initialization strategies:
+
+* sample the training data repeatedly and use the averaged result as the initial values;
+* run KMeans first and use its cluster centers as the initial means of the Gaussians.
+
+This implementation uses the second strategy, which in practice clearly reduced the oscillation seen before.
+
+#### Model Structure
+
+Beyond the required interface, two extra APIs were added: `point_probability` computes a point's joint probability over all clusters, adding $e^{-200}$ in the computation as smoothing to avoid exact zeros, and `calculate` computes the point's probability under a single cluster.
+
+### ClusteringAlgorithm
+
+#### Approach
+
+Built on top of KMeans, `ClusteringAlgorithm` mainly has to select a suitable cluster count K; the two methods used are described below.
+
+#### K Selection Methods
+
+##### Rule of thumb
+
+A common empirical formula for K is $\sqrt{n/2}$; in this assignment that value serves as the upper bound of the Elbow search.
+
+##### Elbow
+
+The Elbow method looks for the knee of the distortion curve. In a plot like the one below, the knee is taken as the most reasonable K; here the knee sits at 4 while the true value is 3, so it lands near the truth.
+
+![elbow](img/elbow.png)
+
+At the knee the clusters are not split too finely and the distortion is already small, so it generally makes a good K. To find the knee automatically, the array `dis` records the total within-cluster distance for every k, and the knee is declared at the first k where the preceding drop is at most twice the following drop; the concrete choices are listed in [Automated Clustering Experiment Results](#automated-clustering-experiment-results).
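+A minimal sketch of this knee test, assuming `dis[k]` already holds the total within-cluster distance for candidate k (it mirrors the rule in `ClusteringAlgorithm.fit`):
+
+```python
+def find_knee(dis):
+    """Return the first k whose preceding drop is at most twice the following drop."""
+    for k in range(1, len(dis) - 1):
+        if dis[k - 1] - dis[k] <= 2 * (dis[k] - dis[k + 1]):
+            return k
+    return len(dis) - 1  # no knee found: fall back to the largest candidate
+```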
+##### Canopy
+
+###### Principle
+
+Canopy is a coarse-grained clustering method. Given two thresholds t1 and t2 with t1 > t2, a point is picked from the dataset; under some distance metric, points closer than t2 to it are unlikely to ever become cluster centers and are deleted from the dataset, while points closer than t1 are likely to belong to the same cluster. Since only the resulting number of clusters is needed here, which points share a cluster is not tracked. The process repeats until the dataset is empty.
+
+###### Implementation and Thresholds
+
+Canopy itself is simple to implement; the interesting part is setting t1 and t2. To keep K selection automatic, both are derived directly from the data dimensionality: since the preprocessed data lie in [-10, 10], t2 is set to $2\sqrt{d}$ for dimensionality $d$, and t1 to $2 \cdot t2$. Clustering until every last point is assigned would leave the result exposed to outliers, and the Canopy pass only has to produce the final K without caring about per-point assignments, so the stopping rule is changed: **once 80% of the points have been clustered, the Canopy pass terminates and returns.** The 80% threshold follows the 80/20 rule, on the view that 80% of the points reflect the overall shape of the data well enough. See [Automated Clustering Experiment Results](#automated-clustering-experiment-results) for the outcome.
+
+## Experiments
+
+### Data Generation and Visualization
+
+Data generation reuses the API from [Assignment 1](https://gitee.com/fnlp/prml-21-spring/blob/master/assignment-1/submission/18307130116/source.py#L94), now with docstrings and a `method` argument, so it can produce log-normal as well as high-dimensional Gaussian data. Visualization likewise builds on [Assignment 1](https://gitee.com/fnlp/prml-21-spring/blob/master/assignment-1/submission/18307130116/source.py#L129), adding docstrings plus a new `color` parameter: in a real clustering task the class structure is unknown, so an uncolored plot is sometimes the appropriate view.
+
+The corresponding APIs are `data_generate_and_save` for generating and saving data, `data_load` for loading it, and `visualize` for plotting.
+
+### Evaluating Clustering Quality
+
+Besides direct visual inspection, the Silhouette Coefficient (SC) is used as the quality metric. For each point, let a be its mean distance to the other points of its own cluster and b the smallest mean distance to the points of any other cluster; the silhouette value is $(b-a)/\max(a,b)$, which lies in [-1, 1], and the closer the average is to 1, the better.
+
+The `compute_SC` interface takes all the points and their labels and returns the score.
+
+### Basic Experiments
+
+#### Regular case
+
+This experiment generates one dataset, visualizes the KMeans and GMM clustering results, and compares them with the ground truth.
+
+The parameters are:
+
+| Class | Mean | Covariance | Points |
+| ---- | ------- | ---------------- | ---- |
+| 1 | (1, -7) | [[1, 0], [0, 1]] | 800 |
+| 2 | (1, -4) | [[1, 0], [0, 1]] | 800 |
+| 3 | (1, 0) | [[1, 0], [0, 1]] | 800 |
+
+The uncolored plot and the ground truth:
+
+![unlabeled data](img/基础实验题目.png) ![ground truth](img/基础实验答案.png)
+
+The KMeans result (left) and the GMM result (right):
+
+![KMeans result](img/Kmeans基础实验结果.png) ![GMM result](img/GMM基础实验结果.png)
+
+Both models recover the clusters correctly on the whole, differing only in fine details.
+
+The SC scores are KMeans = 0.497 and GMM = 0.497: the difference is negligible, with KMeans marginally ahead.
+
+#### Data with extreme values
+
+This experiment makes one dimension much larger, to check whether extreme values break the computation; in practice it is a case where the classes are far apart. Parameters:
+
+| Class | Mean | Covariance | Points |
+| ---- | --------- | ------------------ | ---- |
+| 1 | (1, -700) | [[1, 0], [0, 100]] | 800 |
+| 2 | (1, -400) | [[1, 0], [0, 100]] | 800 |
+| 3 | (1, 0) | [[1, 0], [0, 1]] | 800 |
+
+The uncolored plot and the ground truth:
+
+![unlabeled data](img/极端实验题目.png) ![ground truth](img/极端实验答案.png)
+
+The KMeans result (left) and the GMM result (right):
+
+![KMeans result](img/Kmeans极端实验答案.png) ![GMM result](img/GMM极端实验答案.png)
+
+The SC scores are KMeans = 0.973 and GMM = 0.973, exactly the same.
+
+#### Clustering under heavy overlap
+
+Having covered slightly overlapping and non-overlapping data, the next test uses heavily overlapping clusters, with parameters:
+
+| Class | Mean | Covariance | Points |
+| ---- | ------ | ---------------- | ---- |
+| 1 | (1, 1) | [[2, 0], [0, 2]] | 800 |
+| 2 | (1, 2) | [[2, 0], [0, 2]] | 800 |
+| 3 | (1, 0) | [[2, 0], [0, 2]] | 800 |
+
+The uncolored plot and the ground truth:
+
+![unlabeled data](img/高度重叠题目.png) ![ground truth](img/高度重叠答案.png)
+
+The KMeans result (left) and the GMM result (right):
+
+![KMeans result](img/Kmeans高度重叠答案.png) ![GMM result](img/GMM高度重叠答案.png)
+
+The SC scores are KMeans = 0.323 and GMM = 0.310, so KMeans is slightly better. The SC of the original labels under this heavy overlap is -0.003: both clusterers score higher than the ground truth itself. In practice this can be deceptive, so one should understand the data distribution before trusting a clusterer, rather than running it blindly and accepting a "well-scored" but wrong result.
+
+### Supplementary GMM Experiment
+
+GMM is initialized from KMeans here, but its value goes beyond clustering: it estimates the parameters of a superposition of Gaussians. As a supplementary check, this experiment looks at how non-Gaussian data behaves under GMM.
+
+![unlabeled data](img/对数正态分布题目.png) ![ground truth](img/对数正态分布答案.png)
+
+The KMeans result (left) and the GMM result (right):
+
+![KMeans result](img/Kmeans对数正态分布答案.png) ![GMM result](img/GMM对数正态分布.png)
+
+The SC scores are KMeans = 0.677 and GMM = 0.620, with the original labels at -0.0348. GMM reads the log-normal data as heavily overlapping Gaussians, which does not hurt the practical result much.
+
+### Automated Clustering Experiment Results
+
+The automated `ClusteringAlgorithm` is built on KMeans, so the experiments below only look at the chosen K and do not repeat the KMeans quality tests.
+
+With one parameter set (true K = 3, see the table), the ELBOW- and Canopy-based versions were run for several rounds. One round is visualized below: plus marks are the test data, and the colored version is on the right.
+
+![unlabeled data](img/无类别.png) ![labeled data](img/有类别data.png)
+
+| Class | Mean | Covariance | Points |
+| ---- | -------- | ---------------------- | ---- |
+| 1 | (1, 2) | [[73, 0], [0, 22]] | 800 |
+| 2 | (16, -5) | [[21.2, 0], [0, 32.1]] | 200 |
+| 3 | (10, 22) | [[10, 5], [5, 10]] | 1000 |
+
+| Round | K from ELBOW | K from Canopy |
+| ---- | ---------------- | ----------------- |
+| 1 | 5 | 4 |
+| 2 | 4 | 4 |
+| 3 | 4 | 5 |
+| 4 | 3 | 5 |
+| 5 | 4 | 4 |
+
+Both ELBOW and Canopy land close to the true K, so the automation works well. ELBOW has to run about $\sqrt{n/2}$ rounds of KMeans while Canopy needs only a single pass, so Canopy is much faster. ELBOW fluctuates more, since whether it hits the optimum depends on the random initialization inside KMeans, but overall both perform well.
+
+### Canopy Robustness Tests
+
+The results above show ELBOW and Canopy choosing similar K values, with Canopy far faster; but Canopy depends on the thresholds t1 and t2, which in this assignment are identical for all data of the same dimensionality, hence these robustness tests. ELBOW enumerates K by brute force to pick the best value, so its robustness can be expected to be good, and the KMeans side is skipped here.
+
+To keep the thresholds identical, several 2-D datasets were tested; parameters and results follow, including a third, almost-disjoint group and a fourth, heavily overlapping group:
+
+| Round | Means | Covariances | Points per class | True K | Result |
+| ---- | ------------------------------ | ----------------------------------------------------------- | --------------- | ---------- | ---- |
+| 1 | [(1, -7), (1, -4), (1, 0)] | identity | [800, 800, 800] | 3 | 3 |
+| 2 | [(1, -7), (1, -4), (1, 0)] | [[10, 0], [0, 1]] [[1, 0], [0, 10]] [[2, 0], [6, 5]] | [800, 800, 800] | 3 | 3 |
+| 3 | [(10, 10), (-10, -10), (5, 0)] | identity | [800, 800, 800] | 3 | 3 |
+| 4 | [(1, 1), (1, 2), (1, 0)] | [[2, 0], [0, 2]] [[2, 0], [0, 2]] [[2, 0], [0, 2]] | [800, 800, 800] | 3 | 4 |
+
+The fully separated and heavily overlapping datasets:
+
+![fully separated data](img/完全分离数据.png) ![heavily overlapping data](img/高度重叠数据.png)
+
+Even on heavily overlapping data Canopy performs well; the method is robust to the shape of the distribution.
+
+Next, robustness to the class sizes: the first parameter set from the table above is reused, changing only the per-class point counts:
+
+| Round | Points | Result |
+| ---- | ---------------- | ---- |
+| 1 | [80, 1000, 80] | 1 |
+| 2 | [200, 1000, 10] | 2 |
+| 3 | [200, 1000, 100] | 3 |
+
+Class sizes clearly affect Canopy a lot. Since the implementation uses the 80/20 rule and ignores a small share of outlying points, when one class vastly outnumbers the others, the points of the small classes get ignored, and the closer they lie, the more likely they are to be dropped; this matches how clustering behaves in practice.
+
+Finally, K selection with more classes; [a, b] means that over several runs the chosen K always fell inside the interval [a, b]:
+
+| Round | Means | Covariances | Points per class | True K | Result |
+| ---- | ----------------------------------- | ------------------------------------------------------------ | --------------- | ---------- | ----- |
+| 1 | [(1, -7), (1, -4), (1, 0), (-2, 0)] | identity | [800, 800, 800] | 4 | [3,4] |
+| 2 | [(1, -7), (1, -4), (1, 0), (-2, 0)] | [[10, 0], [0, 1]] [[1, 0], [0, 10]] [[2, 0], [6, 5]] [[3, 0], [1, 5]] | [800, 800, 800] | 4 | [3,4] |
+
+#### Summary
+
+In this part, Canopy proved robust to the shape of the distribution, performing well under both heavy and light overlap, but not robust to the per-class point counts: it tends to drop small classes. That follows from how it is implemented and matches expectations.
\ No newline at end of file
diff --git "a/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" "b/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png"
new file mode 100644
index 0000000000000000000000000000000000000000..c369d0d437a06ee47a31c8aa28fc5b2360e17ea0
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png" "b/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png"
new file mode 100644
index 0000000000000000000000000000000000000000..5577014148822a67b7cff48ba5e73e5e5ddcff73
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..40088f3cb594956cd7f3e9ccc6705ccf45827db3
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png"
new file mode 100644
index 0000000000000000000000000000000000000000..7294753f3d2219aece852691320d7f64611eb510
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/GMM\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" "b/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png"
new file mode 100644
index 0000000000000000000000000000000000000000..d1aa314ef755065b8965ab3b811c8af5325ee35e
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\345\237\272\347\241\200\345\256\236\351\252\214\347\273\223\346\236\234.png" differ
diff --git "a/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png"
new file mode 100644
index
0000000000000000000000000000000000000000..ecff0a29f27fcd79d94e5f8195b6752aef9fba0b Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" differ diff --git "a/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" new file mode 100644 index 0000000000000000000000000000000000000000..dab91ad6746eecb423c695a38749c1bb5fc0de42 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ diff --git "a/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" new file mode 100644 index 0000000000000000000000000000000000000000..55b8377dc1f11b74d71c0eadd77dd40a5c4df74d Binary files /dev/null and "b/assignment-3/submission/18307130116/img/Kmeans\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ diff --git a/assignment-3/submission/18307130116/img/elbow.png b/assignment-3/submission/18307130116/img/elbow.png new file mode 100644 index 0000000000000000000000000000000000000000..f5a2e5e471a9f91630b179c9d9fa2fc478967176 Binary files /dev/null and b/assignment-3/submission/18307130116/img/elbow.png differ diff --git "a/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png" new file mode 100644 index 0000000000000000000000000000000000000000..79e2ea68d5a4d211a87599c3267bd2e48cf0daa4 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\347\255\224\346\241\210.png" differ diff --git "a/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png" new file mode 100644 index 0000000000000000000000000000000000000000..90c9da70c56b6adb825c55a7372aa37693a66dec Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\237\272\347\241\200\345\256\236\351\252\214\351\242\230\347\233\256.png" differ diff --git "a/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png" "b/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png" new file mode 100644 index 0000000000000000000000000000000000000000..572bf49a5ceaa1b1ce3e2493d81be2553c621e6d Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\256\214\345\205\250\345\210\206\347\246\273\346\225\260\346\215\256.png" differ diff --git "a/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" new file mode 100644 index 
0000000000000000000000000000000000000000..d5c27c4b1854debe84105d77a636ea0119368383 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\347\255\224\346\241\210.png" differ diff --git "a/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png" new file mode 100644 index 0000000000000000000000000000000000000000..1fb0fc17af5f4239734c8572df31d2f4ae68537f Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\345\257\271\346\225\260\346\255\243\346\200\201\345\210\206\345\270\203\351\242\230\347\233\256.png" differ diff --git "a/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png" "b/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png" new file mode 100644 index 0000000000000000000000000000000000000000..144f73c866afc1c56cc6d50b5e9c01b342d8da2f Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\227\240\347\261\273\345\210\253.png" differ diff --git "a/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png" "b/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png" new file mode 100644 index 0000000000000000000000000000000000000000..6d97fae31b24e72ecb5f4b19c52faa03636fbdf8 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\234\211\347\261\273\345\210\253data.png" differ diff --git "a/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" new file mode 100644 index 0000000000000000000000000000000000000000..dab91ad6746eecb423c695a38749c1bb5fc0de42 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\347\255\224\346\241\210.png" differ diff --git "a/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png" new file mode 100644 index 0000000000000000000000000000000000000000..35a9a8754cd5d6d4e8ca3b3d8799f0110dc46f68 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\346\236\201\347\253\257\345\256\236\351\252\214\351\242\230\347\233\256.png" differ diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png" new file mode 100644 index 0000000000000000000000000000000000000000..74fafa0211602f3c8c6682127e3196cfa97bde11 Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\346\225\260\346\215\256.png" differ diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" new file mode 100644 index 
0000000000000000000000000000000000000000..2fe7b8b7aaae54eebe4001233150cba346870577
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\347\255\224\346\241\210.png" differ
diff --git "a/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png" "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png"
new file mode 100644
index 0000000000000000000000000000000000000000..a9ff7f42e30d93a44bff57b9facb19111a2b75bc
Binary files /dev/null and "b/assignment-3/submission/18307130116/img/\351\253\230\345\272\246\351\207\215\345\217\240\351\242\230\347\233\256.png" differ
diff --git a/assignment-3/submission/18307130116/source.py b/assignment-3/submission/18307130116/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..5487cd0c73af70ba038bd590a54d5cfa3c8c080a
--- /dev/null
+++ b/assignment-3/submission/18307130116/source.py
@@ -0,0 +1,649 @@
+import numpy as np
+import random
+import matplotlib.pyplot as plt
+import math
+import matplotlib.cm as cm
+
+
+def data_preprocess(data):
+    """Preprocess the data.
+
+    Rescale the data into [-10, 10] with a single global factor, so
+    Euclidean-distance clustering is unaffected.
+
+    Args:
+        data(numpy.ndarray): raw data
+
+    Returns:
+        numpy.ndarray: data after rescaling
+    """
+
+    edge = max(abs(data.max()), abs(data.min()))
+    result_data = (data*10)/edge
+    return result_data
+
+
+def compute_SC(data, label, class_num):
+    """Compute the Silhouette Coefficient.
+
+    Args:
+        data(numpy.ndarray): data to score
+        label(list): label of every point
+        class_num(int): the number of clusters
+
+    Returns:
+        float: the value of the Silhouette Coefficient
+    """
+    point_dict = {}
+    data = data_preprocess(data)
+    if len(data.shape) == 1:
+        dimention = 1
+    else:
+        dimention = data.shape[1]
+    for iter in range(class_num):
+        point_dict[iter] = []
+    for iter in range(len(data)):
+        point_dict[label[iter]].append(data[iter])
+    result = 0
+    for iter in range(len(data)):
+        now_point = data[iter]
+        now_point = now_point.reshape(-1, 1)
+        inner_dis = 0
+        now_label = label[iter]
+        for other in point_dict[now_label]:
+            other = other.reshape(-1, 1)
+            temp = 0
+            for i in range(dimention):
+                temp = temp + (now_point[i]-other[i]) ** 2
+            inner_dis = inner_dis + temp**0.5
+        inner_dis = inner_dis / (len(point_dict[now_label]) - 1)
+        out_dis_min = math.inf
+        for label_iter in range(class_num):
+            if label_iter == now_label:
+                continue
+            out_dis = 0
+            for other in point_dict[label_iter]:
+                other = other.reshape(-1, 1)
+                temp = 0
+                for i in range(dimention):
+                    temp = temp + (now_point[i]-other[i]) ** 2
+                out_dis = out_dis + temp**0.5
+            out_dis = out_dis / len(point_dict[label_iter])
+            if out_dis < out_dis_min:
+                out_dis_min = out_dis
+        result = result + (out_dis_min - inner_dis)/max(out_dis_min, inner_dis)
+    result = result / len(data)
+    return result
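+
+# Note: compute_SC is the plain O(n^2) silhouette score, averaging
+# s = (b - a) / max(a, b) over all points, where a is the mean distance to
+# the point's own cluster and b the smallest mean distance to another
+# cluster. Usage sketch (cf. the __main__ block below):
+#
+#     score = compute_SC(train_data, labels, 3)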
+
+
+def data_generate_and_save(class_num, mean_list, cov_list, num_list, save_path="", method="Gaussian"):
+    """Generate data following the given distribution and save it.
+
+    The labels are saved alongside the data, with an 80/20 train/test split.
+
+    Args:
+        class_num(int): the number of classes
+        mean_list(list): mean_list[i] is the mean of class i
+        cov_list(list): cov_list[i] is the covariance of class i
+        num_list(list): num_list[i] is the number of points in class i
+        save_path(str): the data storage path, ending with a slash
+        method(str): the distribution to sample from, "Gaussian" or "lognormal"
+    """
+    if method == "lognormal":
+        data = np.random.lognormal(mean_list[0], cov_list[0], num_list[0])
+    elif method == "Gaussian":
+        data = np.random.multivariate_normal(mean_list[0], cov_list[0], (num_list[0],))
+    label = np.zeros((num_list[0],), dtype=int)
+    total = num_list[0]
+
+    for iter in range(1, class_num):
+        # sample class `iter` with its own mean, covariance and size
+        if method == "lognormal":
+            temp = np.random.lognormal(mean_list[iter], cov_list[iter], num_list[iter])
+        elif method == "Gaussian":
+            temp = np.random.multivariate_normal(mean_list[iter], cov_list[iter], (num_list[iter],))
+        label_temp = np.ones((num_list[iter],), dtype=int)*iter
+        data = np.concatenate([data, temp])
+        label = np.concatenate([label, label_temp])
+        total += num_list[iter]
+
+    idx = np.arange(total)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    train_num = int(total * 0.8)
+    train_data = data[:train_num, ]
+    test_data = data[train_num:, ]
+    train_label = label[:train_num, ]
+    test_label = label[train_num:, ]
+    np.save(save_path+"data.npy", ((train_data, train_label), (test_data, test_label)))
+
+
+def data_load(path=""):
+    """Load data from the given path.
+
+    The stored data follow the format ((data, label), (data, label)).
+
+    Args:
+        path(str): the path the data are stored in
+
+    Returns:
+        tuple: train and test splits, each a (data, label) pair
+    """
+
+    (train_data, train_label), (test_data, test_label) = np.load(path+"data.npy", allow_pickle=True)
+    return (train_data, train_label), (test_data, test_label)
+
+
+def visualize(data, label=None, dimention=2, class_num=1, test_data=np.array([None]), color=False):
+    """Draw a scatter plot.
+
+    If color is True, a color is assigned to each class automatically
+    (label and class_num are then required). Test data are marked with
+    a plus sign.
+
+    Args:
+        data(numpy.ndarray): train dataset
+        label(numpy.ndarray): label of the train data, used when color=True
+        dimention(int): the data dimension, should be 1 or 2
+        class_num(int): the number of clusters, only used when color=True
+        test_data(numpy.ndarray): test dataset
+        color(boolean): True to color each class differently
+    """
+
+    if color == True:
+        data_x = {}
+        data_y = {}
+        for iter in range(class_num):
+            data_x[iter] = []
+            data_y[iter] = []
+        if dimention == 2:
+            for iter in range(len(label)):
+                data_x[label[iter]].append(data[iter, 0])
+                data_y[label[iter]].append(data[iter, 1])
+        elif dimention == 1:
+            for iter in range(len(label)):
+                data_x[label[iter]].append(data[iter])
+                data_y[label[iter]].append(0)
+        colors = cm.rainbow(np.linspace(0, 1, class_num))
+        for class_idx, c in zip(range(class_num), colors):
+            plt.scatter(data_x[class_idx], data_y[class_idx], color=c)
+        if test_data.any() is not None:
+            if dimention == 2:
+                plt.scatter(test_data[:, 0], test_data[:, 1], marker='+')
+            elif dimention == 1:
+                plt.scatter(test_data, np.zeros(len(test_data)), marker='+')
+    else:
+        if dimention == 2:
+            plt.scatter(data[:, 0], data[:, 1], marker='o')
+        elif dimention == 1:
+            plt.scatter(data, np.zeros(len(data)), marker='o')
+        if test_data.any() is not None:
+            if dimention == 2:
+                plt.scatter(test_data[:, 0], test_data[:, 1], marker='+')
+            elif dimention == 1:
+                plt.scatter(test_data, np.zeros(len(test_data)), marker='+')
+    plt.show()
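+
+# Usage sketch for the data helpers (hypothetical parameters; the __main__
+# block at the bottom runs a concrete version):
+#
+#     data_generate_and_save(3, mean_list, cov_list, num_list, save_path="")
+#     (train_data, train_label), (test_data, test_label) = data_load()
+#     visualize(train_data, label=train_label, class_num=3, color=True)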
+
+
+class Canopy:
+    """Canopy clustering method.
+
+    The model initializes its thresholds automatically from the
+    dimensionality of the dataset. A coarse, low-accuracy method whose
+    main use here is estimating the cluster count K for a dataset.
+
+    Attributes:
+        t1(float): the loose (outer) threshold
+        t2(float): the tight (inner) threshold
+        dimention(int): dimension of the data
+    """
+    def __init__(self):
+        """Init the whole model."""
+        self.t1 = 0
+        self.t2 = 0
+        self.dimention = 0
+
+    def get_distance(self, point1, point2, method="Euclidean"):
+        """Compute the distance between two points.
+
+        Only Euclidean distance is supported for now; other metrics may
+        be added in the future.
+
+        Args:
+            point1(numpy.ndarray): one point
+            point2(numpy.ndarray): the other point
+            method(str): the distance metric
+
+        Returns:
+            float: distance between the two points
+        """
+
+        dis = 0
+        point1 = point1.reshape(-1, 1)
+        point2 = point2.reshape(-1, 1)
+        if method == "Euclidean":
+            for iter in range(self.dimention):
+                dis += (point1[iter]-point2[iter]) ** 2
+        return dis ** 0.5
+
+    def fit(self, train_data):
+        """Run Canopy on the training data.
+
+        Points closer than t2 to a chosen center are deleted from the
+        dataset; points closer than t1 are attached to the center. The
+        loop stops once 80% of the points have been clustered (see the
+        README for the rationale).
+
+        Args:
+            train_data(numpy.ndarray): dataset for training
+
+        Returns:
+            list: tuples of the form (center point, [points within t1])
+        """
+
+        train_data = data_preprocess(train_data)
+        train_num = train_data.shape[0]
+        if len(train_data.shape) == 1:
+            self.dimention = 1
+        else:
+            self.dimention = train_data.shape[1]
+        self.t2 = 2 * self.dimention**0.5
+        self.t1 = 2 * self.t2
+        result = []
+        while len(train_data) >= 0.2 * train_num:
+            idx = random.randint(0, len(train_data) - 1)
+            center = train_data[idx]
+            point_list = []
+            point_need_delete = []
+            train_data = np.delete(train_data, idx, 0)
+            for iter in range(len(train_data)):
+                dis = self.get_distance(train_data[iter], center)
+                if dis < self.t2:
+                    point_need_delete.append(iter)
+                elif dis < self.t1:
+                    point_list.append(train_data[iter])
+            result.append((center, point_list))
+            train_data = np.delete(train_data, point_need_delete, 0)
+        return result
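+
+# Canopy is used here only to estimate the cluster count K: each tuple in
+# the returned list is one canopy, so K is just the length of the result.
+# Sketch (cf. ClusteringAlgorithm.fit below):
+#
+#     K = len(Canopy().fit(train_data))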
+
+
+class KMeans:
+    """KMeans clustering algorithm.
+
+    Attributes:
+        n_clusters(int): number of clusters
+        cluster_center(list): center of each cluster
+        class_point_dict(dict): points currently assigned to each cluster
+        dimention(int): dimension of the data
+    """
+
+    def __init__(self, n_clusters):
+        """Init the clusterer.
+
+        The point dict and center list start empty; the cluster count
+        comes from the argument.
+
+        Args:
+            n_clusters(int): number of clusters
+        """
+
+        self.n_clusters = n_clusters
+        self.cluster_center = []
+        self.class_point_dict = {}
+        self.dimention = 0
+
+    def fit(self, train_data):
+        """Train the clusterer.
+
+        Reads the data dimension and randomly selects the initial cluster
+        centers, labels each point with the nearest center, then loops
+        until the centers stop changing, capped at 100 epochs.
+
+        Args:
+            train_data(numpy.ndarray): training data for this task
+        """
+
+        train_data = data_preprocess(train_data)
+        train_data_num = train_data.shape[0]
+        if len(train_data.shape) == 1:
+            self.dimention = 1
+        else:
+            self.dimention = train_data.shape[1]
+        for iter in range(self.n_clusters):
+            self.class_point_dict[iter] = []
+        idx = random.sample(range(train_data_num), self.n_clusters)
+        for iter in range(self.n_clusters):
+            self.cluster_center.append(train_data[idx[iter]])
+        for iter in train_data:
+            label = self.get_class(iter)
+            self.class_point_dict[label].append(iter)
+        epoch = 0
+        while not self.update_center() and epoch < 100:
+            for label in range(self.n_clusters):
+                self.class_point_dict[label] = []
+            for iter in train_data:
+                label = self.get_class(iter)
+                self.class_point_dict[label].append(iter)
+            epoch = epoch + 1
+
+    def predict(self, test_data):
+        """Predict labels.
+
+        Args:
+            test_data(numpy.ndarray): data for testing
+
+        Returns:
+            numpy.ndarray: label of each point in test_data
+        """
+
+        test_data = data_preprocess(test_data)
+        result = np.array([])
+        for iter in test_data:
+            label = self.get_class(iter)
+            result = np.r_[result, np.array([label])]
+        return result
+
+    def get_class(self, point):
+        """Find the point's class according to the current centers.
+
+        Args:
+            point(numpy.ndarray): the point to classify
+
+        Returns:
+            int: in range(n_clusters), the class of the point
+        """
+
+        min_class = 0
+        for iter in range(self.n_clusters):
+            temp = self.get_distance(self.cluster_center[iter], point)
+            if iter == 0:
+                min_dis = temp
+            else:
+                if min_dis > temp:
+                    min_dis = temp
+                    min_class = iter
+        return min_class
+
+    def get_distance(self, point1, point2, method="Euclidean"):
+        """Compute the distance between two points.
+
+        Only Euclidean distance is supported for now; other metrics may
+        be added in the future.
+
+        Args:
+            point1(numpy.ndarray): one point
+            point2(numpy.ndarray): the other point
+            method(str): the distance metric
+
+        Returns:
+            float: distance between the two points
+        """
+
+        dis = 0
+        point1 = point1.reshape(-1, 1)
+        point2 = point2.reshape(-1, 1)
+        if method == "Euclidean":
+            for iter in range(self.dimention):
+                dis += (point1[iter]-point2[iter]) ** 2
+        return dis ** 0.5
+
+    def update_center(self):
+        """Use class_point_dict to update cluster_center.
+
+        Returns:
+            boolean: whether the centers stayed unchanged
+        """
+
+        result = True
+        for iter in range(self.n_clusters):
+            if len(self.class_point_dict[iter]) == 0:
+                # keep the old center when a cluster loses all its points
+                continue
+            temp = np.zeros(self.dimention)
+            for point in self.class_point_dict[iter]:
+                temp = temp + point
+            temp = temp / len(self.class_point_dict[iter])
+            result = result and (temp == self.cluster_center[iter]).all()
+            self.cluster_center[iter] = temp
+        return result
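+
+# KMeans usage sketch (the __main__ block below runs a full example):
+#
+#     model = KMeans(3)
+#     model.fit(train_data)      # data are rescaled to [-10, 10] internally
+#     labels = model.predict(test_data)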
+
+
+class GaussianMixture:
+    """Gaussian mixture model for clustering.
+
+    Attributes:
+        n_clusters(int): number of clusters
+        pi(numpy.ndarray): mixing weight of each cluster
+        cov(dict): covariance matrix of each cluster
+        mean(dict): mean vector of each cluster
+        gamma(numpy.ndarray): responsibilities (gamma in the EM algorithm)
+        epsilon(float): small constant reserved for a stopping threshold
+        dimention(int): dimension of the data
+    """
+
+    def __init__(self, n_clusters):
+        """Init the parameters of the model.
+
+        The mixing weights start uniform; the cluster count comes from
+        the argument.
+
+        Args:
+            n_clusters(int): number of clusters
+        """
+
+        self.pi = np.ones(n_clusters)/n_clusters
+        self.n_clusters = n_clusters
+        self.cov = {}
+        self.mean = {}
+        self.gamma = None
+        self.epsilon = 1e-20
+        self.dimention = 0
+
+    def fit(self, train_data):
+        """Train the model with EM.
+
+        Every cluster starts with the identity covariance, and KMeans
+        provides the initial means (see the README, "Initialization").
+        The E step computes the responsibility of every cluster for every
+        point; the M step updates the parameters of every distribution.
+
+        Args:
+            train_data(numpy.ndarray): data for training
+        """
+
+        k_model = KMeans(self.n_clusters)
+        k_model.fit(train_data)
+        train_data = data_preprocess(train_data)
+        train_num = train_data.shape[0]
+        if len(train_data.shape) == 1:
+            self.dimention = 1
+        else:
+            self.dimention = train_data.shape[1]
+        for iter in range(self.n_clusters):
+            self.mean[iter] = k_model.cluster_center[iter]
+            self.cov[iter] = np.eye(self.dimention)
+        self.gamma = np.empty([train_num, self.n_clusters])
+        # a fixed number of EM rounds; a likelihood-change stopping rule
+        # proved unstable on small datasets (see the README)
+        for i in range(20):
+            # E step: responsibilities, normalized per point
+            for iter in range(train_num):
+                self.gamma[iter, :] = np.array(self.point_probability(train_data[iter]))
+            self.gamma = self.gamma/self.gamma.sum(axis=1).reshape(-1, 1)
+
+            # M step: update weights, means and covariances
+            self.pi = np.sum(self.gamma, axis=0)/train_num
+            for label in range(self.n_clusters):
+                Nk = np.sum(self.gamma, axis=0)[label]
+                mean = np.zeros(self.dimention)
+                for iter in range(train_num):
+                    mean += self.gamma[iter, label] * train_data[iter]
+                self.mean[label] = mean / Nk
+                # the covariance is computed around the freshly updated mean
+                cov = np.zeros([self.dimention, self.dimention])
+                label_mean = self.mean[label].reshape(-1, 1)
+                for iter in range(train_num):
+                    rest = train_data[iter].reshape(-1, 1) - label_mean
+                    cov += self.gamma[iter, label] * np.matmul(rest, rest.T)
+                self.cov[label] = cov / Nk
+
+    def predict(self, test_data):
+        """Predict labels.
+
+        Args:
+            test_data(numpy.ndarray): data for testing
+
+        Returns:
+            numpy.ndarray: each point's label in test_data
+        """
+
+        test_data = data_preprocess(test_data)
+        result = []
+        for iter in test_data:
+            temp = self.point_probability(iter)
+            label = temp.index(max(temp))
+            result.append(label)
+        return np.array(result)
+
+    def point_probability(self, point):
+        """Calculate the weighted density of every Gaussian at a point.
+
+        Args:
+            point(numpy.ndarray): the point to evaluate
+
+        Returns:
+            list: one (smoothed) weighted density per distribution
+        """
+
+        result = []
+        for iter in range(self.n_clusters):
+            result.append(self.calculate(point, iter) * self.pi[iter])
+        return result
+
+    def calculate(self, point, iter):
+        """Calculate the density of the iter-th Gaussian at a point.
+
+        A constant e^(-200) is added as smoothing, so the result is
+        never exactly zero.
+
+        Args:
+            point(numpy.ndarray): the point to evaluate
+            iter(int): the index of the distribution
+
+        Returns:
+            float: the (smoothed) density at the point
+        """
+
+        point = point.reshape(-1, 1)
+        mean = self.mean[iter]
+        mean = mean.reshape(-1, 1)
+        cov = self.cov[iter]
+        D = self.dimention
+        coef = 1/((2*math.pi) ** (D/2) * (np.linalg.det(cov))**0.5)
+        exponent = -0.5 * np.matmul(np.matmul((point - mean).T, np.linalg.inv(cov)), (point - mean))
+        result = coef * np.exp(exponent) + np.exp(-200)
+        return float(result)
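+
+# GaussianMixture usage sketch; fit() runs KMeans internally to seed the
+# Gaussian means (see the README), and each cluster density is evaluated
+# by calculate() as N(x | mu, cov) plus an e^(-200) smoothing term:
+#
+#     model = GaussianMixture(3)
+#     model.fit(train_data)
+#     labels = model.predict(test_data)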
+
+
+class ClusteringAlgorithm:
+    """Automatic clusterer.
+
+    Chooses K automatically and clusters the data with KMeans.
+
+    Attributes:
+        K(int): the number of clusters
+        clusterer(KMeans): the underlying clusterer
+    """
+
+    def __init__(self):
+        """Init the clusterer."""
+
+        self.K = 3
+        self.clusterer = None
+
+    def fit(self, train_data, method="Elbow"):
+        """Train the clusterer, choosing K with the given method.
+
+        For Elbow, the total within-cluster distance dis[k] is computed
+        for every candidate k, and the knee is the first k with
+        dis[k-1] - dis[k] <= 2 * (dis[k] - dis[k+1]), i.e. the curve is
+        flat enough afterwards. For Canopy, K is the number of canopies.
+
+        Args:
+            train_data(numpy.ndarray): the dataset for training
+            method(str): "Elbow" or "Canopy"
+        """
+
+        if method == "Elbow":
+            train_num = train_data.shape[0]
+            upbound = int((train_num/2)**0.5)+2
+            dis = np.zeros(upbound)
+            for i in range(1, upbound):
+                self.clusterer = KMeans(i)
+                self.clusterer.fit(train_data)
+                label_dict = self.clusterer.class_point_dict
+                center = self.clusterer.cluster_center
+                for iter in range(i):
+                    for point in label_dict[iter]:
+                        dis[i] = dis[i]+self.clusterer.get_distance(point, center[iter])
+            dis[0] = 6 * dis[1]  # sentinel so that k = 0 is never chosen
+            if upbound <= 5:
+                best_k = 1
+                for i in range(1, upbound):
+                    if dis[i] <= dis[best_k]:
+                        best_k = i
+            else:
+                for best_k in range(1, upbound-1):
+                    if dis[best_k-1]-dis[best_k] <= 2 * (dis[best_k]-dis[best_k+1]):
+                        break
+            self.clusterer = KMeans(best_k)
+            self.clusterer.fit(train_data)
+            print("choose {}".format(best_k))
+        elif method == "Canopy":
+            canopy = Canopy()
+            K = len(canopy.fit(train_data))
+            print("choose {}".format(K))
+            self.clusterer = KMeans(K)
+            self.clusterer.fit(train_data)
+
+    def predict(self, test_data):
+        """Predict the labels of test_data.
+
+        Args:
+            test_data(numpy.ndarray): data for testing
+
+        Returns:
+            numpy.ndarray: label of each point
+        """
+
+        return self.clusterer.predict(test_data)
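+
+# ClusteringAlgorithm usage sketch: K is chosen automatically with "Elbow"
+# (default) or "Canopy", then a KMeans clusterer is fitted with that K:
+#
+#     auto = ClusteringAlgorithm()
+#     auto.fit(train_data, method="Canopy")
+#     labels = auto.predict(test_data)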
+
+
+if __name__ == "__main__":
+    mean_list = [(1, 2), (16, -5), (10, 22)]
+    cov_list = [np.array([[73, 0], [0, 22]]), np.array([[21.2, 0], [0, 32.1]]), np.array([[10, 5], [5, 10]])]
+    num_list = [80, 80, 80]
+    save_path = ""
+    data_generate_and_save(3, mean_list, cov_list, num_list, save_path)
+    (train_data, train_label), (test_data, test_label) = data_load()
+    visualize(train_data, dimention=2, class_num=3)
+    visualize(train_data, dimention=2, label=train_label, class_num=3, color=True)
+
+    k = KMeans(3)
+    k.fit(train_data)
+    label1 = k.predict(train_data)
+    visualize(train_data, dimention=2, label=label1, class_num=3, color=True)
+    print(compute_SC(train_data, label1, 3))
+
+    g = GaussianMixture(3)
+    g.fit(train_data)
+    label2 = g.predict(train_data)
+    visualize(train_data, label=label2, dimention=2, class_num=3, color=True)
+    print(compute_SC(train_data, label2, 3))
+
+    # print(compute_SC(train_data, train_label, 3))
+    # k = ClusteringAlgorithm()
+    # k.fit(train_data, method="Elbow")
+    # k.predict(train_data)
+    # e = ClusteringAlgorithm()
+    # e.fit(train_data, method="Canopy")
diff --git a/assignment-3/submission/18307130116/tester_demo.py b/assignment-3/submission/18307130116/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89
--- /dev/null
+++ b/assignment-3/submission/18307130116/tester_demo.py
@@ -0,0 +1,117 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def data_1():
+    mean = (1, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+
+    data, _ = shuffle(x, y, z)
+    return (data, data), 3
+
+
+def data_2():
+    train_data = np.array([
+        [23, 12, 173, 2134],
+        [99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def data_3():
+    train_data = np.array([
+        [23],
+        [-2999],
+        [-2955],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(
+        res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_3, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        try:
+            res = case[2] if case[1]() else 0
+        except Exception as e:
+            if err_report:
+                print("Error [{}] occurs in {}".format(str(e), case[0]))
+            res = 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()