diff --git a/assignment-3/handout/README.md b/assignment-3/handout/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e086659bd86f27b6eab9ffb5684e76f50641dde5
--- /dev/null
+++ b/assignment-3/handout/README.md
@@ -0,0 +1,43 @@
+# Assignment 3. Clustering Algorithms
+
+- **Name: 高庆麾**
+- **Student ID: 19307130062**
+
+
+
+## Part 0. Notes on the Implementations
+
+The K-Means and GMM implementations carry fairly detailed comments in the code; most of the implementation steps and details are annotated at the corresponding places, so please refer to the code itself.
+
+
+
+## Part 1. Automatic Clustering Experiments
+
+The data consist of $n = 500$ points of dimension $dim = 2$, sampled from Gaussians with standard deviation $1$; the centers are drawn uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to increase their separation.
+
+The results for cluster counts $3,\ 5,\ 7,\ 9,\ 11,\ 13$ are as follows:
+
+*(figures: the six elbow-method plots, "Elbow Method ALL15 500, 2, K" for K = 3, 5, 7, 9, 11, 13)*
+
+Judging purely by eye, the elbow method combined with a self-devised knee-selection rule (described in detail in the comments of the corresponding part of the code) matches manual inspection very well; the generated clusters, however, are rather irregular, and their positions are hard to capture through the sum of distances alone.
+
+
+
+## Part 2. Visualizing the Clustering Results
+
+The data consist of $n = 500$ points of dimension $dim = 2$, sampled from Gaussians with standard deviation $1$; the centers are drawn uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to increase their separation.
+
+The results of KMeans with $20$ iterations and cluster counts $5,\ 9,\ 13$ are as follows:
+
+*(figures: the three KMeans scatter plots, "500, 2, K" for K = 5, 9, 13)*
+
+Note that as the cluster count grows, because the generated data overlap heavily, KMeans produces partitions that look better on visual inspection, even though they may differ from the way the original data were generated.
+
+
+
+If we increase the separation between the clusters somewhat, GMM with $20$ iterations and cluster counts $5,\ 9,\ 13$ gives the following results:
+
+*(figures: the three GMM scatter plots, "500, 2, K" for K = 5, 9, 13)*
+
+Quite often the means of several Gaussian components collapse onto nearly the same point; thinking through the algorithm's logic suggests that the implementation does indeed have a problem here.
+
+
+
diff --git a/assignment-3/submission/19307130062/README.md b/assignment-3/submission/19307130062/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0dc22b88270f3e935c6160b60b0443b98021e005
--- /dev/null
+++ b/assignment-3/submission/19307130062/README.md
@@ -0,0 +1,45 @@
+# Assignment 3. Clustering Algorithms
+
+- **Name: 高庆麾**
+- **Student ID: 19307130062**
+
+
+
+## Part 0. Notes on the Implementations
+
+The K-Means and GMM implementations carry fairly detailed comments in the code; most of the implementation steps and details are annotated at the corresponding places, so please refer to the code itself.
+
+The experiments also implement a GMM that supports high-dimensional (more than two-dimensional) data. Again, please refer to the code for details.
+
+
+
+## Part 1. Automatic Clustering Experiments
+
+The data consist of $n = 500$ points of dimension $dim = 2$, sampled from Gaussians with standard deviation $1$; the centers are drawn uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to increase their separation.
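+
+A minimal sketch of this generation step (the variable names are illustrative; a similar recipe is implemented in `test_model` of `source.py`):
+
+```python
+import numpy as np
+
+n, dim, n_clusters = 500, 2, 5                             # generation parameters
+center = np.random.rand(n_clusters, dim) * 10.             # centers from [0, 1], scaled up
+label = np.random.randint(n_clusters, size = n)            # one generating cluster per point
+data = np.random.normal(loc = center[label], scale = 1.)   # Gaussian samples with standard deviation 1
+```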
+
+The results for cluster counts $3,\ 5,\ 7,\ 9,\ 11,\ 13$ are as follows:
+
+*(figures: the six elbow-method plots in `img/`, named "Elbow Method ALL15 500, 2, K" for K = 3, 5, 7, 9, 11, 13)*
+
+Judging purely by eye, the elbow method combined with a self-devised knee-selection rule (described in detail in the comments of the corresponding part of the code) matches manual inspection very well; the generated clusters, however, are rather irregular, and their positions are hard to capture through the sum of distances alone.
+
+
+
+## Part 2. Visualizing the Clustering Results
+
+The data consist of $n = 500$ points of dimension $dim = 2$, sampled from Gaussians with standard deviation $1$; the centers are drawn uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to increase their separation.
+
+The results of KMeans with $20$ iterations and cluster counts $5,\ 9,\ 13$ are as follows:
+
+*(figures: the three KMeans scatter plots in `img/`, named "500, 2, K" for K = 5, 9, 13)*
+
+Note that as the cluster count grows, because the generated data overlap heavily, KMeans produces partitions that look better on visual inspection, even though they may differ from the way the original data were generated.
+
+
+
+If we increase the separation between the clusters somewhat, GMM with $20$ iterations and cluster counts $5,\ 9,\ 13$ gives the following results:
+
+*(figures: the three GMM scatter plots in `img/`, named "500, 2, K" for K = 5, 9, 13)*
+
+Quite often the means of several Gaussian components collapse onto nearly the same point; thinking through the algorithm's logic suggests that the implementation does indeed have a problem here.
+
+
+
diff --git a/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png b/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..ccd4f133638733bc11b883b1c4efb28c973cfcd0
Binary files /dev/null and b/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png b/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..1703390178b74d5eb6c91fcd2be15c73d2a6
Binary files /dev/null and b/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png b/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png
new file mode 100644
index 0000000000000000000000000000000000000000..fb5190b7c07a13650459efc2310e3fdd4af99601
Binary files /dev/null and b/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png differ
diff --git a/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png b/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..c9ff4a6a358650703bf75f4ae2060a9614ea620e
Binary files /dev/null and b/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png b/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..6013b0c9b5fd5f5e196f4f1185cba797c0302fd4
Binary files /dev/null and b/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png b/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png
new file mode 100644
index 0000000000000000000000000000000000000000..8cda99abd6d831bce4fff3af8394eb04d13df80f
Binary files /dev/null and b/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png differ
diff --git a/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png b/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc6d5517c5e3011fd8ecab17756a6522ebb4a4b3
Binary files /dev/null and b/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png b/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc61f97062a45b7458efa5426d9e2c3507c5fd8c
Binary files /dev/null and b/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png b/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..927d6bb4a872518e9e45bbaf8279535f841364c0
Binary files /dev/null and b/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png b/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..a69e8952e596f1e116461a938f1d0e0c54397ad4
Binary files /dev/null and b/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png b/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..bc886e28b1c714d7f7a84d32cb7f0a0eaf9de361
Binary files /dev/null and b/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png b/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc82075241af36966c6e0e73228fe28e0455a078
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png b/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..8ba9029afec51a0b3be7804f4ec6dc2421c5521b
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png b/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..de70dc270b7fa94f8ba14326119526d9ccaccf83
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png b/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png
new file mode 100644
index 0000000000000000000000000000000000000000..d45b29105396db5fc6ee6942a5c8245434327ec8
Binary files /dev/null and b/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png differ
diff --git a/assignment-3/submission/19307130062/source.py b/assignment-3/submission/19307130062/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..245ced90da1ec2939f98772a5013cb57bf1c4c1e
--- /dev/null
+++ b/assignment-3/submission/19307130062/source.py
@@ -0,0 +1,456 @@
+import numpy as np
+
+def distance(x, y, type = "euclidean", lp = -1.):
+    """
+    Compute one of several distances: the Euclidean, Manhattan and Chebyshev
+    distances, as well as the general Lp norm (p must be a positive real
+    number not exceeding 100). Returns 0 on an invalid request.
+    """
+
+    x = x.astype(np.float64)
+    y = y.astype(np.float64)
+
+    if type == "euclidean":
+        return np.sqrt(sum((x - y)**2))
+
+    elif type == "manhattan":
+        return sum(abs(x - y))
+
+    elif type == "chebyshev":
+        return max(abs(x - y))
+
+    elif type == "Lp":
+        if 0 < lp <= 100.:
+            return sum((abs(x - y)) ** lp) ** (1. / lp)
+        else:
+            print("Error: Lp-norm is illegal.")
+            return 0.
+
+    else:
+        print("Error: can't identify the type of distance.")
+        return 0.
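+
+# Usage sketch (hypothetical values): with x = np.array([0., 0.]) and
+# y = np.array([3., 4.]), distance(x, y) == 5.0 (Euclidean),
+# distance(x, y, "manhattan") == 7.0, distance(x, y, "chebyshev") == 4.0,
+# and distance(x, y, "Lp", lp = 1.) == 7.0.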
+
+
+def preprocess(x, type = "min_max"):
+    """
+    Normalize the data with either the Min-Max or the Z-score method.
+    """
+
+    x = x.astype(np.float64)
+
+    if type == "min_max":
+        for dim in range(x.shape[1]):
+            dim_min, dim_max = min(x[:, dim]), max(x[:, dim])
+            x[:, dim] = (x[:, dim] - dim_min) / (dim_max - dim_min)
+        return x
+
+    elif type == "z_score":
+        for dim in range(x.shape[1]):
+            std, mu = np.std(x[:, dim]), np.mean(x[:, dim])
+            x[:, dim] = (x[:, dim] - mu) / std
+        return x
+
+    elif type == "none":
+        return x
+
+    else:
+        print("Error: can't identify the type of preprocessing.")
+        return x
+
+class KMeans:
+
+    def __init__(self, n_clusters, preprocess_type = "min_max"):  # Min-Max normalization by default
+        self.K = n_clusters
+        self.preprocess_type = preprocess_type
+
+    def fit(self, train_data, iter_times = 10, debug = False):
+
+        train_data = preprocess(train_data, type = self.preprocess_type)
+        n = train_data.shape[0]                           # number of training points
+        cluster_size = np.zeros((self.K, 1)).astype(int)  # number of points per cluster, cast to int
+        train_label = np.zeros((n)).astype(int)           # cluster index of each point, cast to int
+
+        # self.center = np.random.rand(self.K, train_data.shape[1])
+        self.center = train_data[: self.K].copy()
+        # use the first K samples as the initial centers (the data should be shuffled
+        # beforehand); the copy is required, because zeroing the centers below would
+        # otherwise overwrite the first K training points through the shared view
+
+        for T in range(iter_times):
+
+            # print("Iteration {0}:\t".format(T + 1), end = '')
+
+            sum_of_distance = 0.
+
+            for i in range(n):
+
+                min_dist = distance(train_data[i], self.center[self.K - 1])
+                min_dist_cluster = self.K - 1
+
+                for j in range(self.K - 1):
+                    dis = distance(train_data[i], self.center[j])
+                    if dis < min_dist:
+                        min_dist = dis
+                        min_dist_cluster = j
+
+                train_label[i] = min_dist_cluster
+                sum_of_distance += min_dist
+            # assign each point to its nearest cluster center
+
+            self.center.fill(0)
+            cluster_size.fill(0)
+
+            for i in range(n):
+                self.center[train_label[i]] += train_data[i]
+                cluster_size[train_label[i]] += 1
+
+            for j in range(self.K):
+                if cluster_size[j] == 0:
+                    self.center[j] = np.random.rand(train_data.shape[1])
+                    # if a cluster contains no points, draw a fresh random center
+                else:
+                    self.center[j] /= cluster_size[j]
+                    # otherwise move the center to the mean of its points
+
+            # print("{0}".format(sum_of_distance))
+
+        if debug:
+            return self.center, sum_of_distance
+
+    def predict(self, test_data):
+
+        test_data = preprocess(test_data, type = self.preprocess_type)
+
+        n = test_data.shape[0]                  # number of test points
+        test_label = np.zeros((n)).astype(int)  # cluster index of each point, cast to int
+
+        for i in range(n):
+
+            min_dist = distance(test_data[i], self.center[self.K - 1])
+            min_dist_cluster = self.K - 1
+
+            for j in range(self.K - 1):
+                dis = distance(test_data[i], self.center[j])
+                if dis < min_dist:
+                    min_dist = dis
+                    min_dist_cluster = j
+
+            test_label[i] = min_dist_cluster
+            # assign each point to its nearest cluster center
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(test_data[i], self.center[test_label[i]])
+        # total distance from each point to its cluster center
+
+        return test_label
+
+class GaussianMixture:
+
+    def __init__(self, n_clusters, preprocess_type = "min_max"):  # Min-Max normalization by default
+        self.K = n_clusters
+        self.preprocess_type = preprocess_type
+
+    def fit(self, train_data, iter_times = 20, debug = False):  # supports multivariate Gaussian mixtures
+
+        train_data = preprocess(train_data, self.preprocess_type)
+
+        dim = train_data.shape[1]                # dimension of the training data
+        n = train_data.shape[0]                  # number of training points
+        gamma = np.zeros((n, self.K))            # the GMM responsibilities: how strongly each point belongs to each cluster
+        train_label = np.zeros((n)).astype(int)  # cluster index of each point, cast to int
+
+        self.mu = np.random.rand(self.K, dim)              # means of the Gaussian components
+        self.cov = np.array([np.eye(dim)] * self.K) * 0.1  # covariance matrices of the components
+        self.alpha = np.array([1.0 / self.K] * self.K)     # mixture weights of the components
+        self.epsilon = 1e-10                               # a constant guarding against numerical issues
+
+        for T in range(iter_times):
+            gamma.fill(0)
+
+            """
+            After consulting various online resources and the relevant scipy sources:
+            to evaluate gamma for many vectors under a multivariate Gaussian quickly
+            (i.e., as matrix products), the inverse covariance has to be factored into
+            the product of two mutually transposed matrices, which are then multiplied
+            with (x - mu) and (x - mu)^T respectively; only then does the computation
+            batch properly. As this is fairly involved, the brute-force approach of
+            computing one gamma entry at a time is used here instead.
+
+            To keep numerical problems at bay, log(gamma) is computed first and is
+            only exponentiated back at the end.
+            """
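+            # A vectorized sketch of the batch computation described above (for
+            # reference only, not used below; dev would have shape (n, dim)):
+            #     dev = train_data - self.mu[k]
+            #     quad = np.einsum('ij,jk,ik->i', dev, cov_inv, dev)  # all quadratic forms at once
+            #     gamma[:, k] = -0.5 * quad + tail
+            # Equivalently, with L = np.linalg.cholesky(self.cov[k]), solving
+            # z = np.linalg.solve(L, dev.T) and taking quad = (z ** 2).sum(axis = 0)
+            # realizes the "two mutually transposed factors" idea mentioned above.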
+            for k in range(self.K):
+                tail = np.log(self.alpha[k]) - 0.5 * (dim * np.log(2 * np.pi) + np.log(np.linalg.det(self.cov[k])))
+                # the additive remainder of the weighted log-density once the exp part
+                # is dropped; it can be computed once per component
+                cov_inv = np.linalg.inv(self.cov[k])
+                # invert the covariance matrix
+
+                for i in range(n):
+                    dev = train_data[i] - self.mu[k]
+                    gamma[i, k] = -0.5 * (np.matmul(np.matmul(dev, cov_inv), dev.T)) + tail
+                    # the value contributed by the exp part
+
+            gamma = np.exp(gamma)
+            # recover gamma from its logarithm
+            gamma /= (gamma.sum(axis = 1).reshape(gamma.shape[0], -1))
+            # normalize the responsibilities of each point across the clusters
+
+            self.mu = np.matmul(gamma.T, train_data) / gamma.sum(axis = 0).reshape(self.K, -1)
+            # update the component means; the division relies on broadcasting
+            for k in range(self.K):
+                dev = gamma[:, k].reshape(n, -1) * (train_data - self.mu[k])
+                self.cov[k] = np.matmul(dev.T, train_data - self.mu[k]) / gamma[:, k].sum()
+                self.cov[k] += np.eye(dim) * self.epsilon  # add \lambda I to the diagonal so the covariance stays non-singular
+                # update the component covariance matrices
+            self.alpha = gamma.sum(axis = 0) / n
+            # update the mixture weights
+
+        for i in range(n):
+            train_label[i] = np.argmax(gamma[i, :])
+            # assign each point to its most probable cluster (one could instead treat
+            # gamma as a distribution and sample from it, but that would be less
+            # stable); argmax also handles ties, unlike indexing with np.nonzero
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(train_data[i], self.mu[train_label[i]])
+        # total distance from each point to its cluster center
+
+        if debug:
+            return self.mu, sum_of_distance
+
+
+    def predict(self, test_data):
+
+        test_data = preprocess(test_data, self.preprocess_type)
+
+        n = test_data.shape[0]                  # number of test points
+        dim = test_data.shape[1]                # dimension of the test data
+        test_label = np.zeros((n)).astype(int)  # cluster index of each point, cast to int
+        gamma = np.zeros((n, self.K))           # the GMM responsibilities, as in fit
+
+        for k in range(self.K):
+            tail = np.log(self.alpha[k]) - 0.5 * (dim * np.log(2 * np.pi) + np.log(np.linalg.det(self.cov[k])))
+            cov_inv = np.linalg.inv(self.cov[k])
+            for i in range(n):
+                dev = test_data[i] - self.mu[k]
+                gamma[i, k] = -0.5 * (np.matmul(np.matmul(dev.T, cov_inv), dev)) + tail
+                # the same gamma computation as in fit
+
+        for i in range(n):
+            test_label[i] = np.argmax(gamma[i, :])
+            # assign each point to its most probable cluster
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(test_data[i], self.mu[test_label[i]])
+        # total distance from each point to its cluster center
+
+        return test_label
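+
+# A minimal usage sketch (hypothetical shapes): KMeans and GaussianMixture share
+# the same interface, i.e. construct with a cluster count, fit, then predict.
+#     data = np.random.rand(100, 2)
+#     km = KMeans(n_clusters = 3); km.fit(data, iter_times = 20); labels = km.predict(data)
+#     gmm = GaussianMixture(n_clusters = 3); gmm.fit(data); labels = gmm.predict(data)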
+
+#----------------- Utils ----------------- #
+import matplotlib.pyplot as plt
+from matplotlib import ticker
+
+def line_plot_init(n_clusters):
+    fig, ax = plt.subplots()
+    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
+    ax.axvline(n_clusters, ls = '-', linewidth = 1.0, c = 'blue')
+    return fig, ax
+
+def line_plot(ax, X, Y, c, label):
+    ax.plot(X, Y, color = c, linewidth = 1.0, label = label)
+    ax.scatter(X, Y, color = c, s = 4.0, marker = 'o')
+    ax.legend()
+
+def line_plot_finish(fig, ax, filename):
+    fig.savefig(filename, dpi = 1000)
+    fig.show()
+
+def gen_random_filename(l = 10):
+    s = ''
+    for i in range(l):
+        s += (chr(np.random.randint(26) + ord('a')))
+    s += ' '
+    return s
+#----------------- Utils ----------------- #
+
+class ClusteringAlgorithm:
+    def __init__(self, algorithm_tag = "ALL", max_clusters = 15, n = 0, dim = 0, n_clusters = 0):
+        """
+        algorithm_tag: the algorithms to try; one of "KMeans", "GMM" and "ALL" (try everything)
+        max_clusters: the maximum number of clusters to consider
+        n, dim, n_clusters: the parameters used to generate the raw data
+        """
+
+        self.algorithm_tag = algorithm_tag
+        self.max_clusters = max_clusters
+        self.n, self.dim, self.n_clusters = n, dim, n_clusters
+
+    def fit(self, train_data):
+        fig, ax = line_plot_init(self.n_clusters)
+        # initialize plt and mark the cluster count used to generate the raw data
+
+        suffix = str(self.n) + ', ' + str(self.dim) + ', ' + str(self.n_clusters) + ".png"  # record the parameters in the file name
+
+        if self.algorithm_tag == "KMeans" or self.algorithm_tag == "ALL":
+
+            sod = np.zeros((self.max_clusters))  # the sum_of_distance (total point-to-center distance) per cluster count
+
+            for n_clusters in range(1, self.max_clusters):
+                model = KMeans(n_clusters)
+                center, sod[n_clusters] = model.fit(train_data, iter_times = 20, debug = True)
+
+            line_plot(ax, np.arange(1, self.max_clusters), sod[1: self.max_clusters], "C3", "KMeans")
+            # visualize sum_of_distance
+
+            thres, min_sod, min_sod_K, step = 3, sod[1], 1, sod[1] * 0.1
+            # respectively: the lookahead budget, the minimum sum_of_distance so far,
+            # the cluster count attaining it, and the fluctuation threshold
+            for n_clusters in range(2, self.max_clusters):
+                if sod[n_clusters] < min_sod - step:
+                    min_sod = sod[n_clusters]
+                    min_sod_K = n_clusters
+                    thres = 3
+                else:
+                    thres -= 1
+                    if thres == 0:
+                        break
+            """
+            This is a self-devised rule for locating the knee of the elbow method; in practice it matches manual inspection closely.
+            - First, since the sum_of_distance of a single cluster usually changes drastically, the fluctuation threshold 'step' is set to 10% of sod[1] (an empirical parameter).
+            - The sum_of_distance array is then scanned from left to right, requiring each value to undercut the current minimum by at least 'step'. If it does, the minimum and the candidate cluster count are updated and the lookahead budget 'thres' is restored. Otherwise the step counts as a failure and 'thres' decreases by 1.
+            - As soon as 'thres' reaches 0 the scan stops; the current candidate becomes the cluster count chosen by the algorithm and is used for retraining and prediction.
+            """
+
+            ax.axvline(min_sod_K, ls = '--', linewidth = 1.0, c = 'C3')
+            # mark the cluster count the algorithm considers optimal
+
+            # print(min_sod_K)
+
+            self.model = KMeans(min_sod_K)
+            self.model.fit(train_data, iter_times = 50)
+            # K-Means is the preferred algorithm; retrain it with more iterations
+
+
+        if self.algorithm_tag == "GMM" or self.algorithm_tag == "ALL":
+
+            sod = np.zeros((self.max_clusters))  # the sum_of_distance array, as above
+
+            for n_clusters in range(1, self.max_clusters):
+                model = GaussianMixture(n_clusters)
+                center, sod[n_clusters] = model.fit(train_data, iter_times = 20, debug = True)
+
+            line_plot(ax, np.arange(1, self.max_clusters), sod[1: self.max_clusters], "C1", "GMM")
+            # visualize sum_of_distance
+
+            thres, min_sod, min_sod_K, step = 3, sod[1], 1, sod[1] * 0.1
+            for n_clusters in range(2, self.max_clusters):
+                if sod[n_clusters] < min_sod - step:
+                    min_sod = sod[n_clusters]
+                    min_sod_K = n_clusters
+                    thres = 3
+                else:
+                    thres -= 1
+                    if thres == 0:
+                        break
+            # the same knee-finding procedure as above; see the detailed description there
+
+            ax.axvline(min_sod_K, ls = '--', linewidth = 1.0, c = 'C1')
+            # mark the cluster count the algorithm considers optimal
+
+            # print(min_sod_K)
+
+            if self.algorithm_tag == "GMM":
+                self.model = GaussianMixture(min_sod_K)
+                self.model.fit(train_data, iter_times = 50)
+                # K-Means is preferred; GMM is only used when explicitly requested,
+                # again retrained with more iterations
+
+        line_plot_finish(fig, ax, gen_random_filename() + "Elbow Method " + self.algorithm_tag + str(self.max_clusters) + ' ' + suffix)
+
+    def predict(self, test_data):
+        return self.model.predict(test_data)
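+
+# A usage sketch of the automatic clustering wrapper (illustrative values matching
+# the experiments in the README):
+#     ca = ClusteringAlgorithm(algorithm_tag = "ALL", max_clusters = 15, n = 500, dim = 2, n_clusters = 5)
+#     ca.fit(train_data)        # scans 1..14 clusters and retrains on the chosen K
+#     labels = ca.predict(train_data)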
+
+
+import matplotlib.pyplot as plt
+
+def data_plot_init():
+    return plt.subplots(1, 2)
+
+def data_plot(fig, ax, pos, data, label, center):
+
+    for i in range(data.shape[0]):
+        ax[pos].scatter(data[i, 0], data[i, 1], color = "C" + str(label[i]), marker = 'o')
+    # plot every point, colored by its cluster
+    for i in range(center.shape[0]):
+        ax[pos].scatter(center[i, 0], center[i, 1], color = "black", marker = 'x')
+    # plot the cluster centers
+
+def data_plot_finish(fig, filename):
+    fig.savefig(filename, dpi = 1000)
+    # fig.show()
+
+def test_model(n = 500, dim = 2, n_clusters = 5, type = "GMM"):
+    """
+    n: number of vectors
+    dim: dimension of the vectors
+    n_clusters: number of Gaussian clusters used to generate the data
+    type: the algorithm to test; one of "KMeans", "GMM" and "CA" (automatic clustering)
+    """
+
+    center = np.random.rand(n_clusters, dim) * 30.  # draw n_clusters random centers, scaled by a factor to increase their separation
+    train_data = np.zeros((n, dim))                 # the training data array
+    train_label = np.zeros((n)).astype(int)         # the label array
+
+    for i in range(n):
+        train_label[i] = np.random.randint(n_clusters)
+        train_data[i] = np.random.normal(loc = center[train_label[i], :], scale = 1.)
+        # pick the i-th point's center at random, use it as the Gaussian mean, and sample with standard deviation 1
+
+    idx = np.arange(n)
+    np.random.shuffle(idx)
+    train_label, train_data = train_label[idx], train_data[idx]
+    # shuffle the data
+
+    fig, ax = data_plot_init()
+    suffix = ' '
+
+    if dim == 2:
+        suffix = str(n) + ', ' + str(dim) + ', ' + str(n_clusters) + ".png"
+        data_plot(fig, ax, 0, train_data, train_label, center)
+        # visualize the data, recording the parameters in the file name
+
+    if type == "KMeans":  # test K-Means
+        model = KMeans(n_clusters)
+        _center, sod = model.fit(train_data, debug = True)  # with debug = True, fit returns the trained centers and the total point-to-center distance
+        _label = model.predict(train_data)
+
+        err = 0
+        for i in range(n):
+            for j in range(i):
+                if (train_label[i] == train_label[j] and _label[i] != _label[j]) or (train_label[i] != train_label[j] and _label[i] == _label[j]):
+                    err += 1
+        print("Accuracy: {0}%".format(100 - err / (n * (n - 1) / 2) * 100))
+        # for every pair of points, check whether their status (same cluster or not)
+        # agrees between the original data and the prediction, and derive an accuracy from that
+
+        if dim == 2:
+            data_plot(fig, ax, 1, preprocess(train_data), _label, _center)
+            # visualize the clustering of the two-dimensional data
+
+    elif type == "GMM":  # test the Gaussian Mixture Model
+        model = GaussianMixture(n_clusters)
+        _center, sod = model.fit(train_data, debug = True)  # with debug = True, fit returns the trained centers and the total point-to-center distance
+        _label = model.predict(train_data)
+
+        err = 0
+        for i in range(n):
+            for j in range(i):
+                if (train_label[i] == train_label[j] and _label[i] != _label[j]) or (train_label[i] != train_label[j] and _label[i] == _label[j]):
+                    err += 1
+        print("Accuracy: {0}%".format(100 - err / (n * (n - 1) / 2) * 100))
+        # the same pairwise accuracy as in the K-Means branch
+
+        if dim == 2:
+            data_plot(fig, ax, 1, preprocess(train_data), _label, _center)
+            # visualize the clustering of the two-dimensional data
+
+    elif type == "CA":  # test the automatic clustering algorithm
+        model = ClusteringAlgorithm(n = n, dim = dim, n_clusters = n_clusters)
+        model.fit(train_data)
+        model.predict(train_data)
+
+    if dim == 2:
+        data_plot_finish(fig, gen_random_filename() + suffix)
+
+# test_model()
+# Start testing!
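+
+# Note on the pairwise "Accuracy" above: it is a Rand-index-style score, so it is
+# invariant to permutations of the cluster labels. Hypothetical example with three
+# points: ground truth [0, 0, 1] and prediction [1, 1, 0] agree on all three pairs
+# (same/same, different/different, different/different), giving 100%.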
diff --git a/assignment-3/submission/19307130062/tester_demo.py b/assignment-3/submission/19307130062/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..2eb8ff3e26ad5546412a857fc0bb1e870c08b6de
--- /dev/null
+++ b/assignment-3/submission/19307130062/tester_demo.py
@@ -0,0 +1,112 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def data_1():
+    mean = (1, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+
+    data, _ = shuffle(x, y, z)
+    return (data, data), 3
+
+
+def data_2():
+    train_data = np.array([
+        [23, 12, 173, 2134],
+        [99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def data_3():
+    train_data = np.array([
+        [23],
+        [-2999],
+        [-2955],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_3, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        res = case[2] if case[1]() else 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()
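+
+# Run as: python tester_demo.py            (scores the enabled test cases)
+# or:     python tester_demo.py --report   (the same, with err_report enabled)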