diff --git a/assignment-3/handout/README.md b/assignment-3/handout/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e086659bd86f27b6eab9ffb5684e76f50641dde5
--- /dev/null
+++ b/assignment-3/handout/README.md
@@ -0,0 +1,43 @@
+# Assignment 3. Clustering Algorithms
+
+- **Name: 高庆麾**
+- **Student ID: 19307130062**
+
+
+
+## Part 0. Implementation Notes
+
+The K-Means and GMM implementations carry fairly detailed comments in the code; most of the implementation steps and details are annotated in place, so please refer to the code.
+
+
+
+## Part 1. Automatic Clustering Experiments
+
+The generated dataset contains $n = 500$ points of dimension $dim = 2$; each point is drawn from a Gaussian with standard deviation $1$ whose center is sampled uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to widen the spacing between clusters.
+
+The results for cluster counts $3,\ 5,\ 7,\ 9,\ 11,\ 13$ are as follows:
+
+




+
+Judging purely by eye, the elbow method combined with a custom rule for picking the knee point (see the comments in the corresponding part of the code for details) matches what manual inspection would suggest very well. The generated clusters are admittedly rather irregular, though, and their layout is hard to read off the sum-of-distances curve.
+
+
+
+## Part 2. Visualizing the Clustering Results
+
+The generated dataset contains $n = 500$ points of dimension $dim = 2$; each point is drawn from a Gaussian with standard deviation $1$ whose center is sampled uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to widen the spacing between clusters.
+
+With $20$ iterations, the KMeans results for cluster counts $5,\ 9,\ 13$ are as follows:
+

+
+We can see that as the number of clusters grows, because the generated data is heavily intermixed, KMeans produces results that look better to the eye, even though they may differ from how the original data was actually generated.
+
+
+
+If we widen the spacing between the clusters somewhat, the GMM results with $20$ iterations and cluster counts $5,\ 9,\ 13$ are as follows:
+

+
+We can see that in many runs the means of the Gaussian components tend to collapse onto one another; thinking through the algorithm's logic suggests that it does indeed have this weakness.
+
+
+
diff --git a/assignment-3/submission/19307130062/README.md b/assignment-3/submission/19307130062/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0dc22b88270f3e935c6160b60b0443b98021e005
--- /dev/null
+++ b/assignment-3/submission/19307130062/README.md
@@ -0,0 +1,45 @@
+# Assignment 3. Clustering Algorithms
+
+- **Name: 高庆麾**
+- **Student ID: 19307130062**
+
+
+
+## Part 0. Implementation Notes
+
+The K-Means and GMM implementations carry fairly detailed comments in the code; most of the implementation steps and details are annotated in place, so please refer to the code.
+
+The experiments also implement a high-dimensional GMM that supports data of more than two dimensions. Again, please refer to the code for details.
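+
+A quick way to exercise the high-dimensional code path (a minimal sketch; the random 5-dimensional data and all parameter values here are purely illustrative):
+
+```python
+import numpy as np
+from source import GaussianMixture
+
+data = np.random.rand(200, 5)            # 200 points in 5 dimensions
+model = GaussianMixture(n_clusters = 3)
+model.fit(data, iter_times = 20)
+labels = model.predict(data)             # one cluster index per point
+```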
+
+
+
+## Part 1. Automatic Clustering Experiments
+
+The generated dataset contains $n = 500$ points of dimension $dim = 2$; each point is drawn from a Gaussian with standard deviation $1$ whose center is sampled uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to widen the spacing between clusters.
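+
+For reference, the generation procedure is roughly the following (a minimal sketch, assuming `numpy`; the variable names and the exact scaling factor are illustrative and may differ from the generation code in `source.py`):
+
+```python
+import numpy as np
+
+n, dim, n_clusters = 500, 2, 9
+centers = np.random.rand(n_clusters, dim) * 10.             # centers from [0, 1], scaled up
+labels = np.random.randint(n_clusters, size = n)            # pick a cluster for every point
+data = np.random.normal(loc = centers[labels], scale = 1.)  # Gaussian noise with std 1
+```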
+
+The results for cluster counts $3,\ 5,\ 7,\ 9,\ 11,\ 13$ are as follows:
+
+




+
+Judging purely by eye, the elbow method combined with a custom rule for picking the knee point (see the comments in the corresponding part of the code for details; a condensed version is sketched below) matches what manual inspection would suggest very well. The generated clusters are admittedly rather irregular, though, and their layout is hard to read off the sum-of-distances curve.
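+
+Condensed, the knee-selection rule works as follows (a sketch of the heuristic implemented in `source.py`; the function name `pick_knee` and the `patience` parameter are ours for illustration, and `sod[k]` holds the sum of distances obtained with `k` clusters):
+
+```python
+def pick_knee(sod, max_clusters, patience = 3):
+    # 'step' is the smallest drop still considered significant:
+    # 10% of the sum of distances at K = 1 (an empirical choice)
+    thres, min_sod, min_sod_K, step = patience, sod[1], 1, sod[1] * 0.1
+    for k in range(2, max_clusters):
+        if sod[k] < min_sod - step:   # significant drop: accept k as the candidate
+            min_sod, min_sod_K, thres = sod[k], k, patience
+        else:                         # no significant drop: lose patience
+            thres -= 1
+            if thres == 0:
+                break
+    return min_sod_K
+```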
+
+
+
+## Part 2. Visualizing the Clustering Results
+
+The generated dataset contains $n = 500$ points of dimension $dim = 2$; each point is drawn from a Gaussian with standard deviation $1$ whose center is sampled uniformly from $[0,\ 1]$ and then scaled by a factor of $10$ to widen the spacing between clusters.
+
+With $20$ iterations, the KMeans results for cluster counts $5,\ 9,\ 13$ are as follows:
+

+
+We can see that as the number of clusters grows, because the generated data is heavily intermixed, KMeans produces results that look better to the eye, even though they may differ from how the original data was actually generated.
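+
+For reference, the runs above use the `fit`/`predict` interface from `source.py` roughly like this (a minimal sketch; `train_data` stands for the generated data described above):
+
+```python
+from source import KMeans
+
+model = KMeans(n_clusters = 5)          # also run with 9 and 13
+model.fit(train_data, iter_times = 20)
+labels = model.predict(train_data)      # cluster index for every point
+```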
+
+
+
+If we widen the spacing between the clusters somewhat, the GMM results with $20$ iterations and cluster counts $5,\ 9,\ 13$ are as follows:
+

+
+We can see that in many runs the means of the Gaussian components tend to collapse onto one another; thinking through the algorithm's logic suggests that it does indeed have this weakness.
+
+
+
diff --git a/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png b/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..ccd4f133638733bc11b883b1c4efb28c973cfcd0
Binary files /dev/null and b/assignment-3/submission/19307130062/img/acgtcnevvn 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png b/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..1703390178b74d5edfb1b6c91fcd2be15c73d2a6
Binary files /dev/null and b/assignment-3/submission/19307130062/img/boqdtamgmq Elbow Method ALL15 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png b/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png
new file mode 100644
index 0000000000000000000000000000000000000000..fb5190b7c07a13650459efc2310e3fdd4af99601
Binary files /dev/null and b/assignment-3/submission/19307130062/img/heyhertslr Elbow Method ALL15 500, 2, 11.png differ
diff --git a/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png b/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..c9ff4a6a358650703bf75f4ae2060a9614ea620e
Binary files /dev/null and b/assignment-3/submission/19307130062/img/hkmxtjeiaj Elbow Method ALL15 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png b/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..6013b0c9b5fd5f5e196f4f1185cba797c0302fd4
Binary files /dev/null and b/assignment-3/submission/19307130062/img/maddmqffwl Elbow Method ALL15 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png b/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png
new file mode 100644
index 0000000000000000000000000000000000000000..8cda99abd6d831bce4fff3af8394eb04d13df80f
Binary files /dev/null and b/assignment-3/submission/19307130062/img/ndrdqqpmea Elbow Method ALL15 500, 2, 7.png differ
diff --git a/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png b/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc6d5517c5e3011fd8ecab17756a6522ebb4a4b3
Binary files /dev/null and b/assignment-3/submission/19307130062/img/onaixzshct 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png b/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc61f97062a45b7458efa5426d9e2c3507c5fd8c
Binary files /dev/null and b/assignment-3/submission/19307130062/img/pbtjfkfwjw 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png b/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..927d6bb4a872518e9e45bbaf8279535f841364c0
Binary files /dev/null and b/assignment-3/submission/19307130062/img/uvqnndbvpu 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png b/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..a69e8952e596f1e116461a938f1d0e0c54397ad4
Binary files /dev/null and b/assignment-3/submission/19307130062/img/uxcngpxjhl 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png b/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..bc886e28b1c714d7f7a84d32cb7f0a0eaf9de361
Binary files /dev/null and b/assignment-3/submission/19307130062/img/vhklkqwtkz 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png b/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png
new file mode 100644
index 0000000000000000000000000000000000000000..dc82075241af36966c6e0e73228fe28e0455a078
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wishwtfbzz 500, 2, 9.png differ
diff --git a/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png b/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png
new file mode 100644
index 0000000000000000000000000000000000000000..8ba9029afec51a0b3be7804f4ec6dc2421c5521b
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wuxvkrvwya 500, 2, 13.png differ
diff --git a/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png b/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png
new file mode 100644
index 0000000000000000000000000000000000000000..de70dc270b7fa94f8ba14326119526d9ccaccf83
Binary files /dev/null and b/assignment-3/submission/19307130062/img/wyrclifhqj 500, 2, 5.png differ
diff --git a/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png b/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png
new file mode 100644
index 0000000000000000000000000000000000000000..d45b29105396db5fc6ee6942a5c8245434327ec8
Binary files /dev/null and b/assignment-3/submission/19307130062/img/xhynyscjpi Elbow Method ALL15 500, 2, 3.png differ
diff --git a/assignment-3/submission/19307130062/source.py b/assignment-3/submission/19307130062/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..245ced90da1ec2939f98772a5013cb57bf1c4c1e
--- /dev/null
+++ b/assignment-3/submission/19307130062/source.py
@@ -0,0 +1,456 @@
+import numpy as np
+
+def distance(x, y, type = "euclidean", lp = -1.):
+    """
+    Computes a variety of distances: Euclidean, Manhattan, Chebyshev, and
+    the more general Lp norm (p must be a positive real number not exceeding
+    100). Returns 0 for invalid arguments.
+    """
+
+    x = x.astype(np.float64)
+    y = y.astype(np.float64)
+
+    if type == "euclidean":
+        return np.sqrt(sum((x - y)**2))
+
+    elif type == "manhattan":
+        return sum(abs(x - y))
+
+    elif type == "chebyshev":
+        return max(abs(x - y))
+
+    elif type == "Lp":
+        if 0 < lp <= 100.:
+            return sum((abs(x - y)) ** lp) ** (1. / lp)
+        else:
+            print("Error: Lp-norm is illegal.")
+            return 0.
+
+    else:
+        print("Error: can't identify the type of distance.")
+        return 0.
+
+
+def preprocess(x, type = "min_max"):
+    """
+    Normalizes the data with either the Min-Max or the Z-score method.
+    """
+
+    x = x.astype(np.float64)
+
+    if type == "min_max":
+        for dim in range(x.shape[1]):
+            dim_min, dim_max = min(x[:, dim]), max(x[:, dim])
+            x[:, dim] = (x[:, dim] - dim_min) / (dim_max - dim_min)
+        return x
+
+    elif type == "z_score":
+        for dim in range(x.shape[1]):
+            std, mu = np.std(x[:, dim]), np.mean(x[:, dim])
+            x[:, dim] = (x[:, dim] - mu) / std
+        return x
+
+    elif type == "none":
+        return x
+
+    else:
+        print("Error: can't identify the type of preprocessing.")
+        return x
+
+class KMeans:
+
+    def __init__(self, n_clusters, preprocess_type = "min_max"): # Min-Max normalization by default
+        self.K = n_clusters
+        self.preprocess_type = preprocess_type
+
+    def fit(self, train_data, iter_times = 10, debug = False):
+
+        train_data = preprocess(train_data, type = self.preprocess_type)
+        n = train_data.shape[0]                          # number of training samples
+        cluster_size = np.zeros((self.K, 1)).astype(int) # number of points in each cluster, cast to int
+        train_label = np.zeros((n)).astype(int)          # cluster index of each point, cast to int
+
+        # self.center = np.random.rand(self.K, train_data.shape[1])
+        self.center = train_data[: self.K].copy()
+        # take the first K samples as the initial centers (the data should be shuffled
+        # beforehand); the copy ensures the in-place center updates below do not
+        # overwrite the training data itself
+
+        for T in range(iter_times):
+
+            # print("Iteration {0}:\t".format(T + 1), end = '')
+
+            sum_of_distance = 0.
+
+            for i in range(n):
+
+                min_dist = distance(train_data[i], self.center[self.K - 1])
+                min_dist_cluster = self.K - 1
+
+                for j in range(self.K - 1):
+                    dis = distance(train_data[i], self.center[j])
+                    if dis < min_dist:
+                        min_dist = dis
+                        min_dist_cluster = j
+
+                train_label[i] = min_dist_cluster
+                sum_of_distance += min_dist
+                # assign each point to its nearest cluster center
+
+            self.center.fill(0)
+            cluster_size.fill(0)
+
+            for i in range(n):
+                self.center[train_label[i]] += train_data[i]
+                cluster_size[train_label[i]] += 1
+
+            for j in range(self.K):
+                if cluster_size[j] == 0:
+                    self.center[j] = np.random.rand(train_data.shape[1])
+                    # if a cluster holds no points, draw a new random center for it
+                else:
+                    self.center[j] /= cluster_size[j]
+                    # otherwise move the center to the cluster mean
+
+
+            # print("{0}".format(sum_of_distance))
+
+        if debug:
+            return self.center, sum_of_distance
+
+    def predict(self, test_data):
+
+        test_data = preprocess(test_data, type = self.preprocess_type)
+
+        n = test_data.shape[0]                 # number of test samples
+        test_label = np.zeros((n)).astype(int) # cluster index of each point, cast to int
+
+        for i in range(n):
+
+            min_dist = distance(test_data[i], self.center[self.K - 1])
+            min_dist_cluster = self.K - 1
+
+            for j in range(self.K - 1):
+                dis = distance(test_data[i], self.center[j])
+                if dis < min_dist:
+                    min_dist = dis
+                    min_dist_cluster = j
+
+            test_label[i] = min_dist_cluster
+            # assign each point to its nearest cluster center
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(test_data[i], self.center[test_label[i]])
+            # total distance of all points to their cluster centers
+
+        return test_label
+
+class GaussianMixture:
+
+    def __init__(self, n_clusters, preprocess_type = "min_max"): # Min-Max normalization by default
+        self.K = n_clusters
+        self.preprocess_type = preprocess_type
+
+    def fit(self, train_data, iter_times = 20, debug = False): # supports multivariate Gaussian mixtures
+
+        train_data = preprocess(train_data, self.preprocess_type)
+
+        dim = train_data.shape[1]               # data dimensionality
+        n = train_data.shape[0]                 # number of training samples
+        gamma = np.zeros((n, self.K))           # GMM responsibilities: how likely each point is to belong to each cluster
+        train_label = np.zeros((n)).astype(int) # cluster index of each point, cast to int
+
+        self.mu = np.random.rand(self.K, dim)              # means of the Gaussian components
+        self.cov = np.array([np.eye(dim)] * self.K) * 0.1  # covariance matrices of the components
+        self.alpha = np.array([1.0 / self.K] * self.K)     # mixing weights (latent-variable priors)
+        self.epsilon = 1e-10                               # small constant guarding against numerical issues
+
+        for T in range(iter_times):
+            gamma.fill(0)
+
+            """
+            After consulting various online resources and the relevant scipy source
+            code: to compute the gamma of many vectors under one multivariate Gaussian
+            quickly (as matrix multiplications), the inverse covariance matrix has to
+            be factored into the product of two mutually transposed matrices, which
+            are then multiplied on either side by (x - mu) and (x - mu)^T (themselves
+            transposes of each other) to get the batch computation. Since that is
+            fairly involved, the brute-force approach of computing gamma one entry at
+            a time is used here.
+
+            To keep numerical issues at bay, log(gamma) is computed first and only
+            exponentiated back at the very end.
+            """
+
+            for k in range(self.K):
+                tail = np.log(self.alpha[k]) - 0.5 * (dim * np.log(2 * np.pi) + np.log(np.linalg.det(self.cov[k])))
+                # the log of the weighted Gaussian with the exp part dropped; can be precomputed
+                cov_inv = np.linalg.inv(self.cov[k])
+                # inverse of the covariance matrix
+
+                for i in range(n):
+                    dev = train_data[i] - self.mu[k]
+                    gamma[i, k] = -0.5 * (np.matmul(np.matmul(dev, cov_inv), dev.T)) + tail
+                    # value of the exp part
+
+            gamma = np.exp(gamma - gamma.max(axis = 1, keepdims = True))
+            # recover gamma via exp; subtracting the row-wise maximum first (log-sum-exp
+            # trick) avoids overflow and changes nothing after the normalization below
+            gamma /= (gamma.sum(axis = 1).reshape(gamma.shape[0], -1))
+            # normalize each data point's gammas over the clusters
+
+            self.mu = np.matmul(gamma.T, train_data) / gamma.sum(axis = 0).reshape(self.K, -1)
+            # update the component means; the computation relies on broadcasting
+            for k in range(self.K):
+                dev = gamma[:, k].reshape(n, -1) * (train_data - self.mu[k])
+                self.cov[k] = np.matmul(dev.T, train_data - self.mu[k]) / gamma[:, k].sum()
+                self.cov[k] += np.eye(dim) * self.epsilon # add \lambda I on the diagonal to keep the covariance matrix non-singular
+                # update the component covariance matrices
+            self.alpha = gamma.sum(axis = 0) / n
+            # update the mixing weights
+
+        for i in range(n):
+            train_label[i] = np.argmax(gamma[i, :])
+            # assign each point to its most likely cluster (one could instead treat
+            # gamma as a distribution and sample from it, but that would be less stable)
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(train_data[i], self.mu[train_label[i]])
+            # total distance of all points to their cluster centers
+
+        if debug:
+            return self.mu, sum_of_distance
+
+
+    def predict(self, test_data):
+
+        test_data = preprocess(test_data, self.preprocess_type)
+
+        n = test_data.shape[0]                 # number of test samples
+        dim = test_data.shape[1]               # data dimensionality
+        test_label = np.zeros((n)).astype(int) # cluster index of each point, cast to int
+        gamma = np.zeros((n, self.K))          # GMM responsibilities: how likely each point is to belong to each cluster
+
+        for k in range(self.K):
+            tail = np.log(self.alpha[k]) - 0.5 * (dim * np.log(2 * np.pi) + np.log(np.linalg.det(self.cov[k])))
+            cov_inv = np.linalg.inv(self.cov[k])
+            for i in range(n):
+                dev = test_data[i] - self.mu[k]
+                gamma[i, k] = -0.5 * (np.matmul(np.matmul(dev.T, cov_inv), dev)) + tail
+                # same gamma computation as in fit
+
+        for i in range(n):
+            test_label[i] = np.argmax(gamma[i, :])
+            # assign each point to its most likely cluster
+
+        sum_of_distance = 0
+        for i in range(n):
+            sum_of_distance += distance(test_data[i], self.mu[test_label[i]])
+            # total distance of all points to their cluster centers
+
+        return test_label
+
+#----------------- Utils ----------------- #
+import matplotlib.pyplot as plt
+from matplotlib import ticker
+
+def line_plot_init(n_clusters):
+    fig, ax = plt.subplots()
+    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
+    ax.axvline(n_clusters, ls = '-', linewidth = 1.0, c = 'blue')
+    return fig, ax
+
+def line_plot(ax, X, Y, c, label):
+    ax.plot(X, Y, color = c, linewidth = 1.0, label = label)
+    ax.scatter(X, Y, color = c, s = 4.0, marker = 'o')
+    ax.legend()
+
+def line_plot_finish(fig, ax, filename):
+    fig.savefig(filename, dpi = 1000)
+    fig.show()
+
+def gen_random_filename(l = 10):
+    s = ''
+    for i in range(l):
+        s += (chr(np.random.randint(26) + ord('a')))
+    s += ' '
+    return s
+#----------------- Utils ----------------- #
+
+class ClusteringAlgorithm:
+    def __init__(self, algorithm_tag = "ALL", max_clusters = 15, n = 0, dim = 0, n_clusters = 0):
+        """
+        algorithm_tag: which algorithms to try, one of "KMeans", "GMM" and "ALL" (try everything)
+        max_clusters: maximum number of clusters
+        n, dim, n_clusters: the parameters used to generate the original data
+        """
+
+        self.algorithm_tag = algorithm_tag
+        self.max_clusters = max_clusters
+        self.n, self.dim, self.n_clusters = n, dim, n_clusters
+
+    def fit(self, train_data):
+        fig, ax = line_plot_init(self.n_clusters)
+        # initialize plt and mark the cluster count used to generate the original data
+
+        suffix = str(self.n) + ', ' + str(self.dim) + ', ' + str(self.n_clusters) + ".png" # record the relevant parameters in the file name
+
+        if self.algorithm_tag == "KMeans" or self.algorithm_tag == "ALL":
+
+            sod = np.zeros((self.max_clusters)) # initialize the sum_of_distance array (total distance of all points to their cluster centers)
+
+            for n_clusters in range(1, self.max_clusters):
+                model = KMeans(n_clusters)
+                center, sod[n_clusters] = model.fit(train_data, iter_times = 20, debug = True)
+
+            line_plot(ax, np.arange(1, self.max_clusters), sod[1: self.max_clusters], "C3", "KMeans")
+            # visualize sum_of_distance
+
+            thres, min_sod, min_sod_K, step = 3, sod[1], 1, sod[1] * 0.1
+            # respectively: the patience threshold, the minimum sum_of_distance, the cluster count achieving it, and the fluctuation size
+            for n_clusters in range(2, self.max_clusters):
+                if sod[n_clusters] < min_sod - step:
+                    min_sod = sod[n_clusters]
+                    min_sod_K = n_clusters
+                    thres = 3
+                else:
+                    thres -= 1
+                    if thres == 0:
+                        break
+            """
+            This is a custom rule for finding the knee point in the elbow method; it works well in practice and matches manual inspection closely.
+            - First, since the sum_of_distance usually changes a lot when there is only one cluster, the fluctuation size 'step' is defined as 10% of sod[1] (an empirical parameter).
+            - The sum_of_distance array is then scanned from front to back, requiring each entry to undercut the current minimum by at least the fluctuation size 'step'. If it does, the minimum and the candidate cluster count are updated and the patience threshold 'thres' is restored. Otherwise the step counts as a failure and 'thres' is decremented by 1.
+            - As soon as the patience threshold 'thres' reaches 0, the scan stops immediately; the current candidate becomes the cluster count chosen by the algorithm and is used for retraining and prediction below.
+            """
+
+            ax.axvline(min_sod_K, ls = '--', linewidth = 1.0, c = 'C3')
+            # mark the cluster count the algorithm considers optimal
+
+            # print(min_sod_K)
+
+            self.model = KMeans(min_sod_K)
+            self.model.fit(train_data, iter_times = 50)
+            # prefer K-Means; retrain it with an increased iteration count
+
+
+        if self.algorithm_tag == "GMM" or self.algorithm_tag == "ALL":
+
+            sod = np.zeros((self.max_clusters)) # initialize the sum_of_distance array (total distance of all points to their cluster centers)
+
+            for n_clusters in range(1, self.max_clusters):
+                model = GaussianMixture(n_clusters)
+                center, sod[n_clusters] = model.fit(train_data, iter_times = 20, debug = True)
+
+            line_plot(ax, np.arange(1, self.max_clusters), sod[1: self.max_clusters], "C1", "GMM")
+            # visualize sum_of_distance
+
+            thres, min_sod, min_sod_K, step = 3, sod[1], 1, sod[1] * 0.1
+            for n_clusters in range(2, self.max_clusters):
+                if sod[n_clusters] < min_sod - step:
+                    min_sod = sod[n_clusters]
+                    min_sod_K = n_clusters
+                    thres = 3
+                else:
+                    thres -= 1
+                    if thres == 0:
+                        break
+            # same procedure as above; see the detailed description there
+
+            ax.axvline(min_sod_K, ls = '--', linewidth = 1.0, c = 'C1')
+            # mark the cluster count the algorithm considers optimal
+
+            # print(min_sod_K)
+
+            if self.algorithm_tag == "GMM":
+                self.model = GaussianMixture(min_sod_K)
+                self.model.fit(train_data, iter_times = 50)
+                # K-Means is preferred; GMM is used only when explicitly requested,
+                # again retrained with an increased iteration count
+
+        line_plot_finish(fig, ax, gen_random_filename() + "Elbow Method " + self.algorithm_tag + str(self.max_clusters) + ' ' + suffix)
+
+    def predict(self, test_data):
+        return self.model.predict(test_data)
+
+
+def data_plot_init():
+    return plt.subplots(1, 2)
+
+def data_plot(fig, ax, pos, data, label, center):
+
+    for i in range(data.shape[0]):
+        ax[pos].scatter(data[i, 0], data[i, 1], color = "C" + str(label[i]), marker = 'o')
+        # plot the clustering result for every point
+    for i in range(center.shape[0]):
+        ax[pos].scatter(center[i, 0], center[i, 1], color = "black", marker = 'x')
+        # plot the cluster centers
+
+def data_plot_finish(fig, filename):
+    fig.savefig(filename, dpi = 1000)
+    # fig.show()
+
+def test_model(n = 500, dim = 2, n_clusters = 5, type = "GMM"):
+    """
+    n: number of vectors
+    dim: dimensionality of the vectors
+    n_clusters: number of Gaussian clusters used to generate the data
+    type: the algorithm to test, one of "KMeans", "GMM" and "CA" (automatic clustering)
+    """
+
+    center = np.random.rand(n_clusters, dim) * 30. # draw n_clusters random cluster centers, scaled by a factor to widen the spacing
+    train_data = np.zeros((n, dim))                # initialize the training data array
+    train_label = np.zeros((n)).astype(int)        # initialize the label array
+
+    for i in range(n):
+        train_label[i] = np.random.randint(n_clusters)
+        train_data[i] = np.random.normal(loc = center[train_label[i], :], scale = 1.)
+        # pick the i-th point's center at random, then sample around that mean from a Gaussian with standard deviation 1
+
+    idx = np.arange(n)
+    np.random.shuffle(idx)
+    train_label, train_data = train_label[idx], train_data[idx]
+    # shuffle the data
+
+    fig, ax = data_plot_init()
+    suffix = ' '
+
+    if dim == 2:
+        suffix = str(n) + ', ' + str(dim) + ', ' + str(n_clusters) + ".png"
+        data_plot(fig, ax, 0, train_data, train_label, center)
+        # visualize the data and record the relevant parameters in the file name
+
+    if type == "KMeans": # test K-Means
+        model = KMeans(n_clusters)
+        _center, sod = model.fit(train_data, debug = True) # with debug set to True, fit returns the trained centers and the total point-to-center distance
+        _label = model.predict(train_data)
+
+        err = 0
+        for i in range(n):
+            for j in range(i):
+                if (train_label[i] == train_label[j] and _label[i] != _label[j]) or (train_label[i] != train_label[j] and _label[i] == _label[j]):
+                    err += 1
+        print("Accuracy: {0}%".format(100 - err / (n * (n - 1) / 2) * 100))
+        # accuracy is measured by checking, for every pair of points, whether its status (same cluster or not) agrees between the original data and the prediction
+
+        if dim == 2:
+            data_plot(fig, ax, 1, preprocess(train_data), _label, _center)
+            # visualize the clustering result for 2-D data
+
+    elif type == "GMM": # test the Gaussian Mixture Model
+        model = GaussianMixture(n_clusters)
+        _center, sod = model.fit(train_data, debug = True) # with debug set to True, fit returns the trained centers and the total point-to-center distance
+        _label = model.predict(train_data)
+
+        err = 0
+        for i in range(n):
+            for j in range(i):
+                if (train_label[i] == train_label[j] and _label[i] != _label[j]) or (train_label[i] != train_label[j] and _label[i] == _label[j]):
+                    err += 1
+        print("Accuracy: {0}%".format(100 - err / (n * (n - 1) / 2) * 100))
+        # accuracy is measured by checking, for every pair of points, whether its status (same cluster or not) agrees between the original data and the prediction
+
+        if dim == 2:
+            data_plot(fig, ax, 1, preprocess(train_data), _label, _center)
+            # visualize the clustering result for 2-D data
+
+    elif type == "CA": # test the automatic clustering algorithm
+        model = ClusteringAlgorithm(n = n, dim = dim, n_clusters = n_clusters)
+        model.fit(train_data)
+        model.predict(train_data)
+
+    if dim == 2:
+        data_plot_finish(fig, gen_random_filename() + suffix)
+
+# test_model()
+# Run the tests!
diff --git a/assignment-3/submission/19307130062/tester_demo.py b/assignment-3/submission/19307130062/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..2eb8ff3e26ad5546412a857fc0bb1e870c08b6de
--- /dev/null
+++ b/assignment-3/submission/19307130062/tester_demo.py
@@ -0,0 +1,112 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+    data = np.concatenate(datas)
+    label = np.concatenate([
+        np.ones((d.shape[0],), dtype=int)*i
+        for (i, d) in enumerate(datas)
+    ])
+    N = data.shape[0]
+    idx = np.arange(N)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    return data, label
+
+
+def data_1():
+    mean = (1, 2)
+    cov = np.array([[73, 0], [0, 22]])
+    x = np.random.multivariate_normal(mean, cov, (800,))
+
+    mean = (16, -5)
+    cov = np.array([[21.2, 0], [0, 32.1]])
+    y = np.random.multivariate_normal(mean, cov, (200,))
+
+    mean = (10, 22)
+    cov = np.array([[10, 5], [5, 10]])
+    z = np.random.multivariate_normal(mean, cov, (1000,))
+
+    data, _ = shuffle(x, y, z)
+    return (data, data), 3
+
+
+def data_2():
+    train_data = np.array([
+        [23, 12, 173, 2134],
+        [99, -12, -126, -31],
+        [55, -145, -123, -342],
+    ])
+    return (train_data, train_data), 2
+
+
+def data_3():
+    train_data = np.array([
+        [23],
+        [-2999],
+        [-2955],
+    ])
+    return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+    model = algorithm_class(n_clusters)
+    model.fit(train_data)
+    res = model.predict(test_data)
+    assert len(
+        res.shape) == 1 and res.shape[0] == test_data.shape[0], "shape of result is wrong"
+    return res
+
+
+def testcase_1_1():
+    test_with_n_clusters(data_1, KMeans)
+    return True
+
+
+def testcase_1_2():
+    res = test_with_n_clusters(data_2, KMeans)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+    test_with_n_clusters(data_1, GaussianMixture)
+    return True
+
+
+def testcase_2_2():
+    res = test_with_n_clusters(data_3, GaussianMixture)
+    return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+    testcases = [
+        ["KMeans-1", testcase_1_1, 4],
+        ["KMeans-2", testcase_1_2, 4],
+        # ["KMeans-3", testcase_1_3, 4],
+        # ["KMeans-4", testcase_1_4, 4],
+        # ["KMeans-5", testcase_1_5, 4],
+        ["GMM-1", testcase_2_1, 4],
+        ["GMM-2", testcase_2_2, 4],
+        # ["GMM-3", testcase_2_3, 4],
+        # ["GMM-4", testcase_2_4, 4],
+        # ["GMM-5", testcase_2_5, 4],
+    ]
+    sum_score = sum([case[2] for case in testcases])
+    score = 0
+    for case in testcases:
+        res = case[2] if case[1]() else 0
+        score += res
+        print("+ {:14} {}/{}".format(case[0], res, case[2]))
+    print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "--report":
+        test_all(True)
+    else:
+        test_all()