diff --git a/assignment-3/submission/17307130133/.keep b/assignment-3/submission/17307130133/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-3/submission/17307130133/README.md b/assignment-3/submission/17307130133/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f113f2dfa10885bdb363b9190746487339ef287a
--- /dev/null
+++ b/assignment-3/submission/17307130133/README.md
@@ -0,0 +1,151 @@
+# Assignment 3: Clustering Algorithms
+
+This submission implements K-Means++ and a Gaussian mixture model (GMM), plus a method that automatically selects the number of clusters based on the silhouette coefficient.
+
+## Implementation of K-Means and GMM
+
+### K-Means
+
+The basic idea of K-Means is to iteratively search for k cluster centers such that, when each cluster is represented by its mean, the total error is minimized. Here the error is the sum of the distances from every sample point to its assigned cluster center.
+
+The algorithm proceeds as follows:
+
+```python
+init_centers()
+while cluster_changed:
+    # assignment step
+    for point in train_data:
+        for center in centers:
+            compute the distance between point and center
+        assign point to the nearest center's cluster
+    # update step
+    for cluster in clusters:
+        set the cluster center to the mean of the points in the cluster
+```
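+
+For concreteness, below is a minimal runnable NumPy sketch of one such iteration (vectorized, unlike the loop-based `source.py`; `X` and `centers` are illustrative names):
+
+```python
+import numpy as np
+
+def kmeans_step(X, centers):
+    """One iteration: assign each point, then recompute the centers."""
+    # pairwise distances, shape (n_points, n_centers)
+    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
+    labels = dists.argmin(axis=1)  # index of the nearest center per point
+    new_centers = np.array([
+        X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
+        for k in range(len(centers))
+    ])
+    return labels, new_centers
+```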
+
+#### K-Means++
+
+Because vanilla K-Means picks its k initial centers at random, it may pick points that lie very close together, which not only slows convergence but also makes the algorithm prone to local minima, so the clustering result is unstable.
+
+K-Means++ initialization is used to address this. The procedure is as follows:
+
+```python
+randomly choose the first center
+for i in range(1, n_clusters):
+    for point in train_data:
+        compute the distance from point to each chosen center
+        let D(point) be the minimum of these distances
+    choose the point with the largest D as the next center
+```
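+
+Strictly speaking, this is the deterministic farthest-point variant of K-Means++ (canonical K-Means++ samples the next center with probability proportional to D(x)^2). A minimal NumPy sketch of the variant used here, with illustrative names:
+
+```python
+import numpy as np
+
+def farthest_point_init(X, n_clusters, seed=None):
+    """Pick the first center at random, then repeatedly take the point
+    farthest from its nearest already-chosen center."""
+    rng = np.random.default_rng(seed)
+    centers = [X[rng.integers(len(X))]]
+    for _ in range(1, n_clusters):
+        # distance from every point to its nearest chosen center
+        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
+        centers.append(X[d.argmax()])
+    return np.array(centers)
+```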
+
+### GMM
+
+A Gaussian mixture model is composed of several Gaussian distributions; its overall density function is a weighted combination of Gaussian densities. Generating a sample x from a GMM takes two steps: first pick one of the Gaussians according to a multinomial distribution, then draw x from the chosen Gaussian.
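+
+A minimal sketch of this two-step generative process (1-D; the parameter values are purely illustrative):
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+pi = np.array([0.5, 0.3, 0.2])     # mixture weights (multinomial probabilities)
+mu = np.array([0.0, 5.0, -3.0])    # component means
+sigma = np.array([1.0, 0.5, 2.0])  # component standard deviations
+
+k = rng.choice(len(pi), p=pi)      # step 1: pick a Gaussian
+x = rng.normal(mu[k], sigma[k])    # step 2: draw a sample from it
+```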
+
+Given N training samples generated by a Gaussian mixture model, we want to learn the parameters $\pi_k,\mu_k,\sigma_k$.
+
+Following the EM algorithm, parameter estimation iterates over two steps:
+
+(1) E-step: with the parameters $\pi,\mu,\sigma$ fixed, compute the posterior
+$$
+\gamma_{nk} \triangleq p(z^{(n)}=k|x^{(n)})\\\\
+=\frac{\pi_k N(x^{(n)};\mu_k,\sigma_k)}{\sum_{k'=1}^{K}\pi_{k'}N(x^{(n)};\mu_{k'},\sigma_{k'})}
+$$
+(2) M-step: setting $q(z=k)=\gamma_{nk}$, the evidence lower bound on the training set D is
+$$
+ELBO(\gamma,D;\pi,\mu,\sigma)=\sum_{n=1}^N\sum_{k=1}^K\gamma_{nk}\left(\frac{-(x^{(n)}-\mu_k)^2}{2\sigma_k^2}-\log\sigma_k+\log\pi_k\right)+C
+$$
+This converts parameter estimation into a constrained optimization problem, with the constraint
+$$
+\sum_{k=1}^K\pi_k=1
+$$
+Solving with Lagrange multipliers, by setting the partial derivatives to zero, yields
+$$
+\pi_k=\frac{N_k}{N},
+\mu_k=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}x^{(n)},
+\sigma_k^2=\frac{1}{N_k}\sum_{n=1}^N\gamma_{nk}(x^{(n)}-\mu_k)^2.
+\\\\N_k=\sum_{n=1}^{N}\gamma_{nk}
+$$
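+
+A minimal 1-D NumPy sketch of one EM iteration implementing exactly these update formulas (`source.py` handles the multivariate case; the names here are illustrative):
+
+```python
+import numpy as np
+
+def em_step(x, pi, mu, sigma):
+    """x: (N,) samples; pi, mu, sigma: (K,) mixture parameters."""
+    # E-step: responsibilities gamma[n, k]
+    pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
+    gamma = pi * pdf
+    gamma /= gamma.sum(axis=1, keepdims=True)
+    # M-step: the closed-form updates derived above
+    Nk = gamma.sum(axis=0)
+    pi = Nk / len(x)
+    mu = (gamma * x[:, None]).sum(axis=0) / Nk
+    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
+    return pi, mu, sigma
+```
+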
+Initialization matters a lot for the GMM; here the parameters are initialized with K-Means++ run for at most 20 iterations.
+
+## Experiments
+
+### Comparing K-Means and K-Means++
+
+The figure below shows the data to be clustered:
+
+
+
+The figures below show the results of clustering this data with random (plain K-Means) initialization:
+
+
+
+
+
+The figure below shows the result after K-Means++ initialization; this result is very stable.
+
+
+
+As intended, K-Means++ initialization escapes local minima and stabilizes the clustering result.
+
+Moreover, over 10 runs, K-Means++ initialization needed 4 iterations on average, versus 7 with random initialization, so K-Means++ initialization also reduces the number of clustering iterations.
+
+### Comparing K-Means++ and GMM
+
+The same data is used to compare K-Means++ and GMM. The figure below shows the GMM clustering result:
+
+
+
+On this dataset, GMM performs similarly to K-Means++.
+
+Next, a larger dataset of three closely spaced clusters is generated to compare the two algorithms further. The dataset is shown below:
+
+
+
+The K-Means++ clustering result:
+
+
+
+The GMM clustering result:
+
+
+
+In this case K-Means++ is slightly better than GMM, but the difference is still small.
+
+Although GMM clusters about as well as K-Means++, it takes considerably longer to run, since it additionally uses K-Means++ to initialize its parameters.
+
+## Automatic Selection of the Number of Clusters
+
+The silhouette coefficient is a measure of clustering quality that combines cohesion and separation. It is computed as follows:
+
+```python
+for point in dataset:
+    a(point) = avg distance from point to the other points in its own cluster
+    b(point) = min over other clusters of the avg distance from point to that cluster
+    silhouette_coefficient(point) = (b(point) - a(point)) / max(a(point), b(point))
+```
+
+By construction, the silhouette coefficient lies in [-1, 1]: values close to 1 indicate good clustering and values close to -1 poor clustering. Averaging the coefficient over all points in the dataset gives an overall score for the clustering, which can be used to select the number of clusters automatically: compute the overall score for every cluster count in [2, 20] and pick the count with the largest score, as in the sketch below.
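+
+A minimal sketch of this selection loop (assuming a `KMeans`-style class with `fit`/`predict` as in `source.py`; `silhouette_score` here is an illustrative helper, not part of the submission):
+
+```python
+import numpy as np
+
+def silhouette_score(X, labels):
+    """Mean silhouette coefficient over all points."""
+    scores = []
+    for i in range(len(X)):
+        d = np.linalg.norm(X - X[i], axis=1)      # distances from point i
+        same = labels == labels[i]
+        others = same & (np.arange(len(X)) != i)  # own cluster, excluding i
+        a = d[others].mean() if others.any() else 0.0
+        b = min(d[labels == k].mean() for k in set(labels) if k != labels[i])
+        scores.append((b - a) / max(a, b))
+    return float(np.mean(scores))
+
+def pick_n_clusters(X, model_class, lo=2, hi=20):
+    """Return the cluster count in [lo, hi] with the largest silhouette."""
+    best_k, best_s = lo, -1.0
+    for k in range(lo, hi + 1):
+        model = model_class(k)
+        model.fit(X)
+        s = silhouette_score(X, model.predict(X))
+        if s > best_s:
+            best_k, best_s = k, s
+    return best_k
+```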
+
+### Experiment
+
+The following dataset is used for the automatic-selection experiment:
+
+
+
+K-Means settles on 3 clusters; the clustering result is shown below:
+
+
+
+The number of clusters chosen by GMM is unstable, sometimes 2 and sometimes 3; the clustering result is shown below:
+
+
+
+## Usage
+
+```bash
+python source.py --kmeans       # generate data, cluster it with K-Means, and visualize the result
+python source.py --gm           # generate data, cluster it with GMM, and visualize the result
+python source.py --auto_kmeans  # generate data, auto-select the cluster count, cluster with K-Means, and visualize
+python source.py --auto_gm      # generate data, auto-select the cluster count, cluster with GMM, and visualize
+```
+
diff --git a/assignment-3/submission/17307130133/img/.keep b/assignment-3/submission/17307130133/img/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-3/submission/17307130133/img/auto_gm.png b/assignment-3/submission/17307130133/img/auto_gm.png
new file mode 100644
index 0000000000000000000000000000000000000000..92760d2f0e72f2502b52111580269f58b6cebd3b
Binary files /dev/null and b/assignment-3/submission/17307130133/img/auto_gm.png differ
diff --git a/assignment-3/submission/17307130133/img/auto_gm2.png b/assignment-3/submission/17307130133/img/auto_gm2.png
new file mode 100644
index 0000000000000000000000000000000000000000..a92da7446d996c8abb6c1344ea2814a2e424025e
Binary files /dev/null and b/assignment-3/submission/17307130133/img/auto_gm2.png differ
diff --git a/assignment-3/submission/17307130133/img/expr1.png b/assignment-3/submission/17307130133/img/expr1.png
new file mode 100644
index 0000000000000000000000000000000000000000..ab23d794a0897090f25c03875a8101d15d516d49
Binary files /dev/null and b/assignment-3/submission/17307130133/img/expr1.png differ
diff --git a/assignment-3/submission/17307130133/img/expr2.png b/assignment-3/submission/17307130133/img/expr2.png
new file mode 100644
index 0000000000000000000000000000000000000000..68b3b83a8e3cae224ef4db1599712562c72bda82
Binary files /dev/null and b/assignment-3/submission/17307130133/img/expr2.png differ
diff --git a/assignment-3/submission/17307130133/img/gm1.png b/assignment-3/submission/17307130133/img/gm1.png
new file mode 100644
index 0000000000000000000000000000000000000000..6077cf252933a9714a6c93aa365ec00304f9816c
Binary files /dev/null and b/assignment-3/submission/17307130133/img/gm1.png differ
diff --git a/assignment-3/submission/17307130133/img/gm2.png b/assignment-3/submission/17307130133/img/gm2.png
new file mode 100644
index 0000000000000000000000000000000000000000..5402953bfca724837c03e953a66f2713186a1f33
Binary files /dev/null and b/assignment-3/submission/17307130133/img/gm2.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans++.png b/assignment-3/submission/17307130133/img/kmeans++.png
new file mode 100644
index 0000000000000000000000000000000000000000..e0a190c7e647c11f1f780abbad3dfa0621cfebba
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans++.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans++2.png b/assignment-3/submission/17307130133/img/kmeans++2.png
new file mode 100644
index 0000000000000000000000000000000000000000..a202ae37b7a2a35f5e39366226d2c88b8d8d6ab2
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans++2.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans++3.png b/assignment-3/submission/17307130133/img/kmeans++3.png
new file mode 100644
index 0000000000000000000000000000000000000000..7ce6a3929940818607e0a76b1383849ebcadf21a
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans++3.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans1.png b/assignment-3/submission/17307130133/img/kmeans1.png
new file mode 100644
index 0000000000000000000000000000000000000000..b3ec8f99dd5212c0de75fb11f3eddadbcbcb5286
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans1.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans2.png b/assignment-3/submission/17307130133/img/kmeans2.png
new file mode 100644
index 0000000000000000000000000000000000000000..69b2dba7879165d5f8b91db7457eab2157ace3c3
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans2.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans3.png b/assignment-3/submission/17307130133/img/kmeans3.png
new file mode 100644
index 0000000000000000000000000000000000000000..3d190f0fe374c78280180bce479f1eca0cbef429
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans3.png differ
diff --git a/assignment-3/submission/17307130133/img/kmeans4.png b/assignment-3/submission/17307130133/img/kmeans4.png
new file mode 100644
index 0000000000000000000000000000000000000000..a50d3f3a82c7a3c1822a92eebcbb6dc57f991d06
Binary files /dev/null and b/assignment-3/submission/17307130133/img/kmeans4.png differ
diff --git a/assignment-3/submission/17307130133/source.py b/assignment-3/submission/17307130133/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..9d2324d41b39af976c57e578cf61e9cfccc64a51
--- /dev/null
+++ b/assignment-3/submission/17307130133/source.py
@@ -0,0 +1,350 @@
+import math
+import sys
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+def ecul_distance(x, y):
+ """
+ :param x: vector
+ :param y: vector
+    :return: Euclidean distance between x and y
+ """
+ return np.sqrt(np.nansum((x - y) ** 2))
+
+
+def draw(n_clusters, label, points):
+ """
+ visualize the clustering result of data
+ :param n_clusters: number of clusters
+ :param label: label of data
+ :param points: data
+ :return: None
+ """
+ for i in range(n_clusters):
+ cluster_i = np.where(label == i)
+ plt.scatter(points[cluster_i][:, 0], points[cluster_i][:, 1])
+ plt.show()
+
+
+class KMeans:
+
+ def __init__(self, n_clusters, max_iterations=100):
+ self.k = n_clusters
+ self.cluster_centers = []
+ self.max_iterations = max_iterations
+
+ def get_nearest_center(self, x):
+ """
+        get the index of the center nearest to x
+        :param x: a data point (vector)
+        :return: index of the nearest center
+ """
+ center = -1
+ min_dis = 1e9
+ for i in range(len(self.cluster_centers)):
+ dis = ecul_distance(self.cluster_centers[i], x)
+ if dis < min_dis:
+ min_dis = dis
+ center = i
+ return center
+
+ def init_center(self):
+ """
+ init centers in KMeans++
+ :return: None
+ """
+ n = len(self.train_data)
+ # choose the first center randomly
+ self.cluster_centers.append(self.train_data[np.random.choice(n)])
+ # choose the rest k-1 centers
+ for i in range(1, self.k):
+ # find the point with the largest distance to the nearest center
+ dis_point_center = []
+ # for each sample
+ for j in range(n):
+                # pass the point itself (not its index) to find its nearest center
+                nearest_center = self.get_nearest_center(self.train_data[j])
+ dis = ecul_distance(self.train_data[j], self.cluster_centers[nearest_center])
+ dis_point_center.append(dis)
+ # add this point to centers
+ self.cluster_centers.append(self.train_data[np.nanargmax(dis_point_center)])
+
+ def fit(self, train_data):
+ self.train_data = train_data
+ self.init_center()
+ # stores which cluster this sample belongs to
+ cluster_assment = np.zeros(self.train_data.shape[0])
+
+ center_changed = True
+ iteration = 0
+ while center_changed and iteration < self.max_iterations:
+ center_changed = False
+ # for each sample
+ for i in range(len(self.train_data)):
+                # find nearest center
+ min_idx = self.get_nearest_center(self.train_data[i])
+
+ # update cluster
+ if cluster_assment[i] != min_idx:
+ center_changed = True
+ cluster_assment[i] = min_idx
+
+            # update centers
+            for j in range(self.k):
+                # boolean mask of the points assigned to cluster j
+                in_cluster = cluster_assment == j
+                # move the center only if the cluster is non-empty
+                if np.any(in_cluster):
+                    self.cluster_centers[j] = np.nanmean(train_data[in_cluster], axis=0)
+
+ iteration = iteration + 1
+
+ print("iterations for k-means: %d" % iteration)
+
+ def predict(self, test_data, show=False):
+ """
+ predict (and visualize result) with k-means
+        :param test_data: data to assign to clusters
+        :param show: True to visualize the result, False otherwise
+        :return: array of predicted cluster indices
+ """
+ ret = []
+ for i in range(len(test_data)):
+ ret.append(self.get_nearest_center(test_data[i]))
+ if show:
+ draw(self.k, np.array(ret), test_data)
+ return np.array(ret)
+
+
+class GaussianMixture:
+
+ def __init__(self, n_clusters, max_iterations=50):
+ self.n_clusters = n_clusters
+ self.pi = np.random.randint(0, 100, size=n_clusters)
+ self.pi = self.pi / np.nansum(self.pi)
+ self.sigma = {}
+ self.mu = {}
+ self.gamma = None
+ self.dim = 0
+ self.max_iterations = max_iterations
+
+ def init_param(self):
+ """
+ init parameters by k-means++
+ :return: None
+ """
+ k_means_model = KMeans(self.n_clusters, max_iterations=20)
+ k_means_model.fit(self.train_data)
+ if len(self.train_data.shape) == 1:
+ self.dim = 1
+ else:
+ self.dim = self.train_data.shape[1]
+ for i in range(self.n_clusters):
+ self.mu[i] = k_means_model.cluster_centers[i]
+ self.sigma[i] = np.ones(self.dim)
+ self.sigma[i] = np.diag(self.sigma[i])
+ self.gamma = np.empty([self.train_data.shape[0], self.n_clusters])
+
+ def point_probability(self, point):
+ """
+        calculate the per-cluster weighted densities used in the E step
+        :param point: a point in the dataset
+        :return: list of pi_k * N(point; mu_k, sigma_k) for each cluster k
+        """
+        ret = []
+        for i in range(self.n_clusters):
+            pt = point.reshape(-1, 1)
+            mu = self.mu[i].reshape(-1, 1)
+            sigma = self.sigma[i]
+            D = self.dim
+            # multivariate Gaussian density N(point; mu, sigma)
+            coef = 1 / ((2 * math.pi) ** (D / 2) * np.linalg.det(sigma) ** 0.5)
+            pw = -0.5 * np.matmul(np.matmul((pt - mu).T, np.linalg.inv(sigma)), (pt - mu))
+            # a tiny constant keeps every density strictly positive, so the
+            # responsibilities in the E step never divide by zero
+            result = float(coef * np.exp(pw) + np.exp(-200))
+            ret.append(result * self.pi[i])
+ return ret
+
+ def fit(self, train_data):
+ self.train_data = train_data
+ self.init_param()
+
+        # train with EM; the update equations are derived in the README
+        for iteration in range(self.max_iterations):
+ # E step
+ for i in range(train_data.shape[0]):
+ temp = np.array(self.point_probability(train_data[i]))
+ self.gamma[i, :] = temp
+ self.gamma = self.gamma / self.gamma.sum(axis=1).reshape(-1, 1)
+
+ # M step
+ # update pi
+ self.pi = np.nansum(self.gamma, axis=0) / train_data.shape[0]
+ for label in range(self.n_clusters):
+ mu = np.zeros(self.dim)
+ sigma = np.zeros([self.dim, self.dim])
+ for i in range(train_data.shape[0]):
+ mu += self.gamma[i, label] * train_data[i]
+ point = train_data[i].reshape(-1, 1)
+ label_mu = self.mu[label].reshape(-1, 1)
+ rest = point - label_mu
+ sigma += self.gamma[i, label] * np.matmul(rest, rest.T)
+ # update mu
+ self.mu[label] = mu / np.nansum(self.gamma, axis=0)[label]
+ # update sigma
+ self.sigma[label] = sigma / np.nansum(self.gamma, axis=0)[label]
+ print("iterations for GMM: %d" % iteration)
+
+ def predict(self, test_data, show=False):
+ ret = []
+ for i in test_data:
+ prob_dist = self.point_probability(i)
+ label = prob_dist.index(max(prob_dist))
+ ret.append(label)
+ if show:
+ draw(self.n_clusters, np.array(ret), test_data)
+ return np.array(ret)
+
+
+class ClusteringAlgorithm:
+
+ def __init__(self, model="kmeans"):
+ if model == "kmeans":
+ self.ModelClass = KMeans
+ elif model == "gm":
+ self.ModelClass = GaussianMixture
+
+ def get_dis_mtx(self):
+ """
+        precompute the matrix of pairwise distances to avoid repeated calculation
+        :return: None (the matrix is stored in self.distance)
+ """
+ n = self.train_data.shape[0]
+ distance = np.zeros((n, n))
+ for i in range(n):
+ distance[i] = (np.nansum((self.train_data[i] - self.train_data) ** 2, axis=1)) ** 0.5
+ self.distance = distance
+
+ def get_avg_dis(self, cluster_idx, point):
+ """
+        get the average distance between point and the points in a cluster
+        :param cluster_idx: indices of the points in the cluster (as returned by np.where)
+        :param point: index of the query point
+        :return: average distance, or 0 for a singleton cluster
+ """
+ if cluster_idx[0].shape[0] < 2:
+ return 0
+ cluster_idx = np.delete(cluster_idx, np.where(cluster_idx == point))
+ avg_dis = np.nanmean(self.distance[point, cluster_idx])
+ return avg_dis
+
+ def get_silhouette_coefficient(self, label, n_clusters):
+ """
+ calculate the silhouette coefficient for the clustering
+ :param label: label of train data
+ :param n_clusters: number of clusters
+ :return: silhouette coefficient for this clustering
+ """
+ n = self.train_data.shape[0]
+ cluster_idx = []
+ for k in range(n_clusters):
+ cluster_idx.append(np.where(label == k))
+ S = []
+ epsilon = 1e-9 # in case of division by zero
+ # cal silhouette coefficient for each point
+ for i in range(n):
+ ai = self.get_avg_dis(cluster_idx[label[i]], i)
+ bi = np.inf
+ for k in range(n_clusters):
+ if k == label[i]:
+ continue
+ bi = min(bi, self.get_avg_dis(cluster_idx[k], i))
+ S.append((bi - ai) / (max(bi, ai) + epsilon))
+ return np.nanmean(np.array(S))
+
+ def fit(self, train_data):
+ """
+ find the best number of clusters between 2 and 20 by silhouette coefficient
+ :param train_data: train_data
+ :return: None
+ """
+ self.train_data = train_data
+ self.get_dis_mtx()
+ max_sc = -1
+        for n_clusters in range(2, 21):
+ model = self.ModelClass(n_clusters)
+ model.fit(train_data)
+ label = model.predict(train_data)
+ sc = self.get_silhouette_coefficient(label, n_clusters)
+ if sc > max_sc:
+ self.model = model
+ max_sc = sc
+ print("num of clusters: %2d, silhouette_coefficient: %.6f" % (n_clusters, sc))
+
+ def predict(self, test_data, show=False):
+ return self.model.predict(test_data, show)
+
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int) * i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+
+def data_1():
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ # plt.scatter(x[:, 0], x[:, 1])
+ # plt.scatter(y[:, 0], y[:, 1])
+ # plt.scatter(z[:, 0], z[:, 1])
+ # plt.show()
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+
+if __name__ == "__main__":
+ (data, _), n_clusters = data_1()
+ if len(sys.argv) > 1 and sys.argv[1] == "--kmeans":
+ model = KMeans(n_clusters)
+ model.fit(data)
+ model.predict(data, show=True)
+ elif len(sys.argv) > 1 and sys.argv[1] == "--gm":
+ model = GaussianMixture(n_clusters)
+ model.fit(data)
+ model.predict(data, show=True)
+ elif len(sys.argv) > 1 and sys.argv[1] == "--auto_kmeans":
+ model = ClusteringAlgorithm(model="kmeans")
+ model.fit(data)
+ model.predict(data, show=True)
+ elif len(sys.argv) > 1 and sys.argv[1] == "--auto_gm":
+ model = ClusteringAlgorithm(model="gm")
+ model.fit(data)
+ model.predict(data, show=True)
diff --git a/assignment-3/submission/17307130133/tester_demo.py b/assignment-3/submission/17307130133/tester_demo.py
new file mode 100644
index 0000000000000000000000000000000000000000..19ec0e8091691d4aaaa6b53dbb695fde9e826d89
--- /dev/null
+++ b/assignment-3/submission/17307130133/tester_demo.py
@@ -0,0 +1,117 @@
+import numpy as np
+import sys
+
+from source import KMeans, GaussianMixture
+
+
+def shuffle(*datas):
+ data = np.concatenate(datas)
+ label = np.concatenate([
+ np.ones((d.shape[0],), dtype=int)*i
+ for (i, d) in enumerate(datas)
+ ])
+ N = data.shape[0]
+ idx = np.arange(N)
+ np.random.shuffle(idx)
+ data = data[idx]
+ label = label[idx]
+ return data, label
+
+
+def data_1():
+ mean = (1, 2)
+ cov = np.array([[73, 0], [0, 22]])
+ x = np.random.multivariate_normal(mean, cov, (800,))
+
+ mean = (16, -5)
+ cov = np.array([[21.2, 0], [0, 32.1]])
+ y = np.random.multivariate_normal(mean, cov, (200,))
+
+ mean = (10, 22)
+ cov = np.array([[10, 5], [5, 10]])
+ z = np.random.multivariate_normal(mean, cov, (1000,))
+
+ data, _ = shuffle(x, y, z)
+ return (data, data), 3
+
+
+def data_2():
+ train_data = np.array([
+ [23, 12, 173, 2134],
+ [99, -12, -126, -31],
+ [55, -145, -123, -342],
+ ])
+ return (train_data, train_data), 2
+
+
+def data_3():
+ train_data = np.array([
+ [23],
+ [-2999],
+ [-2955],
+ ])
+ return (train_data, train_data), 2
+
+
+def test_with_n_clusters(data_function, algorithm_class):
+    (train_data, test_data), n_clusters = data_function()
+ model = algorithm_class(n_clusters)
+ model.fit(train_data)
+ res = model.predict(test_data)
+    assert res.shape == (test_data.shape[0],), "shape of result is wrong"
+ return res
+
+
+def testcase_1_1():
+ test_with_n_clusters(data_1, KMeans)
+ return True
+
+
+def testcase_1_2():
+ res = test_with_n_clusters(data_2, KMeans)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def testcase_2_1():
+ test_with_n_clusters(data_1, GaussianMixture)
+ return True
+
+
+def testcase_2_2():
+ res = test_with_n_clusters(data_3, GaussianMixture)
+ return res[0] != res[1] and res[1] == res[2]
+
+
+def test_all(err_report=False):
+ testcases = [
+ ["KMeans-1", testcase_1_1, 4],
+ ["KMeans-2", testcase_1_2, 4],
+ # ["KMeans-3", testcase_1_3, 4],
+ # ["KMeans-4", testcase_1_4, 4],
+ # ["KMeans-5", testcase_1_5, 4],
+ ["GMM-1", testcase_2_1, 4],
+ ["GMM-2", testcase_2_2, 4],
+ # ["GMM-3", testcase_2_3, 4],
+ # ["GMM-4", testcase_2_4, 4],
+ # ["GMM-5", testcase_2_5, 4],
+ ]
+ sum_score = sum([case[2] for case in testcases])
+ score = 0
+ for case in testcases:
+ try:
+ res = case[2] if case[1]() else 0
+ except Exception as e:
+ if err_report:
+ print("Error [{}] occurs in {}".format(str(e), case[0]))
+ res = 0
+ score += res
+ print("+ {:14} {}/{}".format(case[0], res, case[2]))
+ print("{:16} {}/{}".format("FINAL SCORE", score, sum_score))
+
+
+if __name__ == "__main__":
+ if len(sys.argv) > 1 and sys.argv[1] == "--report":
+ test_all(True)
+ else:
+ test_all()