diff --git a/assignment-1/submission/18307130252/README.md b/assignment-1/submission/18307130252/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1db817a2b1f5115a6dddf0ac95191a49fbd2fc86
--- /dev/null
+++ b/assignment-1/submission/18307130252/README.md
@@ -0,0 +1,191 @@
+# Assignment-1 KNN
+
+[toc]
+
+## 1. Data Generation
+
+Calling `generate(mean, cov, prior, sz, num)` draws samples from 3 random two-dimensional Gaussian distributions.
+
+The $i$-th Gaussian has parameters `[mean[i], cov[i]]` and contributes `prior[i] * sz` random points, each recorded with label $i$. `num` is the id under which the generated data is visualized and saved.
+
+The generated data is shuffled with `np.random.shuffle`; the first 80% becomes the training set and the remaining 20% the test set.
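+
+The following is a condensed sketch of this procedure (illustrative only; the actual implementation is `generate` in `source.py`, which additionally saves the figures and a `data.npy` file, and `make_dataset` is a hypothetical name):
+
+```python
+import numpy as np
+
+def make_dataset(mean, cov, prior, sz):
+    # Draw prior[i] * sz points from the i-th 2-D Gaussian, tagged with label i.
+    dataset = []
+    for i in range(3):
+        pts = np.random.multivariate_normal(mean[i], cov[i], size=int(sz * prior[i]))
+        dataset.extend((list(p), i) for p in pts)
+    np.random.shuffle(dataset)               # shuffle, then split 80% / 20%
+    split = int(len(dataset) * 0.8)
+    return dataset[:split], dataset[split:]
+```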
+
+The figure below shows the training data for $N = 3$ and $sz = 3000$; the other parameters are listed in the following table.
+
+|           | label #1         | label #2         | label #3         |
+| --------- | ---------------- | ---------------- | ---------------- |
+| **mean**  | [0, 0]           | [-2, -2]         | [3, -2]          |
+| **cov**   | [[1, 0], [0, 1]] | [[1, 0], [0, 2]] | [[2, 0], [0, 1]] |
+| **prior** | 0.3              | 0.3              | 0.4              |
+
+![](img/distribution.png)
+
+Training data for $N = 3$
+
+## 2. KNN
+
+### Model training: fit
+
+1. Zip the input `train_data` and `train_label` together and shuffle them; the first 80% becomes the training set and the remaining 20% the development set.
+
+2. Iterate over every candidate value of $K$:
+
+   + Find the $K$ points in the training set closest to the $i$-th sample of the development set.
+   + Vote: take the $label$ that occurs most often among those $K$ points as the sample's predicted $label$.
+   + Compute the accuracy for the current $K$.
+
+3. Keep the $K$ with the highest development accuracy as the model parameter $K$.
+
+### Model prediction: predict
+
+Given the model parameter $K$, find the $K$ points closest to each query point among all of the data passed to `fit`, and produce the predicted $label$ with the same voting rule (a sketch of this neighbor-query-and-vote step follows below).
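+
+A minimal sketch of this shared neighbor-query-and-vote step (assuming squared Euclidean distance and the `(point, label)` dataset layout used in `source.py`; `knn_vote` is an illustrative name, not a function in the submission):
+
+```python
+from collections import Counter
+
+def knn_vote(query, dataset, k):
+    # Rank all training points by (squared) distance to the query point.
+    qx, qy = query
+    dis = sorted(((qx - x) ** 2 + (qy - y) ** 2, label) for (x, y), label in dataset)
+    # Majority vote among the labels of the k nearest points.
+    labels = [label for _, label in dis[:k]]
+    return Counter(labels).most_common(1)[0][0]
+```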
+
+## 3. Exploratory Experiments and Analysis
+
+### Range of K and Comparison of Distance Functions
+
+Suppose $K$ ranges over $[1, K_{max}]$, the training set has size $train_{size}$, and the development set has size $dev_{size}$. Model training then costs $O(K_{max} \cdot train_{size} \cdot dev_{size})$, so when the training and development sets are both large, $K_{max}$ has an enormous impact on training efficiency.
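+
+As an aside, the $K_{max}$ factor can be taken out of the scan itself: if each development point's neighbors are sorted once, every candidate $K$ can be scored with a single running vote over the sorted labels. A sketch of the idea (illustrative only; the submitted `fit` keeps the straightforward per-$K$ loop, labels are assumed to be integers `0..num_classes-1`, and ties are broken here toward the smallest label):
+
+```python
+import numpy as np
+
+def correct_for_all_k(sorted_labels, true_label, k_max, num_classes=3):
+    # sorted_labels: labels of one dev point's training neighbors, nearest first.
+    # Returns a boolean array: correct[k - 1] says whether the k-NN vote is right.
+    votes = np.zeros(num_classes, dtype=int)
+    correct = np.zeros(k_max, dtype=bool)
+    for k in range(1, k_max + 1):
+        votes[sorted_labels[k - 1]] += 1            # grow the neighborhood by one
+        correct[k - 1] = votes.argmax() == true_label
+    return correct
+```
+
+Summing these arrays over all development points yields the whole accuracy-vs-$K$ curve from one sort per point.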
+
+If $K_{max}$ is too large, training slows down drastically; but if $K_{max}$ is chosen too small from the start, the trained $K_{neighbors}$ may not be good enough. Therefore, three distance functions are compared first: Manhattan distance ($|x_1 - x_2| + |y_1 - y_2|$), Euclidean distance, and Chebyshev distance ($\max(|x_1 - x_2|, |y_1 - y_2|)$). The effect of $K$ on model accuracy under each distance function is shown below:
+
+![pic2](img/no_cross_1024.png)
+
+On the current dataset, every distance function can exceed 90% accuracy at some $K$, but the optimal $K$ differs: Manhattan distance has the smallest optimal $K$, Euclidean distance comes next, and Chebyshev distance peaks around $K = 1000$. The remaining experiments therefore use the Manhattan distance throughout, with $K_{max} = 200$.
+
+### Effect of K-fold Cross-Validation
+
+Relevant settings:
+
++ `KNN.cross_validation` = 0/1 toggles cross-validation on or off.
++ `KNN.K_folder` sets the number of folds.
+
+These experiments use 5-fold cross-validation.
+
+![pic](img/k-folder_comparison.png)
+
+As the comparison above shows, neither method dominates across datasets: cross-validation and a simple train/development split each come out ahead at times. Overall, though, cross-validation is slightly more accurate than the simple split.
+
+Since the accuracy gain is modest, while the $K$ value cross-validation arrives at is far larger than under the simple split and every candidate must be validated on each fold (far too slow), the simple-split training mode is kept for the remaining experiments.
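+
+For reference, a minimal sketch of how the shuffled data is carved into folds, mirroring `cross_validation_fit` in `source.py` (the helper name is illustrative):
+
+```python
+def k_fold_splits(dataset, k_folder=5):
+    # Yield (trainset, devset) pairs; each fold serves as the dev set exactly once.
+    fold_size = len(dataset) // k_folder
+    for fold in range(k_folder):
+        begin = fold * fold_size
+        end = len(dataset) if fold == k_folder - 1 else begin + fold_size
+        yield dataset[:begin] + dataset[end:], dataset[begin:end]
+```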
+
+### Effect of the Gaussian Parameters
+
+#### Effect of the Gaussian mean $\mu$
+
+The test-data distributions and the corresponding generation parameters are shown below (only $\mu$ is changed; all other parameters stay the same):
+
+![](img/fig_test20.png)
+
+$\mu_0 = [0, 0] \quad \mu_1 = [-1, -1] \quad \mu_2 = [1, 1]$
+
+$acc = 65.67\%$
+
+![](img/fig_test21.png)
+
+$\mu_0 = [0, 0] \quad \mu_1 = [-3, -3] \quad \mu_2 = [3, 3]$
+
+$acc = 96.17\%$
+
+![](img/fig_test22.png)
+
+$\mu_0 = [0, 0] \quad \mu_1 = [-1, -1] \quad \mu_2 = [1.5, -1]$
+
+$acc = 67.67\%$
+
+![](img/fig_test23.png)
+
+$\mu_0 = [0, 0] \quad \mu_1 = [-2, -2] \quad \mu_2 = [3, -2]$
+
+$acc = 91.83\%$
+
+This suggests that the geometric arrangement of the means (collinear vs. triangular) has little effect on prediction accuracy, while the degree of overlap between the distributions affects it greatly: the more they overlap, the lower the accuracy.
+
+#### Effect of the Gaussian covariance $\Sigma$
+
+The test-data distributions and the corresponding generation parameters are shown below (only the covariance matrices are changed; all other parameters stay the same):
+
+![](img/fig_test30.png)
+$$
+\Sigma_0 = [[1, 0], [0, 1]] \\
+\Sigma_1 = [[1, 0], [0, 2]] \\
+\Sigma_2 = [[2, 0], [0, 1]]
+$$
+
+$acc = 88.83\%$
+
+![](img/fig_test31.png)
+$$
+\Sigma_0 = [[2, 0], [0, 2]] \\
+\Sigma_1 = [[2, 0], [0, 4]] \\
+\Sigma_2 = [[4, 0], [0, 2]]
+$$
+
+$acc = 78.67\%$
+
+![](img/fig_test32.png)
+$$
+\Sigma_0 = [[10, 5], [5, 10]] \\
+\Sigma_1 = [[5, 3], [3, 2]] \\
+\Sigma_2 = [[4, 2], [2, 8]]
+$$
+
+$acc = 73.67\%$
+
+![](img/fig_test33.png)
+$$
+\Sigma_0 = [[5, 2.5], [2.5, 5]] \\
+\Sigma_1 = [[2.5, 1.5], [1.5, 1]] \\
+\Sigma_2 = [[2, 1], [1, 4]]
+$$
+
+$acc = 84.00\%$
+
+This suggests that the shape of each distribution (elongated vs. circular) has little effect on prediction accuracy, while how tightly each distribution is concentrated matters considerably: the more concentrated the distributions, the higher the accuracy.
+
+#### Effect of the Mixing Proportions
+
+The test-data distributions and the corresponding generation parameters are shown below (only $prior$ is changed; all other parameters stay the same). Note that `generate` uses each `prior[i]` directly as a fraction of `sz`, so the values need not sum to 1 (the second setting below sums to 1.2):
+
+![](img/fig_test40.png)
+$$
+prior_0 = 0.3 \\
+prior_1 = 0.3 \\
+prior_2 = 0.4
+$$
+
+$acc = 90\%$
+
+![](img/fig_test41.png)
+$$
+prior_0 = 0.2 \\
+prior_1 = 0.2 \\
+prior_2 = 0.8
+$$
+
+$acc = 91.25\%$
+
+![](img/fig_test42.png)
+$$
+prior_0 = 0.1 \\
+prior_1 = 0.5 \\
+prior_2 = 0.4
+$$
+
+$acc = 93.16\%$
+
+According to these results, the mixing proportions have little effect on prediction accuracy. A plausible guess is that changing the proportions barely changes the overlap between the distributions.
+
+### Summary
+
+For KNN prediction on 3 Gaussian distributions, the overlap between the distributions and the concentration of each distribution have the largest effect on prediction accuracy. In general, the lower the overlap and the more concentrated each distribution, the higher the accuracy.
+
+The distance function and the training method have some effect on the trained model parameter $K$, but little effect on prediction accuracy.
\ No newline at end of file
diff --git a/assignment-1/submission/18307130252/img/distribution.png b/assignment-1/submission/18307130252/img/distribution.png
new file mode 100644
index 0000000000000000000000000000000000000000..27d6931c7ac2ee0422c23665751ec04d0d06fa21
Binary files /dev/null and b/assignment-1/submission/18307130252/img/distribution.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test20.png b/assignment-1/submission/18307130252/img/fig_test20.png
new file mode 100644
index 0000000000000000000000000000000000000000..b55b1d6bcdf82554db31fb65edd59030fe2dfe29
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test20.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test21.png b/assignment-1/submission/18307130252/img/fig_test21.png
new file mode 100644
index 0000000000000000000000000000000000000000..23f9588279bba23fab850b8665e21efe54b52b37
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test21.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test22.png b/assignment-1/submission/18307130252/img/fig_test22.png
new file mode 100644
index 0000000000000000000000000000000000000000..9f56d0dcfa8847e03a6c75e7901c69f19e7da4d9
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test22.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test23.png b/assignment-1/submission/18307130252/img/fig_test23.png
new file mode 100644
index 0000000000000000000000000000000000000000..6896d8e9c568f859b026ef2280e7ea7c8c96edbd
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test23.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test30.png b/assignment-1/submission/18307130252/img/fig_test30.png
new file mode 100644
index 0000000000000000000000000000000000000000..dfd78ae7b402bf6a5b4b00bab88084e5bef60e58
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test30.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test31.png b/assignment-1/submission/18307130252/img/fig_test31.png
new file mode 100644
index 0000000000000000000000000000000000000000..bc393e5723ccfdedc2c82865280e9b1341125b5e
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test31.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test32.png b/assignment-1/submission/18307130252/img/fig_test32.png
new file mode 100644
index 0000000000000000000000000000000000000000..bc07432dbcc84583d0829ece2446cdc07205c6fb
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test32.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test33.png b/assignment-1/submission/18307130252/img/fig_test33.png
new file mode 100644
index 0000000000000000000000000000000000000000..88aba32b287279ff7fb530c8fa28a0249690dc94
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test33.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test40.png b/assignment-1/submission/18307130252/img/fig_test40.png
new file mode 100644
index 0000000000000000000000000000000000000000..8d7f3ff1b3496c6951a086b809db0b116efaf1d3
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test40.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test41.png b/assignment-1/submission/18307130252/img/fig_test41.png
new file mode 100644
index 0000000000000000000000000000000000000000..2ba24db8edaaa6f1bda47868bf3c4c4aa0589656
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test41.png differ
diff --git a/assignment-1/submission/18307130252/img/fig_test42.png b/assignment-1/submission/18307130252/img/fig_test42.png
new file mode 100644
index 0000000000000000000000000000000000000000..d683a39ac69f90b67e20302cba406b94955bc34c
Binary files /dev/null and b/assignment-1/submission/18307130252/img/fig_test42.png differ
diff --git a/assignment-1/submission/18307130252/img/k-folder_comparison.png b/assignment-1/submission/18307130252/img/k-folder_comparison.png
new file mode 100644
index 0000000000000000000000000000000000000000..e0c0a5722c7155de4b65c0b56e0bde916a99937a
Binary files /dev/null and b/assignment-1/submission/18307130252/img/k-folder_comparison.png differ
diff --git a/assignment-1/submission/18307130252/img/no_cross_1024.png b/assignment-1/submission/18307130252/img/no_cross_1024.png
new file mode 100644
index 0000000000000000000000000000000000000000..e8edc92d435c124b6fa48e551cf6544e9621a7d5
Binary files /dev/null and b/assignment-1/submission/18307130252/img/no_cross_1024.png differ
diff --git a/assignment-1/submission/18307130252/source.py b/assignment-1/submission/18307130252/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..73921fee31511653305a3ae9a70c8e4b29d72565
--- /dev/null
+++ b/assignment-1/submission/18307130252/source.py
@@ -0,0 +1,298 @@
+import numpy as np
+import matplotlib.pyplot as plt
+# from tqdm import tqdm
+
+# Selects the distance function used by distance():
+# 0 for Manhattan distance
+# 1 for Euclidean distance
+# 2 for Chebyshev distance
+dis_type = 1
+
+def draw_acc_pic(manhattan_acc, euclid_acc, chebyshev_acc, i):
+    # Plot accuracy against K for the three distance functions.
+    plt.xlabel("K value")
+    plt.ylabel("accuracy")
+    print(manhattan_acc)
+    print(euclid_acc)
+    print(chebyshev_acc)
+    ks = range(1, len(manhattan_acc) + 1)  # acc arrays start at K = 1
+    plt.plot(ks, manhattan_acc, color='r', label="manhattan_distance")
+    plt.plot(ks, euclid_acc, color='g', label="euclid_distance")
+    plt.plot(ks, chebyshev_acc, color='b', label="chebyshev_distance")
+    plt.legend()
+    plt.savefig("acc_pic" + str(i) + ".jpg")
+    plt.close()
+
+def draw_data(dataset, num, prefix):
+    # Scatter-plot a labeled 2-D dataset, one color per label.
+    X = [[], [], []]
+    Y = [[], [], []]
+    for point, label in dataset:
+        X[label].append(point[0])
+        Y[label].append(point[1])
+    plt.scatter(X[0], Y[0], s=10, c="r")
+    plt.scatter(X[1], Y[1], s=10, c="g")
+    plt.scatter(X[2], Y[2], s=10, c="b")
+    plt.savefig(prefix + str(num) + ".jpg")
+    plt.close()
+
+def draw_train_data(dataset, num):
+    draw_data(dataset, num, "fig_train")
+
+def draw_test_data(dataset, num):
+    draw_data(dataset, num, "fig_test")
+
+class KNN:
+
+    def __init__(self):
+        self.K_neighbors = -1      # chosen K; stays -1 until fit() is called
+        self.K_max = 200
+        self.K_folder = 5          # number of folds for cross-validation
+        self.train_data = []
+        self.train_label = []
+        self.dataset = []
+        self.cross_validation = 0  # 0: simple split, 1: K-fold cross-validation
+
+    def cross_validation_fit(self, train_data, train_label):
+        dataset = list(zip(train_data, train_label))
+        self.dataset = dataset
+        data_size = len(dataset)
+        np.random.shuffle(dataset)
+        train_size = int(data_size * 0.8)
+        dev_size = data_size - train_size   # size of one fold
+        best_acc = -1
+        acc_array = []
+        # for k in tqdm(range(1, self.K_max)):
+        for k in range(1, self.K_max):
+            tag = 0
+            # Each fold serves as the development set exactly once.
+            for fold in range(self.K_folder):
+                dev_begin = dev_size * fold
+                dev_end = min(data_size, dev_size * (fold + 1))
+                devset = dataset[dev_begin:dev_end]
+                trainset = dataset[:dev_begin] + dataset[dev_end:]
+                for (dev_data, dev_label) in devset:
+                    dis = []
+                    for (t_data, t_label) in trainset:
+                        dis.append((distance(dev_data, t_data), t_label))
+                    dis.sort(key=lambda d: d[0])
+                    labels = [label for (_, label) in dis[:k]]
+                    tag += int(max(labels, key=labels.count) == dev_label)
+            if best_acc < tag:
+                best_acc = tag
+                self.K_neighbors = k
+            acc_array.append(tag / data_size)  # every point was a dev point once
+
+        # print("cross_validation: ", self.K_neighbors)
+        return acc_array
+
+    def simple_validation_fit(self, train_data, train_label):
+        dataset = list(zip(train_data, train_label))
+        self.dataset = dataset
+        data_size = len(dataset)
+        np.random.shuffle(dataset)
+        train_size = int(data_size * 0.8)
+        dev_size = data_size - train_size
+        trainset = dataset[:train_size]
+        devset = dataset[train_size:]
+        best_acc = -1
+        acc_array = []
+
+        # for k in tqdm(range(1, self.K_max)):
+        for k in range(1, self.K_max):
+            tag = 0
+            for (dev_data, dev_label) in devset:
+                dis = []
+                for (t_data, t_label) in trainset:
+                    dis.append((distance(dev_data, t_data), t_label))
+                dis.sort(key=lambda d: d[0])
+                labels = [label for (_, label) in dis[:k]]
+                tag += int(max(labels, key=labels.count) == dev_label)
+            if best_acc < tag:
+                best_acc = tag
+                self.K_neighbors = k
+            acc_array.append(tag / dev_size)
+
+        # print("simple_validation: ", self.K_neighbors)
+        return acc_array
+
+    def fit(self, train_data, train_label):
+        self.train_data = train_data
+        self.train_label = train_label
+        if self.cross_validation == 0:
+            return self.simple_validation_fit(train_data, train_label)
+        else:
+            return self.cross_validation_fit(train_data, train_label)
+
+    def predict(self, test_data):
+        if self.K_neighbors == -1:
+            print("Error: KNN.fit(train_data, train_label) must be called before KNN.predict(test_data).")
+            return None
+        pred_label = []
+        for cur in test_data:
+            dis = []
+            for (t_data, t_label) in self.dataset:
+                dis.append((distance(cur, t_data), t_label))
+            dis.sort(key=lambda d: d[0])
+            labels = [label for (_, label) in dis[:self.K_neighbors]]
+            pred_label.append(max(labels, key=labels.count))
+        return pred_label
+
+def distance(A, B):
+    if dis_type == 0:
+        return manhattan_distance(A, B)
+    elif dis_type == 1:
+        return euclid_distance(A, B)
+    elif dis_type == 2:
+        return chebyshev_distance(A, B)
+    else:
+        raise ValueError("dis_type %r is not implemented." % dis_type)
+
+def chebyshev_distance(A, B):
+    return max(abs(A[0] - B[0]), abs(A[1] - B[1]))
+
+def manhattan_distance(A, B):
+    return abs(A[0] - B[0]) + abs(A[1] - B[1])
+
+def euclid_distance(A, B):
+    # Squared Euclidean distance; the square root is skipped because it
+    # does not change the neighbor ranking.
+    return (A[0] - B[0]) * (A[0] - B[0]) + (A[1] - B[1]) * (A[1] - B[1])
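+
+# Illustrative aside (not called anywhere above): if the training points were
+# stored in one NumPy array of shape (n, 2), all distances to a query could be
+# computed in a single broadcast instead of a Python loop, which would remove
+# most of the per-point overhead in fit()/predict(). A sketch for the squared
+# Euclidean case; the function name is hypothetical:
+def euclid_distance_batch(query, points):
+    diff = np.asarray(points, dtype=float) - np.asarray(query, dtype=float)
+    return (diff * diff).sum(axis=1)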
+
+def generate(mean, cov, prior, sz, num):
+    # Draw int(sz * prior[i]) points from the i-th Gaussian and label them i.
+    tot = [int(sz * prior[i]) for i in range(3)]
+    dataset = [(list(x), 0) for x in np.random.multivariate_normal(mean[0], cov[0], size=tot[0])]
+    dataset.extend([(list(x), 1) for x in np.random.multivariate_normal(mean[1], cov[1], size=tot[1])])
+    dataset.extend([(list(x), 2) for x in np.random.multivariate_normal(mean[2], cov[2], size=tot[2])])
+    dataset_size = len(dataset)
+    np.random.shuffle(dataset)
+    train_size = int(dataset_size * 0.8)
+    train_data, train_label = zip(*(dataset[:train_size]))
+    test_data, test_label = zip(*(dataset[train_size:]))
+
+    draw_train_data(dataset[:train_size], num)
+    draw_test_data(dataset[train_size:], num)
+
+    # dtype=object is needed: the nested tuples do not form a rectangular array.
+    np.save("data.npy", np.array(((train_data, train_label), (test_data, test_label)), dtype=object))
+
+def read():
+    (train_data, train_label), (test_data, test_label) = np.load("data.npy", allow_pickle=True)
+    return (train_data, train_label), (test_data, test_label)
+
+if __name__ == "__main__":
+
+    mean = [[0, 0], [-2, -2], [3, -2]]
+    cov = [[[1, 0], [0, 1]], [[1, 0], [0, 2]], [[2, 0], [0, 1]]]
+    prior = [0.3, 0.3, 0.4]
+    sz = 3000
+    generate(mean, cov, prior, sz, 1)
+    (train_data, train_label), (test_data, test_label) = read()
+
+    model = KNN()
+
+    # Compare the three distance functions over the whole K range.
+    dis_type = 0
+    manhattan_acc = model.fit(train_data, train_label)
+    dis_type = 1
+    euclid_acc = model.fit(train_data, train_label)
+    dis_type = 2
+    chebyshev_acc = model.fit(train_data, train_label)
+    draw_acc_pic(manhattan_acc, euclid_acc, chebyshev_acc, 1024)
+    res = model.predict(test_data)
+
+    dis_type = 0
+
+    model = KNN()
+    normal_acc = []
+    kfolder_acc = []
+
+    # Simple split vs. 5-fold cross-validation on five fresh datasets.
+    for i in range(1, 6):
+        generate(mean, cov, prior, sz, 10 + i)
+        (train_data, train_label), (test_data, test_label) = read()
+        model.cross_validation = 0
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("id: ", i, " normal_acc: ", np.mean(np.equal(res, test_label)))
+        normal_acc.append(np.mean(np.equal(res, test_label)))
+        model.cross_validation = 1
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("id: ", i, " kfolder_acc: ", np.mean(np.equal(res, test_label)))
+        kfolder_acc.append(np.mean(np.equal(res, test_label)))
+
+    print(normal_acc)
+    print(kfolder_acc)
+
+    # Effect of the means.
+    mean = [
+        [[0, 0], [-1, -1], [1, 1]],
+        [[0, 0], [-3, -3], [3, 3]],
+        [[0, 0], [-1, -1], [1.5, -1]],
+        [[0, 0], [-2, -2], [3, -2]]
+    ]
+    cov = [[[1, 0], [0, 1]], [[1, 0], [0, 2]], [[2, 0], [0, 1]]]
+    prior = [0.3, 0.3, 0.4]
+    sz = 3000
+    for i in range(0, 4):
+        generate(mean[i], cov, prior, sz, 20 + i)
+        (train_data, train_label), (test_data, test_label) = read()
+        model.cross_validation = 0
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("id: ", i, " normal_acc: ", np.mean(np.equal(res, test_label)))
+        normal_acc.append(np.mean(np.equal(res, test_label)))
+    print(normal_acc)
+
+    # Effect of the covariances.
+    mean = [[0, 0], [-2, -2], [3, -2]]
+    cov = [
+        [[[1, 0], [0, 1]], [[1, 0], [0, 2]], [[2, 0], [0, 1]]],
+        [[[2, 0], [0, 2]], [[2, 0], [0, 4]], [[4, 0], [0, 2]]],
+        [[[10, 5], [5, 10]], [[5, 3], [3, 2]], [[4, 2], [2, 8]]],
+        [[[5, 2.5], [2.5, 5]], [[2.5, 1.5], [1.5, 1]], [[2, 1], [1, 4]]]
+    ]
+    prior = [0.3, 0.3, 0.4]
+    sz = 3000
+    for i in range(0, 4):
+        generate(mean, cov[i], prior, sz, 30 + i)
+        (train_data, train_label), (test_data, test_label) = read()
+        model.cross_validation = 0
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("id: ", i, " normal_acc: ", np.mean(np.equal(res, test_label)))
+        normal_acc.append(np.mean(np.equal(res, test_label)))
+
+    # Effect of the priors.
+    mean = [[0, 0], [-2, -2], [3, -2]]
+    cov = [[[1, 0], [0, 1]], [[1, 0], [0, 2]], [[2, 0], [0, 1]]]
+    prior = [
+        [0.3, 0.3, 0.4],
+        [0.2, 0.2, 0.8],
+        [0.1, 0.5, 0.4]
+    ]
+    sz = 3000
+    for i in range(0, 3):
+        generate(mean, cov, prior[i], sz, 40 + i)
+        (train_data, train_label), (test_data, test_label) = read()
+        model.cross_validation = 0
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("id: ", i, " normal_acc: ", np.mean(np.equal(res, test_label)))
+        normal_acc.append(np.mean(np.equal(res, test_label)))
+    print(normal_acc)
\ No newline at end of file