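diff --git a/assignment-1/submission/19307130062/README.md b/assignment-1/submission/19307130062/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..275c24920b61f9b6a030f69350e8a8d9b39bee74
--- /dev/null
+++ b/assignment-1/submission/19307130062/README.md
@@ -0,0 +1,94 @@
+# Assignment 1: KNN Implementation
+
+- **Name: 高庆麾**
+- **Student ID: 19307130062**
+
+
+
+## Summary of Work
+
+- Implemented the basic KNN algorithm
+  - Several distance metrics are available
+    - Manhattan distance
+    - Euclidean distance
+    - Chebyshev distance
+    - $L_p$ norm, where $0 < p \leqslant 100$
+  - Several data preprocessing options are available
+    - min-max normalization
+    - z-score normalization
+    - no preprocessing
+  - **Two underlying algorithms to choose from**
+    - Brute force
+      - For each sample in the validation set, compute its distance to every training sample, sort the distances, and predict the class that occurs most often among the $K$ nearest training points
+    - **KD-tree**
+      - The space is split in two at every step, which yields a tree in which each subtree covers the training points inside one region of space. A query traverses the tree while maintaining the $K$ nearest points found so far; if even the smallest possible distance to a subtree is larger than the largest distance among those $K$ points, the subtree is skipped
+      - The $K$ nearest points found so far are kept in a priority queue (max-heap). The lower bound on the distance to a subtree is compared with the heap top, and if the heap top is smaller, the subtree does not need to be visited
+      - Since external libraries are not allowed, the max-heap and KD-tree classes are hand-written; because time was tight, they are not yet very general
+  - Tested on several sample distributions
+    - Four clusters placed on a square
+    - Multiple clusters placed on a circle
+  - Wrote code that automatically plots detailed figures of the experimental data
+    - Comparison of prediction accuracy for different numbers of clusters on a circle
+    - Illustration of the KD-tree algorithm
+
+
+
+## Experimental Results
+
+### Multiple clusters placed on a circle
+
+Samples are generated by placing the centers of Gaussian distributions evenly on a circle of radius $r$; each center generates $300$ data points, which form a cluster. The experiments below compare how different values of $r$ (that is, different distances between clusters) affect the accuracy of KNN.
+
+
+
+#### r = 2.0
+
+The $n$ above each image in the first row is the number of clusters, and the $\mathrm{acc },\ \mathrm{k}$ below each image in the second row are the accuracy and the best value of the hyperparameter $K$ for that experiment. To show the predictions more clearly, the second-row images draw training samples in a lighter color; a test sample is drawn in the color of its class if it is predicted correctly and in black otherwise. Comparing the upper and lower image in each column shows where the black points originally lay, which leads to some useful observations.![1](img/1.png)
+
+The black (misclassified) points mostly lie on the boundaries between clusters or in remote, sparsely populated regions, which is what the KNN prediction rule would lead us to expect. As the number of clusters grows while they stay this close together, the accuracy of KNN drops quickly.
+
+#### r = 3.0![2](img/2_2.png)
+
+Compared with the previous experiment, the distance between clusters (that is, the radius of the circle) is slightly larger, so the clusters lie farther apart. In this setting the accuracy of KNN rises noticeably. Most black points still lie on the boundaries between clusters.
+
+As the number of clusters grows, the search also tends to select a larger $K$ to get better predictions.
+
+#### r = 4.0![3](img/3.png)
+
+The clusters are moved apart once more. In the first two settings, with few clusters, the clusters barely intersect or interfere, and the accuracy even reaches $100\%$. In the last two settings, with more clusters, there is little intersection or overlap and some of the points that are easy to misclassify did not end up in the test set, so the loss in accuracy is small.
+
+
+
+#### Summary
+
+When the samples are densely packed and the classes interfere strongly with one another, the accuracy of KNN suffers noticeably. The misclassified samples mostly lie on the boundaries between classes and in regions where samples are sparse and mixed.
+
+
+
+### Effect of the distance metric
+
+#### Larger sample ($100$ points per cluster)
+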
+From the second row downward: Chebyshev distance, Manhattan distance, Euclidean distance, the $L_{0.8}$ norm, and the $L_{1.3}$ norm
+
+![6](img/6.png)
+
+The differences look small. Possibly the large sample size masks the effect of the distance metric, or some test points simply lie so far out that no metric can classify them correctly.
+
+
+
+#### Smaller sample ($30$ points per cluster)![6_3](img/6_3.png)
+
+Here the number of samples is reduced, and the clusters are moved closer together to compensate for the sparser data. With fewer samples there is still no large difference, although Manhattan distance appears to get a few more test points right. This suggests that the choice of distance metric (at least among the common ones) has little effect on the predictions in this setting.
+
+### Correctness test of the KD-tree![KDT](img/KDT.png)
+
+The predictions made with the KD-tree also look largely correct in the plot, which supports the correctness of the KD-tree implementation.
+
+
+
+### KD-tree illustration
+
+![7](img/7.png)
+
+At each step we pick the dimension with the largest variance, split the data at the median of that dimension, and build the subtrees recursively. As the figure shows, this divides the sample space into many rectangles. These rectangles let us prune the search on the KD-tree, which makes finding the K nearest neighbors more efficient.
\ No newline at end of file
diff --git a/assignment-1/submission/19307130062/img/1 b/assignment-1/submission/19307130062/img/1
new file mode 100644
index 0000000000000000000000000000000000000000..460701a919885030077d0048071ad69101ed7952
Binary files /dev/null and b/assignment-1/submission/19307130062/img/1 differ
diff --git a/assignment-1/submission/19307130062/img/1.png b/assignment-1/submission/19307130062/img/1.png
new file mode 100644
index 0000000000000000000000000000000000000000..ea4b0220e207f4e530fe5d88d0830ede9375f2f8
Binary files /dev/null and b/assignment-1/submission/19307130062/img/1.png differ
diff --git a/assignment-1/submission/19307130062/img/2 b/assignment-1/submission/19307130062/img/2
new file mode 100644
index 0000000000000000000000000000000000000000..fc12c932e138b26bbfe906523b7cba2231a788ac
Binary files /dev/null and b/assignment-1/submission/19307130062/img/2 differ
diff --git a/assignment-1/submission/19307130062/img/2.png b/assignment-1/submission/19307130062/img/2.png
new file mode 100644
index 0000000000000000000000000000000000000000..8396d6afd62a14f6857c15c4858c87a34540567f
Binary files /dev/null and b/assignment-1/submission/19307130062/img/2.png differ
diff --git a/assignment-1/submission/19307130062/img/2_2.png b/assignment-1/submission/19307130062/img/2_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..0a39ea6dbca184a4fc17dbd750d25dab1c14a01d
Binary files /dev/null and
b/assignment-1/submission/19307130062/img/2_2.png differ
diff --git a/assignment-1/submission/19307130062/img/3 b/assignment-1/submission/19307130062/img/3
new file mode 100644
index 0000000000000000000000000000000000000000..136912d5a39663e95bef7179e0d83bfa4cb46e96
Binary files /dev/null and b/assignment-1/submission/19307130062/img/3 differ
diff --git a/assignment-1/submission/19307130062/img/3.png b/assignment-1/submission/19307130062/img/3.png
new file mode 100644
index 0000000000000000000000000000000000000000..c4f8d91de8475590823f051fdc137c57ba046a82
Binary files /dev/null and b/assignment-1/submission/19307130062/img/3.png differ
diff --git a/assignment-1/submission/19307130062/img/4 b/assignment-1/submission/19307130062/img/4
new file mode 100644
index 0000000000000000000000000000000000000000..bfbb302218cb972ba760b524d0612bde391a6035
Binary files /dev/null and b/assignment-1/submission/19307130062/img/4 differ
diff --git a/assignment-1/submission/19307130062/img/6.png b/assignment-1/submission/19307130062/img/6.png
new file mode 100644
index 0000000000000000000000000000000000000000..47cea6bbb7e6f493ad5fe7211086e345024fa1d0
Binary files /dev/null and b/assignment-1/submission/19307130062/img/6.png differ
diff --git a/assignment-1/submission/19307130062/img/6_.png b/assignment-1/submission/19307130062/img/6_.png
new file mode 100644
index 0000000000000000000000000000000000000000..58283b746816b40f10ea405e3a6ac89dde63d315
Binary files /dev/null and b/assignment-1/submission/19307130062/img/6_.png differ
diff --git a/assignment-1/submission/19307130062/img/6_2.png b/assignment-1/submission/19307130062/img/6_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..2b3473bef0c2b16a76d50322fd73a259f5c471a7
Binary files /dev/null and b/assignment-1/submission/19307130062/img/6_2.png differ
diff --git a/assignment-1/submission/19307130062/img/6_3.png b/assignment-1/submission/19307130062/img/6_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..7903c2b4032a38c5e45eedd6a65ca45d0f5169e1
Binary files /dev/null and b/assignment-1/submission/19307130062/img/6_3.png differ
diff --git a/assignment-1/submission/19307130062/img/7.png b/assignment-1/submission/19307130062/img/7.png
new file mode 100644
index 0000000000000000000000000000000000000000..41e6905f3756d308ece231ce3a7449757321a276
Binary files /dev/null and b/assignment-1/submission/19307130062/img/7.png differ
diff --git a/assignment-1/submission/19307130062/img/KDT.png b/assignment-1/submission/19307130062/img/KDT.png
new file mode 100644
index 0000000000000000000000000000000000000000..5d8b3d863d184798bbf7d2472450fbef84828f32
Binary files /dev/null and b/assignment-1/submission/19307130062/img/KDT.png differ
diff --git a/assignment-1/submission/19307130062/img/chebyshev_3.png b/assignment-1/submission/19307130062/img/chebyshev_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..368e7c791c6b2c685df51d83a449f1f5027c126c
Binary files /dev/null and b/assignment-1/submission/19307130062/img/chebyshev_3.png differ
diff --git a/assignment-1/submission/19307130062/img/manhattan_2.png b/assignment-1/submission/19307130062/img/manhattan_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..f8e855629b5c92806580ddf4aae220aa5819205c
Binary files /dev/null and b/assignment-1/submission/19307130062/img/manhattan_2.png differ
diff --git a/assignment-1/submission/19307130062/img/manhattan_3.png b/assignment-1/submission/19307130062/img/manhattan_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..67ca90ad58e6e0a94b99a3914e0cb94960684992
Binary files /dev/null and b/assignment-1/submission/19307130062/img/manhattan_3.png differ
diff --git a/assignment-1/submission/19307130062/source.py b/assignment-1/submission/19307130062/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..5dcd5309efd705738415b1ca54cdc83e78f04676
--- /dev/null
+++ b/assignment-1/submission/19307130062/source.py
@@ -0,0 +1,527 @@
+import numpy as np
+import matplotlib.pyplot as plt
+import numpy.random as rd
+
+class Heap:
+    def __init__(self):
+        self.node = [0]
+        self.label = [0]
+        self.num = 0
+        self.tail = 0
+
+    def top(self):
+        return self.node[1]
+
+    def pop(self):
+        """
+        Pop the top element: move the last element to the root, then swap downward from the top so that the max-heap property is restored
+        """
+        self.node[1] = self.node[self.num]
+        self.label[1] = self.label[self.num]
+        self.num -= 1
+        now = 1
+        while True:
+            nex, mx = -1, 0
+            if now << 1 <= self.num:
+                if self.node[now << 1] > mx or nex == -1:
+                    mx = self.node[now << 1]
+                    nex = now << 1
+            if now << 1 | 1 <= self.num:
+                if self.node[now << 1 | 1] > mx or nex == -1:
+                    mx = self.node[now << 1 | 1]
+                    nex = now << 1 | 1
+            if nex == -1 or mx < self.node[now]:
+                return
+            else:
+                self.node[now], self.node[nex] = self.node[nex], self.node[now]
+                self.label[now], self.label[nex] = self.label[nex], self.label[now]
+                now = nex
+
+    def push(self, x, y):
+        """
+        Insert an element, then swap upward from the bottom so that the max-heap property is restored
+        """
+        self.num += 1
+        if self.num > self.tail:
+            self.node.append(x)
+            self.label.append(y)
+            self.tail += 1
+        else:
+            self.node[self.num] = x
+            self.label[self.num] = y
+        now = self.num
+        while True:
+            if now > 1 and self.node[now >> 1] < self.node[now]:
+                self.node[now >> 1], self.node[now] = self.node[now], self.node[now >> 1]
+                self.label[now >> 1], self.label[now] = self.label[now], self.label[now >> 1]
+                now >>= 1
+            else:
+                return
+
+    def size(self):
+        return self.num
+
+    def clear(self):
+        self.node = [0]
+        self.label = [0]
+        self.num = 0
+        self.tail = 0
+
+class node:
+    def __init__(self, dim):
+        self.father = None
+        self.left = None
+        self.right = None
+        self.label = None
+        self.mi = np.zeros(dim)
+        self.mx = np.zeros(dim)
+        self.data = None
+        for i in range(dim):
+            self.mi[i], self.mx[i] = 2e18, -2e18
+
+class KDT:
+
+
+    def __init__(self, data, label, dim, u, d, l, r, visualize = False):
+        self.node_count = 0
+        self.dim = dim
+        self.root = self.new_node()
+        self.build(data, label, self.root, u, d, l, r, visualize)
+        self.heap = Heap()
+
+    def new_node(self):
+        self.node_count += 1
+        return node(self.dim)
+
+    def build(self, data, label, now, u, d, l, r, visualize):
+        """ Build the tree
+        u, d, l, r, visualize only control drawing of the KDT illustration and do not affect the algorithm itself
+        data, label are the data and labels contained in the current subtree
+        now is the KDT node that corresponds to the current subtree
+        """
+        split_dim = np.argmax(np.array([np.var(data[:, dim]) for dim in range(self.dim)])) # split along the dimension with the largest variance
+        split_value = np.median(data[:, split_dim]) # median along the split dimension
+        data_left, label_left = data[data[:, split_dim] < split_value], label[data[:, split_dim] < split_value] # left half of the data
+        data_right, label_right = data[data[:, split_dim] > split_value], label[data[:, split_dim] > split_value] # right half of the data
+        data_mid, label_mid = data[data[:, split_dim] == split_value], label[data[:, split_dim] == split_value] # data stored at the current node
+        now.data, now.label = data_mid, label_mid
+
+        if visualize: # visualization: draw how the KDT partitions the space
+            if split_dim == 0:
+                plt.plot([split_value, split_value], [d, u])
+            elif split_dim == 1:
+                plt.plot([l, r], [split_value, split_value])
+
+        if data_mid.size:
+            for dim in range(data_mid.shape[1]):
+                now.mx[dim] = max(data_mid[:, dim])
+                now.mi[dim] = min(data_mid[:, dim])
+        if data_left.size != 0: # build recursively and record each subtree's per-dimension extrema, used later for the distance lower bound and pruning
+            now.left = self.new_node()
+            now.left.father = now
+
+            if split_dim == 0:
+                mx_left, mi_left = self.build(data_left, label_left, now.left, u, d, l, split_value, visualize)
+            elif split_dim == 1:
+                mx_left, mi_left = self.build(data_left, label_left, now.left, split_value, d, l, r, visualize)
+            else:
+                mx_left, mi_left = self.build(data_left, label_left, now.left, u, d, l, r, visualize)
+
+            for dim in range(self.dim):
+                now.mx[dim], now.mi[dim] = max(now.mx[dim], mx_left[dim]), min(now.mi[dim], mi_left[dim])
+        if data_right.size != 0:
+            now.right = self.new_node()
+            now.right.father = now
+
+            if split_dim == 0:
+                mx_right, mi_right = self.build(data_right, label_right, now.right, u, d, split_value, r, visualize)
+            elif split_dim == 1:
+                mx_right, mi_right = self.build(data_right, label_right, now.right, u, split_value, l, r, visualize)
+            else:
+                mx_right, mi_right = self.build(data_right, label_right, now.right, u, d, l, r, visualize)
+
+            for dim in range(self.dim):
+                now.mx[dim], now.mi[dim] = max(now.mx[dim], mx_right[dim]), min(now.mi[dim], mi_right[dim])
+        return now.mx, now.mi
+
+    def dist(self, x, y):
+        return np.sqrt(sum((x - y)**2))
+
+    def plane_dist(self, x, mi, mx):
+        """
+        Compute a lower bound on the distance from the query point to the data in a subtree (the distance to the subtree's bounding box); it can be viewed as a hyperplane distance
+        """
+        res = 0.
+        for dim in range(self.dim):
+            if mi[dim] <= x[dim] <= mx[dim]:
+                res += 0.
+            else:
+                res += min(abs(x[dim] - mx[dim]), abs(x[dim] - mi[dim]))**2
+        return np.sqrt(res)
+
+    def traverse(self, now):
+        """
+        Traverse the tree to find the K nearest neighbors
+        """
+        if now.data.size:
+            mi_idx = np.argmin(np.array([self.dist(data, self.query) for data in now.data]))
+            result = now.data[mi_idx]
+            mi_dist = self.dist(self.query, result)
+            if self.heap.size() < self.query_k:
+                self.heap.push(mi_dist, now.label[mi_idx])
+            elif self.heap.top() > mi_dist:
+                self.heap.pop()
+                self.heap.push(mi_dist, now.label[mi_idx])
+
+        if now.left is not None:
+            if self.heap.size() < self.query_k or self.plane_dist(self.query, now.left.mi, now.left.mx) < self.heap.top():
+                self.traverse(now.left)
+        if now.right is not None:
+            if self.heap.size() < self.query_k or self.plane_dist(self.query, now.right.mi, now.right.mx) < self.heap.top():
+                self.traverse(now.right)
+
+    def get(self, data, k):
+        self.query, self.query_k = data, k
+        self.heap.clear()
+        self.traverse(self.root)
+
+        return self.heap.node[1:], self.heap.label[1:]
+
+class KNN:
+
+    # knn_algo  underlying algorithm of KNN
+    # k_best    best hyperparameter k
+    # k_lower   lower bound of the scan over k
+    # k_upper   upper bound of the scan over k (exclusive)
+    # k_step    step of the scan over k
+
+    def distance(self, x, y, type = "euclidean", lp = -1.):
+        """
+        Compute one of several distances: Euclidean, Manhattan, Chebyshev,
+        or the more general Lp norm (p must be a positive real number no greater than 100); an invalid request
+        returns 0.
+        """
+        if type == "euclidean":
+            return np.sqrt(sum((x - y)**2))
+        elif type == "manhattan":
+            return sum(abs(x - y))
+        elif type == "chebyshev":
+            return max(abs(x - y))
+        elif type == "Lp":
+            if 0 < lp <= 100.:
+                return sum((abs(x - y)) ** lp) ** (1. / lp)
+            else:
+                print("Error: Lp-norm is illegal.")
+                return 0.
+        else:
+            print("Error: Can't identify the type of distance.")
+            return 0.
+
+
+
+    def assess(self, train_label, test_label, type = "accuracy"):
+        """
+        Evaluate the predicted labels
+        """
+        if type == "accuracy":
+            return sum(train_label == test_label) / train_label.shape[0]
+        else:
+            print("Error: Can't identify the type of assessment.")
+            return -1.
+
+
+
+    def __init__(self, algo = "brute_force", lower = 1, upper = 20, step = 1, dist_type = "euclidean", lp = 2.):
+        """
+        Choose the underlying KNN algorithm at construction time: brute force or KD-tree
+        """
+        if algo == "brute_force":
+            self.knn_algo = "brute_force"
+        elif algo == "kd_tree":
+            self.knn_algo = "kd_tree"
+        else:
+            print("Error: Can't identify the algorithm of KNN.")
+
+        self.k_best = 1
+        self.k_lower = lower
+        self.k_upper = upper
+        self.k_step = step
+        self.dist_type = dist_type
+        self.lp = lp
+
+
+
+    def run(self, train_data, train_label, test_data, k):
+        """
+        Run the KNN algorithm
+        """
+        result = []
+        if self.knn_algo == "brute_force":
+            for datax in test_data:
+                train_data_dist = np.array([self.distance(datax, datay, self.dist_type, self.lp) for datay in train_data])
+                k_nearest_label = train_label[np.argsort(train_data_dist)[:min(k, len(train_data_dist))]]
+                result.append(np.argmax(np.bincount(k_nearest_label)))
+            return np.array(result)
+        elif self.knn_algo == "kd_tree":
+            self.kdt = KDT(train_data, train_label, train_data.shape[1], 6, -6, -6, 6)
+            for datax in test_data:
+                dist_res, label_res = self.kdt.get(datax, k)
+                result.append(np.argmax(np.bincount(label_res)))
+            return np.array(result)
+        else:
+            print("Error: Can't identify the algorithm of KNN.")
+            return
+
+
+    def fit(self, train_data, train_label, preprocess = "none"):
+        """
+        Convert the data to numpy.ndarray and shuffle it randomly
+        """
+        train_data, train_label = np.array(train_data), np.array(train_label)
+        index = [i for i in range(len(train_data))]
+        np.random.shuffle(index)
+        train_data = train_data[index]
+        train_label = train_label[index]
+        """
+        The data can be preprocessed with z_score or min_max, or left as is;
+        note that preprocessing is applied to each dimension separately
+        """
+        if preprocess == "min_max":
+            # per-dimension extrema (arrays of length dim), computed on the training data
+            self.mx = train_data.max(axis = 0)
+            self.mi = train_data.min(axis = 0)
+            train_data = (train_data - self.mi) / (self.mx - self.mi)
+        elif preprocess == "z_score":
+            # per-dimension mean and standard deviation, computed on the training data
+            self.mu = train_data.mean(axis = 0)
+            self.std = train_data.std(axis = 0)
+            train_data = (train_data - self.mu) / self.std
+        elif preprocess == "none":
+            pass
+        else:
+            print("Error: Can't identify the algorithm of preprocessing.")
+        """
+        Store the required data in member variables of the object
+        """
+        self.preprocess = preprocess
+        self.train_data = train_data
+        self.train_label = train_label
+        """
+        Split the data 4:1 into a training set and a validation set, search the given range of the
+        hyperparameter k, choose the best k, and print the accuracy of every candidate
+
+        Of course, KNN has no real training phase; it merely memorizes the training set. The term
+        is borrowed here for convenience
+
+        The mean accuracy is computed with a procedure similar to K-fold cross-validation
+        """
+        self.best_acc = 0.
+        training_epoch = 5
+        dim = train_data.shape[1]
+        for now_k in range(self.k_lower, self.k_upper, self.k_step):
+            tot_acc = 0
+            for epoch in range(training_epoch):
+                l = (train_data.shape[0] * epoch) // training_epoch
+                r = (train_data.shape[0] * (epoch + 1)) // training_epoch
+                acc = self.assess(self.run(
+                    train_data = np.append(train_data[0:l], train_data[r:]).reshape(-1, dim),
+                    train_label = np.append(train_label[0:l], train_label[r:]).reshape(-1),
+                    test_data = train_data[l:r],
+                    k = now_k),
+                    test_label = train_label[l:r])
+                tot_acc += acc
+                print("\taccuracy on epoch %d is: %f" % (epoch, acc))
+            tot_acc /= training_epoch
+            if tot_acc > self.best_acc:
+                self.best_k, self.best_acc = now_k, tot_acc
+            print("accuracy on k = %d is: %f\n" % (now_k, tot_acc))
+
+
+
+    def predict(self, test_data):
+        test_data = np.array(test_data, dtype = float)
+
+        if self.preprocess == "min_max":
+            for dim in range(test_data.shape[1]):
+                test_data[:, dim] = (test_data[:, dim] - self.mi[dim]) / (self.mx[dim] - self.mi[dim])
+        elif self.preprocess == "z_score":
+            for dim in range(test_data.shape[1]):
+                test_data[:, dim] = (test_data[:, dim] - self.mu[dim]) / self.std[dim]
+        elif self.preprocess == "none":
+            pass
+        else:
+            print("Error: Can't identify the algorithm of preprocessing.")
+
+        result = self.run(self.train_data, self.train_label, test_data, self.best_k)
+        return result
+
+
+
+def gen(num, mean, cov):
+    return np.random.multivariate_normal(mean, cov, num)
+
+def gen_on_circle(group_num, dist, num, cov):
+    """
+    Place the centers evenly on a circle and generate the clusters
+    """
+    data = np.zeros((group_num, num, 2))
+    label = np.zeros((group_num, num), dtype = 'int64')
+    for i in range(group_num):
+        data[i] = gen(num, [dist * np.cos(2 * np.pi * i / group_num), dist * np.sin(2 * np.pi * i / group_num)], cov)
+        label[i] += i
+    return data, label
+
+def gen_on_square(dist, num, cov):
+    """
+    Place the centers at the four corners of a square and generate the clusters
+    """
+    data = np.zeros((2, 2, num, 2))
+    label = np.zeros((2, 2, num), dtype = 'int64')
+    for i in range(2):
+        for j in range(2):
+            data[i, j] = gen(num, [dist * i, dist * j], cov)
+            label[i, j] += (i * 2 + j)
+    return data, label
+
+def gen_random_char():
+    x = rd.randint(0, 36)
+    if x < 10:
+        return chr(ord('0') + x)
+    else:
+        return chr(ord('a') + x - 10)
+
+def gen_random_filename():
+    """
+    Generate a random string of lowercase letters and digits, used as the file name of a saved figure to avoid collisions
+    """
+    s = ""
+    for i in range(20):
+        s += gen_random_char()
+    return s
+
+def test_on_circle():
+    data_num, train_num, data_dist, group_lower, group_upper, lc = 300, 240, 3, 3, 7, 0.9
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    light_color = np.array([(0., 0., 0.), (lc, lc, 1.), (lc, 1., lc), (1., lc, lc), (0.8, 1., 1.), (1., lc, 1.), (1., 1., lc)])
+    plt.figure()
+    fig, axe = plt.subplots(2, group_upper - group_lower)
+    axe = axe.reshape(2, -1)
+
+    font = {'family': 'serif',
+        'color': 'black',
+        'weight': 'normal',
+        'size': 8,
+        }
+
+    for i in range(group_lower, group_upper):
+        _train_num = i * train_num
+        data_groups, label_groups = gen_on_circle(i, data_dist, data_num, [[1, 0], [0, 1]])
+        for j in range(i):
+            axe[0, i - group_lower].scatter(data_groups[j, :, 0], data_groups[j, :, 1], color = color[j + 1], s = 4.)
+        axe[0, i - group_lower].set_title("n = %d" % i, fontdict = font)
+
+        data, label = [], []
+        for j in range(i):
+            data = np.append(data, data_groups[j])
+            label = np.append(label, label_groups[j])
+        data = data.reshape(-1, 2)
+        label = label.reshape(-1).astype(np.int64)
+
+        index = [i for i in range(len(data))]
+        np.random.shuffle(index)
+        data = data[index]
+        label = label[index]
+
+        model = KNN(algo = "kd_tree")
+        model.fit(data[:_train_num], label[:_train_num])
+        res = model.predict(data[_train_num:])
+        axe[1, i - group_lower].scatter(data[:_train_num, 0], data[:_train_num, 1], color = light_color[label[:_train_num] + 1], s = 4.)
+        axe[1, i - group_lower].scatter(data[_train_num:, 0], data[_train_num:, 1], color = color[(res == label[_train_num:]) * (1 + label[_train_num:])], s = 4.)
+        acc = np.mean(np.equal(res, label[_train_num:]))
+        axe[1, i - group_lower].set_xlabel("acc = %.3lf, k = %d" % (acc, model.best_k), fontdict = font)
+
+        print("acc =", acc)
+
+    plt.savefig(gen_random_filename(), format = 'png', dpi = 1000)
+
+
+def test_distance_type():
+    data_num, train_num, data_dist, group_lower, group_upper, lc = 30, 24, 2, 3, 7, 0.9
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    light_color = np.array([(0., 0., 0.), (lc, lc, 1.), (lc, 1., lc), (1., lc, lc), (0.8, 1., 1.), (1., lc, 1.), (1., 1., lc)])
+
+    dist_type = ["chebyshev", "manhattan", "euclidean", "Lp", "Lp", "Lp"]
+    lp = [0, 0, 0, 10, 50, 100]
+
+    font = {'family': 'serif',
+        'color': 'black',
+        'weight': 'normal',
+        'size': 8,
+        }
+
+    plt.figure()
+    fig, axe = plt.subplots(7, group_upper - group_lower)
+    axe = axe.reshape(7, -1)
+
+
+    for i in range(group_lower, group_upper):
+        _train_num = i * train_num
+        data_groups, label_groups = gen_on_circle(i, data_dist, data_num, [[1, 0], [0, 1]])
+        for j in range(i):
+            axe[0, i - group_lower].scatter(data_groups[j, :, 0], data_groups[j, :, 1], color = color[j + 1], s = 4.)
+        axe[0, i - group_lower].set_title("n = %d" % i, fontdict = font)
+
+        data, label = [], []
+        for j in range(i):
+            data = np.append(data, data_groups[j])
+            label = np.append(label, label_groups[j])
+        data = data.reshape(-1, 2)
+        label = label.reshape(-1).astype(np.int64)
+
+        index = [i for i in range(len(data))]
+        np.random.shuffle(index)
+        data = data[index]
+        label = label[index]
+
+        for T in range(6):
+            model = KNN(dist_type = dist_type[T], lp = lp[T])
+            model.fit(data[:_train_num], label[:_train_num])
+            res = model.predict(data[_train_num:])
+            axe[T + 1, i - group_lower].scatter(data[:_train_num, 0], data[:_train_num, 1], color = light_color[label[:_train_num] + 1], s = 4.)
+            axe[T + 1, i - group_lower].scatter(data[_train_num:, 0], data[_train_num:, 1], color = color[(res == label[_train_num:]) * (1 + label[_train_num:])], s = 4.)
+            acc = np.mean(np.equal(res, label[_train_num:]))
+            # axe[T + 1, i - group_lower].set_xlabel("acc = %.3lf, k = %d" % (acc, model.best_k), fontdict = font)
+
+            print("acc =", acc)
+
+    plt.savefig(gen_random_filename(), format = 'png', dpi = 1000)
+
+
+def KDT_test():
+    data_num = 30
+    data = np.random.multivariate_normal((1000., 1000.), [[2000., 0.], [0., 2000.]], data_num)
+    label = np.zeros(data_num)
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    for i in range(data_num):
+        label[i] = np.random.randint(7)
+    label = np.array(label).astype(np.int64)
+    plt.figure()
+    plt.subplot()
+    plt.scatter(data[:, 0], data[:, 1], color = color[label], s = 10.)
+
+
+    kdt = KDT(data, label, 2, 1150, 850, 850, 1150, True) # pass True to draw the KDT illustration; the data must be two-dimensional, otherwise it cannot be drawn correctly in the plane
+
+    query = np.random.multivariate_normal((1000., 1000.), [[1000., 0.], [0., 1000.]], 1).reshape(-1)
+    plt.scatter(query[0], query[1], color = 'g', s = 10.)
+ plt.show() + #plt.savefig(gen_random_filename(), format = 'png', dpi = 1000) + print(query) + + print(kdt.get(query, 5)) + +# KDT_test() +# test_distance_type() +# test_on_circle() + + diff --git a/assignment-1/submission/19307130062/source.py~ b/assignment-1/submission/19307130062/source.py~ new file mode 100644 index 0000000000000000000000000000000000000000..2e59791d3847fabe7565bd0e12d60336513cc69f --- /dev/null +++ b/assignment-1/submission/19307130062/source.py~ @@ -0,0 +1,527 @@ +import numpy as np +import matplotlib.pyplot as plt +import numpy.random as rd + +class Heap: + def __init__(self): + self.node = [0] + self.label = [0] + self.num = 0 + self.tail = 0 + + def top(self): + return self.node[1] + + def pop(self): + """ + 弹出堆内元素,我们先把最后一个元素放到堆顶,再自顶向下交换元素,使堆保持大根性质 + """ + self.node[1] = self.node[self.num] + self.label[1] = self.label[self.num] + self.num -= 1 + now = 1 + while True: + nex, mx = -1, 0 + if now << 1 <= self.num: + if self.node[now << 1] > mx or nex == -1: + mx = self.node[now << 1] + nex = now << 1 + if now << 1 | 1 <= self.num: + if self.node[now << 1 | 1] > mx or nex == -1: + mx = self.node[now << 1 | 1] + nex = now << 1 | 1 + if nex == -1 or mx < self.node[now]: + return + else: + self.node[now], self.node[nex] = self.node[nex], self.node[now] + self.label[now], self.label[nex] = self.label[nex], self.label[now] + now = nex + + def push(self, x, y): + """ + 向堆中插入元素,自底向上交换元素,使堆保持大根性质 + """ + self.num += 1 + if self.num > self.tail: + self.node.append(x) + self.label.append(y) + self.tail += 1 + else: + self.node[self.num] = x + self.label[self.num] = y + now = self.num + while True: + if now > 1 and self.node[now >> 1] < self.node[now]: + self.node[now >> 1], self.node[now] = self.node[now], self.node[now >> 1] + self.label[now >> 1], self.label[now] = self.label[now], self.label[now >> 1] + now >>= 1 + else: + return + + def size(self): + return self.num + + def clear(self): + self.node = [0] + self.label = [0] + self.num = 0 + 
self.tail = 0 + +class node: + def __init__(self, dim): + self.father = None + self.left = None + self.right = None + self.label = None + self.mi = np.zeros(dim) + self.mx = np.zeros(dim) + self.data = None + for i in range(dim): + self.mi[i], self.mx[i] = 2e18, -2e18 + +class KDT: + + + def __init__(self, data, label, dim, u, d, l, r, visualize = False): + self.node_count = 0 + self.dim = dim + self.root = self.new_node() + self.build(data, label, self.root, u, d, l, r, visualize) + self.heap = Heap() + + def new_node(self): + self.node_count += 1 + return node(self.dim) + + def build(self, data, label, now, u, d, l, r, visualize): + """ 建树函数 + u, d, l, r, visualize 用于控制绘制 KDT 示意图,不影响算法实际运行 + data, label 分别表示当前子树中包含的数据和标签 + now 表示当前子树对应的 KDT 上结点 + """ + split_dim = np.argmax(np.array([np.var(data[:, dim]) for dim in range(self.dim)])) # 找出方差最大的一维,作为分割维 + split_value = np.median(data[:, split_dim]) # 找出分割维的中位数 + data_left, label_left = data[data[:, split_dim] < split_value], label[data[:, split_dim] < split_value] # 数据的左半部分 + data_right, label_right = data[data[:, split_dim] > split_value], label[data[:, split_dim] > split_value] # 数据的右半部分 + data_mid, label_mid = data[data[:, split_dim] == split_value], label[data[:, split_dim] == split_value] # 当前结点上的数据 + now.data, now.label = data_mid, label_mid + + if visualize: # 可视化部分,可以画出 KDT 分割空间的示意图 + if split_dim == 0: + plt.plot([split_value, split_value], [d, u]) + elif split_dim == 1: + plt.plot([l, r], [split_value, split_value]) + + if data_mid.size: + for dim in range(data_mid.shape[1]): + now.mx[dim] = max(data_mid[:, dim]) + now.mi[dim] = min(data_mid[:, dim]) + if data_left.size != 0: # 递归建树,求出每个子树中每一维的极值,用于之后的距离下界判断和剪枝 + now.left = self.new_node() + now.left.father = now + + if split_dim == 0: + mx_left, mi_left = self.build(data_left, label_left, now.left, u, d, l, split_value, visualize) + elif split_dim == 1: + mx_left, mi_left = self.build(data_left, label_left, now.left, split_value, d, l, r, visualize) + 
else: + mx_left, mi_left = self.build(data_left, label_left, now.left, u, d, l, r, visualize) + + for dim in range(self.dim): + now.mx[dim], now.mi[dim] = max(now.mx[dim], mx_left[dim]), min(now.mi[dim], mi_left[dim]) + if data_right.size != 0: + now.right = self.new_node() + now.right.father = now + + if split_dim == 0: + mx_right, mi_right = self.build(data_right, label_right, now.right, u, d, split_value, r, visualize) + elif split_dim == 1: + mx_right, mi_right = self.build(data_right, label_right, now.right, u, split_value, l, r, visualize) + else: + mx_right, mi_right = self.build(data_right, label_right, now.right, u, d, l, r, visualize) + + for dim in range(self.dim): + now.mx[dim], now.mi[dim] = max(now.mx[dim], mx_right[dim]), min(now.mi[dim], mi_right[dim]) + return now.mx, now.mi + + def dist(self, x, y): + return np.sqrt(sum((x - y)**2)) + + def plane_dist(self, x, mi, mx): + """ + 计算询问点和子树内数据的距离下界,可以看作超平面距离 + """ + res = 0. + for dim in range(self.dim): + if mi[dim] <= x[dim] <= mx[dim]: + res += 0. 
+ else: + res += min(abs(x[dim] - mx[dim]), abs(x[dim] - mi[dim]))**2 + return np.sqrt(res) + + def traverse(self, now): + """ + 树上遍历,寻找 K 近邻 + """ + if now.data.size: + mi_idx = np.argmin(np.array([self.dist(data, self.query) for data in now.data])) + result = now.data[mi_idx] + mi_dist = self.dist(self.query, result) + if self.heap.size() < self.query_k: + self.heap.push(mi_dist, now.label[mi_idx]) + elif self.heap.top() > mi_dist: + self.heap.pop() + self.heap.push(mi_dist, now.label[mi_idx]) + + if now.left != None: + if self.heap.size() < self.query_k or self.plane_dist(self.query, now.left.mi, now.left.mx) < self.heap.top(): + self.traverse(now.left) + if now.right != None: + if self.heap.size() < self.query_k or self.plane_dist(self.query, now.right.mi, now.right.mx) < self.heap.top(): + self.traverse(now.right) + + def get(self, data, k): + self.query, self.query_k = data, k + self.heap.clear() + self.traverse(self.root) + + return self.heap.node[1:], self.heap.label[1:] + +class KNN: + + # knn_algo KNN 的底层算法 + # k_best 最优超参 k + # k_lower 扫描超参 k 的下界 + # k_upper 扫描超参 k 的上界(不含) + # k_step 扫描超参 k 的步长 + + def distance(self, x, y, type = "euclidean", lp = -1.): + """ + 支持多样化距离的计算,包括欧几里得距离、曼哈顿距离、切比雪夫距离 + 以及更一般的 Lp 范数(要求 p 为正实数,且不超过 100),操作不合法时 + 返回 0. + """ + if type == "euclidean": + return np.sqrt(sum((x - y)**2)) + elif type == "manhattan": + return sum(abs(x - y)) + elif type == "chebyshev": + return max(abs(x - y)) + elif type == "Lp": + if 0 < lp <= 100.: + return sum((abs(x - y)) ** lp) ** (1. / lp) + else: + print("Error: Lp-norm is illegal.") + return 0. + else: + print("Error: Can't identify the type of distance.") + return 0. + + + + def assess(self, train_label, test_label, type = "accuracy"): + """ + 对标签预测结果进行评估 + """ + if type == "accuracy": + return sum(train_label == test_label) / train_label.shape[0] + else: + print("Error: Can't identify the type of assessment.") + return -1. 
+
+
+
+    def __init__(self, algo = "brute_force", lower = 1, upper = 20, step = 1, dist_type = "euclidean", lp = 2.):
+        """
+        Choose the underlying KNN algorithm at initialization: brute force or KD-tree
+        """
+        if algo == "brute_force":
+            self.knn_algo = "brute_force"
+        elif algo == "kd_tree":
+            self.knn_algo = "kd_tree"
+        else:
+            print("Error: Can't identify the algorithm of KNN.")
+
+        self.best_k = 1
+        self.k_lower = lower
+        self.k_upper = upper
+        self.k_step = step
+        self.dist_type = dist_type
+        self.lp = lp
+
+
+
+    def run(self, train_data, train_label, test_data, k):
+        """
+        Core of the KNN algorithm
+        """
+        result = []
+        if self.knn_algo == "brute_force":
+            for datax in test_data:
+                train_data_dist = np.array([self.distance(datax, datay, self.dist_type, self.lp) for datay in train_data])
+                k_nearest_label = train_label[np.argsort(train_data_dist)[:min(k, len(train_data_dist))]]
+                result.append(np.argmax(np.bincount(k_nearest_label)))
+            return np.array(result)
+        elif self.knn_algo == "kd_tree":
+            self.kdt = KDT(train_data, train_label, train_data.shape[1], 6, -6, -6, 6)
+            for datax in test_data:
+                dist_res, label_res = self.kdt.get(datax, k)
+                result.append(np.argmax(np.bincount(label_res)))
+            return np.array(result)
+        else:
+            print("Error: Can't identify the algorithm of KNN.")
+            return
+
+
+    def fit(self, train_data, train_label, preprocess = "none"):
+        """
+        Convert the data to numpy.ndarray and shuffle it randomly
+        """
+        train_data, train_label = np.array(train_data), np.array(train_label)
+        index = [i for i in range(len(train_data))]
+        np.random.shuffle(index)
+        train_data = train_data[index]
+        train_label = train_label[index]
+        """
+        Preprocessing can use z_score or min_max, or be skipped entirely.
+        Note that preprocessing is applied to each dimension separately
+        """
+        # Per-dimension statistics, reused later by predict()
+        n_dim = train_data.shape[1]
+        self.mi, self.mx = np.zeros(n_dim), np.zeros(n_dim)
+        self.mu, self.std = np.zeros(n_dim), np.zeros(n_dim)
+        if preprocess == "min_max":
+            for dim in range(train_data.shape[1]):
+                self.mx[dim] = max(train_data[:, dim])
+                self.mi[dim] = min(train_data[:, dim])
+                train_data[:, dim] = (train_data[:, dim] - self.mi[dim]) / (self.mx[dim] - self.mi[dim])
+        elif preprocess == "z_score":
+            for dim in range(train_data.shape[1]):
+                self.mu[dim] = np.mean(train_data[:, dim])
+                self.std[dim] = np.std(train_data[:, dim])
+                train_data[:, dim] = (train_data[:, dim] - self.mu[dim]) / self.std[dim]
+        elif preprocess == "none":
+            pass
+        else:
+            print("Error: Can't identify the algorithm of preprocessing.")
+        """
+        Save the required data in the object's member variables
+        """
+        self.preprocess = preprocess
+        self.train_data = train_data
+        self.train_label = train_label
+        """
+        Split the data 4:1 into a training set and a validation set, search over the
+        hyperparameter k in the given range, select the best k, and print the
+        accuracies to observe the training progress
+
+        Of course, KNN has no real training process; it simply memorizes the
+        training set. The term is borrowed here for convenience
+
+        A method similar to K-fold cross-validation is used to compute the average accuracy
+        """
+        self.best_acc = 0.
+        training_epoch = 5
+        dim = train_data.shape[1]
+        for now_k in range(self.k_lower, self.k_upper, self.k_step):
+            tot_acc = 0
+            for epoch in range(training_epoch):
+                l = (train_data.shape[0] * epoch) // training_epoch
+                r = (train_data.shape[0] * (epoch + 1)) // training_epoch
+                acc = self.assess(self.run(
+                    train_data = np.append(train_data[0:l], train_data[r:]).reshape(-1, dim),
+                    train_label = np.append(train_label[0:l], train_label[r:]).reshape(-1),
+                    test_data = train_data[l:r],
+                    k = now_k),
+                    test_label = train_label[l:r])
+                tot_acc += acc
+                print("\taccuracy on epoch %d is: %f" % (epoch, acc))
+            tot_acc /= training_epoch
+            if tot_acc > self.best_acc:
+                self.best_k, self.best_acc = now_k, tot_acc
+            print("accuracy on k = %d is: %f\n" % (now_k, tot_acc))
+
+
+
+    def predict(self, test_data):
+        test_data = np.array(test_data)
+
+        # Apply the statistics computed in fit() to the test data
+        if self.preprocess == "min_max":
+            for dim in range(test_data.shape[1]):
+                test_data[:, dim] = (test_data[:, dim] - self.mi[dim]) / (self.mx[dim] - self.mi[dim])
+        elif self.preprocess == "z_score":
+            for dim in range(test_data.shape[1]):
+                test_data[:, dim] = (test_data[:, dim] - self.mu[dim]) / self.std[dim]
+        elif self.preprocess == "none":
+            pass
+        else:
+            print("Error: Can't identify the algorithm of preprocessing.")
+
+        result = self.run(self.train_data, self.train_label, test_data, self.best_k)
+        return result
+
+
+
+def gen(num, mean, cov):
+    return np.random.multivariate_normal(mean, cov, num)
+
+def gen_on_circle(group_num, dist, num, cov):
+    """
+    Place the centers evenly on a circle and generate the data clusters
+    """
+    data = np.zeros((group_num, num, 2))
+    label = np.zeros((group_num, num), dtype = 'int64')
+    for i in range(group_num):
+        data[i] = gen(num, [dist * np.cos(2 * np.pi * i / group_num), dist * np.sin(2 * np.pi * i / group_num)], cov)
+        label[i] += i
+    return data, label
+
+def gen_on_square(dist, num, cov):
+    """
+    Place the centers at the four corners of a square and generate the data clusters
+    """
+    data = np.zeros((2, 2, num, 2))
+    label = np.zeros((2, 2, num), dtype = 'int64')
+    for i in range(2):
+        for j in range(2):
+            data[i, j] = gen(num, [dist * i, dist * j], cov)
+            label[i, j] += (i * 2 + j)
+    return data, label
+
+def gen_random_char():
+    x = rd.randint(0, 35)
+    if x < 10:
+        return chr(ord('0') + x)
+    else:
+        return chr(ord('a') + x - 10)
+
+def gen_random_filename():
+    """
+    Generate a random string of lowercase letters and digits, used as the
+    file name of a saved figure to avoid collisions
+    """
+    s = ""
+    for i in range(20):
+        s += gen_random_char()
+    return s
+
+def test_on_circle():
+    data_num, train_num, data_dist, group_lower, group_upper, lc = 300, 240, 3, 3, 7, 0.9
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    light_color = np.array([(0., 0., 0.), (lc, lc, 1.), (lc, 1., lc), (1., lc, lc), (0.8, 1., 1.), (1., lc, 1.), (1., 1., lc)])
+    plt.figure()
+    fig, axe = plt.subplots(2, group_upper - group_lower)
+    axe = axe.reshape(2, -1)
+
+    font = {'family': 'serif',
+        'color': 'black',
+        'weight': 'normal',
+        'size': 8,
+    }
+
+    for i in range(group_lower, group_upper):
+        _train_num = i * train_num
+        data_groups, label_groups = gen_on_circle(i, data_dist, data_num, [[1, 0], [0, 1]])
+        for j in range(i):
+            axe[0, i - group_lower].scatter(data_groups[j, :, 0], data_groups[j, :, 1], color = color[j + 1], s = 4.)
+        axe[0, i - group_lower].set_title("n = %d" % i, fontdict = font)
+
+        data, label = [], []
+        for j in range(i):
+            data = np.append(data, data_groups[j])
+            label = np.append(label, label_groups[j])
+        data = data.reshape(-1, 2)
+        label = label.reshape(-1).astype(np.int64)
+
+        index = [i for i in range(len(data))]
+        np.random.shuffle(index)
+        data = data[index]
+        label = label[index]
+
+        model = KNN(algo = "kd_tree")
+        model.fit(data[:_train_num], label[:_train_num])
+        res = model.predict(data[_train_num:])
+        axe[1, i - group_lower].scatter(data[:_train_num, 0], data[:_train_num, 1], color = light_color[label[:_train_num] + 1], s = 4.)
+        axe[1, i - group_lower].scatter(data[_train_num:, 0], data[_train_num:, 1], color = color[(res == label[_train_num:]) * (1 + label[_train_num:])], s = 4.)
+        acc = np.mean(np.equal(res, label[_train_num:]))
+        axe[1, i - group_lower].set_xlabel("acc = %.3lf, k = %d" % (acc, model.best_k), fontdict = font)
+
+        print("acc =", acc)
+
+    plt.savefig(gen_random_filename(), format = 'png', dpi = 1000)
+
+
+def test_distance_type():
+    data_num, train_num, data_dist, group_lower, group_upper, lc = 30, 24, 2, 3, 7, 0.9
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    light_color = np.array([(0., 0., 0.), (lc, lc, 1.), (lc, 1., lc), (1., lc, lc), (0.8, 1., 1.), (1., lc, 1.), (1., 1., lc)])
+
+    dist_type = ["chebyshev", "manhattan", "euclidean", "Lp", "Lp", "Lp"]
+    lp = [0, 0, 0, 10, 50, 100]
+
+    font = {'family': 'serif',
+        'color': 'black',
+        'weight': 'normal',
+        'size': 8,
+    }
+
+    plt.figure()
+    fig, axe = plt.subplots(7, group_upper - group_lower)
+    axe = axe.reshape(7, -1)
+
+
+    for i in range(group_lower, group_upper):
+        _train_num = i * train_num
+        data_groups, label_groups = gen_on_circle(i, data_dist, data_num, [[1, 0], [0, 1]])
+        for j in range(i):
+            axe[0, i - group_lower].scatter(data_groups[j, :, 0], data_groups[j, :, 1], color = color[j + 1], s = 4.)
+        axe[0, i - group_lower].set_title("n = %d" % i, fontdict = font)
+
+        data, label = [], []
+        for j in range(i):
+            data = np.append(data, data_groups[j])
+            label = np.append(label, label_groups[j])
+        data = data.reshape(-1, 2)
+        label = label.reshape(-1).astype(np.int64)
+
+        index = [i for i in range(len(data))]
+        np.random.shuffle(index)
+        data = data[index]
+        label = label[index]
+
+        for T in range(6):
+            model = KNN(dist_type = dist_type[T], lp = lp[T])
+            model.fit(data[:_train_num], label[:_train_num])
+            res = model.predict(data[_train_num:])
+            axe[T + 1, i - group_lower].scatter(data[:_train_num, 0], data[:_train_num, 1], color = light_color[label[:_train_num] + 1], s = 4.)
+            axe[T + 1, i - group_lower].scatter(data[_train_num:, 0], data[_train_num:, 1], color = color[(res == label[_train_num:]) * (1 + label[_train_num:])], s = 4.)
+            acc = np.mean(np.equal(res, label[_train_num:]))
+            # axe[T + 1, i - group_lower].set_xlabel("acc = %.3lf, k = %d" % (acc, model.best_k), fontdict = font)
+
+            print("acc =", acc)
+
+    plt.savefig(gen_random_filename(), format = 'png', dpi = 1000)
+
+
+def KDT_test():
+    data_num = 30
+    data = np.random.multivariate_normal((1000., 1000.), [[2000., 0.], [0., 2000.]], data_num)
+    label = np.zeros(data_num)
+    color = np.array([(0., 0., 0.), (0., 0., 1.), (0., 1., 0.), (1., 0., 0.), (0., 1., 1.), (1., 0., 1.), (1., 1., 0.)])
+    for i in range(data_num):
+        label[i] = np.random.randint(7)
+    label = np.array(label).astype(np.int64)
+    plt.figure()
+    plt.subplot()
+    plt.scatter(data[:, 0], data[:, 1], color = color[label], s = 10.)
+
+
+    kdt = KDT(data, label, 2, 1150, 850, 850, 1150, True) # Set to True to draw the KD-tree diagram; this requires 2-D data, otherwise it cannot be drawn correctly on the plane
+
+    query = np.random.multivariate_normal((1000., 1000.), [[1000., 0.], [0., 1000.]], 1).reshape(-1)
+    plt.scatter(query[0], query[1], color = 'g', s = 10.)
+    plt.show()
+    #plt.savefig("1", format = 'png', dpi = 1000)
+    print(query)
+
+    print(kdt.get(query, 5))
+
+# KDT_test()
+# test_distance_type()
+# test_on_circle()
+
+
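Review note: `KDT_test()` only prints the result of `kdt.get(query, 5)`, so the KD-tree output is never compared against a known-good answer. One possible check is a brute-force reference, sketched below. This is not part of the submission: the name `brute_force_knn` and its signature are hypothetical, and the commented comparison assumes `KDT.get` returns the heap contents as (distances, labels) in no particular order.

```python
import numpy as np

def brute_force_knn(data, label, query, k, p=2.0):
    """Return the k smallest Lp distances from `query` and the matching labels.

    Hypothetical helper for testing only; p=np.inf gives the Chebyshev distance.
    """
    diff = np.abs(np.asarray(data, dtype=float) - np.asarray(query, dtype=float))
    if np.isinf(p):
        dist = diff.max(axis=1)
    else:
        dist = (diff ** p).sum(axis=1) ** (1.0 / p)
    # A stable sort keeps ties in their original order, so results are reproducible.
    order = np.argsort(dist, kind="stable")[:k]
    return dist[order], np.asarray(label)[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(1000.0, 45.0, size=(30, 2))
    label = rng.integers(0, 7, size=30)
    query = np.array([1000.0, 1000.0])
    ref_dist, ref_label = brute_force_knn(data, label, query, 5)
    print(ref_dist, ref_label)
    # Assumed usage with the KD-tree from this file (not executed here):
    #   kd_dist, kd_label = kdt.get(query, 5)
    #   assert np.allclose(np.sort(kd_dist), ref_dist)
```

Comparing the sorted distances rather than the labels avoids false failures when two training points are equally far from the query.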