diff --git a/assignment-1/submission/18307130116/README.md b/assignment-1/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..142f441ff2c2994c3d62c3cbb861fd7c637837f8
--- /dev/null
+++ b/assignment-1/submission/18307130116/README.md
@@ -0,0 +1,235 @@
+# KNN Classifier
+
+[toc]
+
+## Dependencies
+
+`numpy`
+
+`matplotlib`
+
+## Function Overview
+
+### KNN
+
+**`fit(self, train_data, train_label)`**
+
+`train_data`: the training points
+
+`train_label`: the training labels
+
+**Summary:** `fit` holds out 10% of the training set to search for the K that maximizes accuracy. If the training set has fewer than 10 points, it defaults to `K = 1`; otherwise it picks the K in 1 to 10 with the highest held-out accuracy and uses it at prediction time.
+
+---
+
+**`predict(self, test_data)`**
+
+**Summary:** predicts labels with the K learned in the previous step.
+
+### Experiment and Helper Functions
+
+**`distance(point1, point2, method="Euclid")`**
+
+`point1` and `point2`: the two points whose distance is computed
+
+`method`: the distance metric; the default is Euclidean distance, and `Manhattan` selects Manhattan distance
+
+**Summary:** the function first normalizes its inputs to [m, 1] vectors, then computes the distance between the two points under the chosen metric.
+
+---
+
+**`dis(dis_label)`**
+
+**Summary:** the `key` function for `sort`; extracts the distance from a (distance, label) pair.
+
+---
+
+**`nearest_k_label_max(point, point_arr, label_arr, k)`**
+
+`point`: the target point whose k nearest neighbors are sought
+
+`point_arr`: the existing point set
+
+`label_arr`: the labels of the existing points
+
+`k`: the number of nearest neighbors to consider
+
+**Summary:** computes the distance from the target point to every point in the set, finds the K nearest points, and returns the most frequent `label` among them.
+
+---
+
+**`data_generate_and_save(class_num, mean_list, cov_list, num_list, save_path = "")`**
+
+`class_num`: the number of classes
+
+`mean_list`: the mean of each class's Gaussian distribution
+
+`cov_list`: the covariance matrix of each class
+
+`num_list`: num_list[i] is the number of points in class i
+
+`save_path`: where to store the generated points, defaulting to `data.npy` in the current directory; the path must end with a slash
+
+**Summary:** calls `numpy.random.multivariate_normal` to generate the requested number of points, shuffles them, splits 80% into training data and 20% into test data, and saves them as the tuple `((train_data, train_label), (test_data, test_label))`.
+
+---
+
+**`data_load(path = "")`**
+
+`path`: where to load the points from, defaulting to `data.npy` in the current directory; the path must end with a slash
+
+**Summary:** the point set must have been saved as the tuple `((train_data, train_label), (test_data, test_label))`.
+
+---
+
+**`visualize(data, label, class_num = 1, test_data=[])`**
+
+*Visualization currently supports two dimensions only; for higher-dimensional points, only the first two dimensions are plotted.*
+
+`data`: the training point coordinates
+
+`label`: the labels of the training points
+
+`class_num`: the total number of classes, default 1
+
+`test_data`: the test point coordinates
+
+**Summary:** draws a scatter plot of the point set, with each class automatically shown in a distinct color; test points are drawn with a "+" marker.
+
+## Experiments
+
+First, we generated three classes of points, 100 per class.
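The `fit`/`predict` behaviour documented above can be illustrated with a small, self-contained sketch. This is not the submitted `source.py`: it uses only the standard library, and the class and method names merely mirror the interface described in this README.

```python
import math
from collections import Counter

class KNN:
    """Stdlib-only sketch of the fit/predict interface described above."""

    def fit(self, train_data, train_label):
        # Hold out 10% as a dev split; fall back to k = 1 when it is empty.
        self.train_data, self.train_label = train_data, train_label
        n = len(train_data)
        dev_n = n // 10
        if dev_n == 0:
            self.k = 1
            return
        dev_d, dev_l = train_data[:dev_n], train_label[:dev_n]
        tr_d, tr_l = train_data[dev_n:], train_label[dev_n:]

        def dev_accuracy(k):
            hits = sum(self._vote(p, tr_d, tr_l, k) == l for p, l in zip(dev_d, dev_l))
            return hits / dev_n

        # max() keeps the first (smallest) k among ties, matching the
        # "prefer the smaller k at equal accuracy" rule used below.
        self.k = max(range(1, min(10, n - dev_n) + 1), key=dev_accuracy)

    def _vote(self, point, data, labels, k):
        # Majority label among the k nearest neighbors (Euclidean distance).
        nearest = sorted(zip(data, labels), key=lambda dl: math.dist(point, dl[0]))[:k]
        return Counter(l for _, l in nearest).most_common(1)[0][0]

    def predict(self, test_data):
        return [self._vote(p, self.train_data, self.train_label, self.k) for p in test_data]
```

The dev split is only taken when at least 10 training points are available, matching the fallback to `K = 1` described above.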
+
+The corresponding means and covariance matrices are listed below.
+
+|         | Mean    | Covariance matrix |
+| ------- | ------- | ----------------- |
+| class 1 | (1, 2)  | [[10, 0], [0, 2]] |
+| class 2 | (4, 5)  | [[7, 3], [15, 1]] |
+| class 3 | (-2, 6) | [[0, 1], [1, 2]]  |
+
+We measured the accuracy for k = 1 to 10, shown below.
+
+![k1](img/k1.png)
+
+Preferring the smaller value at equal accuracy, we chose k = 5, reaching a prediction accuracy of 83.3%. The corresponding data are visualized below.
+
+![Figure_1](img/Figure_1.png)
+
+### Experiment 1: reducing overlap between point sets
+
+The figure above shows that the three colored point sets are largely separated, but some overlap remains. We conjecture that the overlapping region degrades KNN, and verify this below by changing the means and covariances.
+
+First, we change the covariances to
+
+|         | Mean    | Covariance matrix |
+| ------- | ------- | ----------------- |
+| class 1 | (1, 2)  | [[1, 0], [0, 1]]  |
+| class 2 | (4, 5)  | [[1, 0], [0, 1]]  |
+| class 3 | (-2, 6) | [[1, 0], [0, 1]]  |
+
+The accuracy-vs-K curve and the point distribution are shown below.
+
+![Figure_2_1](img/Figure_2_1.png) ![Figure_2_2](img/Figure_2_2.png)
+
+With K = 3, KNN accuracy rises to 96.7%, as expected.
+
+Similarly, we change the means so that the Gaussians are as far apart as possible.
+
+|         | Mean      | Covariance matrix |
+| ------- | --------- | ----------------- |
+| class 1 | (-10, 2)  | [[10, 0], [0, 2]] |
+| class 2 | (4, 5)    | [[7, 3], [15, 1]] |
+| class 3 | (-2, -16) | [[0, 1], [1, 2]]  |
+
+The curves are shown below; accuracy reaches its maximum of 1.0 already at K = 1.
+
+![Figure_2_3](img/Figure_2_3.png) ![Figure_2_4](img/Figure_2_4.png)
+
+#### Conclusion
+
+This experiment clearly shows how the point distribution affects KNN accuracy: when the classes overlap little, accuracy improves markedly.
+
+### Experiment 2: choice of distance metric
+
+The experiments above used Euclidean distance. Here we switch to Manhattan distance and examine its effect.
+
+When the point sets are well separated, Manhattan and Euclidean distance give similar accuracy, so those results are omitted. For heavily overlapping point sets, we generated several datasets from the following distribution.
+
+|         | Mean   | Covariance matrix |
+| ------- | ------ | ----------------- |
+| class 1 | (1, 4) | [[10, 0], [0, 2]] |
+| class 2 | (2, 5) | [[7, 3], [15, 1]] |
+| class 3 | (2, 6) | [[0, 1], [1, 2]]  |
+
+The chosen k and the accuracy (acc) are listed below.
+
+| Euclidean distance | Manhattan distance |
+| ------------------ | ------------------ |
+| k = 3, acc = 0.7   | k = 3, acc = 0.683 |
+| k = 1, acc = 0.53  | k = 1, acc = 0.483 |
+| k = 7, acc = 0.63  | k = 8, acc = 0.567 |
+
+Overall, when the point sets overlap heavily, Euclidean distance outperforms Manhattan distance. We conjecture that for Gaussian-generated points, Euclidean distance penalizes a large gap in a single dimension more strongly than Manhattan distance, which matches how the points are generated and better fits the local probability density, yielding higher accuracy.
+
+#### Conclusion
+
+When the point sets are well separated, Manhattan and Euclidean distance perform similarly; when they overlap heavily, Euclidean distance outperforms Manhattan distance.
+
+### Experiment 3: number of points
+
+For the following distribution
+
+|         | Mean    | Covariance matrix |
+| ------- | ------- | ----------------- |
+| class 1 | (1, 4)  | [[10, 0], [0, 2]] |
+| class 2 | (2, -3) | [[7, 3], [15, 1]] |
+| class 3 | (2, 5)  | [[0, 1], [1, 2]]  |
+
+we generated four groups with per-class counts [100, 100, 100], [100, 10, 100], [100, 50,
200], and [200, 200, 200], repeating each group several times to avoid chance error.
+
+The results are listed below.
+
+|      | [100, 100, 100] | [100, 10, 100] | [100, 50, 200] | [200, 200, 200] |
+| ---- | --------------- | -------------- | -------------- | --------------- |
+| 1    | 0.867           | 0.809          | 0.886          | 0.875           |
+| 2    | 0.800           | 0.809          | 0.843          | 0.825           |
+| 3    | 0.867           | 0.809          | 0.857          | 0.9             |
+| 4    | 0.917           | 0.761          | 0.886          | 0.792           |
+| mean | 0.862           | 0.797          | 0.868          | 0.848           |
+
+#### Conclusion
+
+As the number of points grows, the overlap area grows and accuracy drops accordingly. When one class has far fewer points than the others, accuracy is noticeably affected; when the gap is extreme, the task partly degenerates into an (N-1)-class problem, which can actually raise the measured accuracy.
+
+### Experiment 4: feature scales
+
+When the dimensions have mismatched scales, for example an (age, wealth) pair, Euclidean distance in this space effectively degenerates into the distance along the largest-scale dimension. To examine this effect, we generated the following data.
+
+|         | Mean     | Covariance matrix     |
+| ------- | -------- | --------------------- |
+| class 1 | (1, 400) | [[10, 0], [0, 20000]] |
+| class 2 | (2, 300) | [[7, 0], [0, 10000]]  |
+| class 3 | (2, 300) | [[1, 0], [0, 10000]]  |
+
+The k curve and the point distribution of one run are shown below; the accuracy averaged over several runs is 0.399.
+
+![Figure_5_1](img/Figure_5_1.png) ![Figure_5_2](img/Figure_5_2.png)
+
+For comparison, we proportionally scale the second dimension down by a factor of 100:
+
+|         | Mean   | Covariance matrix |
+| ------- | ------ | ----------------- |
+| class 1 | (1, 4) | [[10, 0], [0, 2]] |
+| class 2 | (2, 3) | [[7, 0], [15, 1]] |
+| class 3 | (2, 3) | [[1, 0], [0, 1]]  |
+
+The corresponding k curve and point set are visualized below.
+
+![Figure_6_1](img/Figure_6_1.png) ![Figure_6_2](img/Figure_6_2.png)
+
+The accuracy averaged over several runs is 0.539.
+
+#### Conclusion
+
+Scale normalization strongly affects accuracy: after proportionally rescaling the dimensions, accuracy improves substantially. Given the earlier results on point distributions, however, we conjecture that normalization matters little when the point sets themselves are well separated.
\ No newline at end of file
diff --git a/assignment-1/submission/18307130116/img/Figure_1.png b/assignment-1/submission/18307130116/img/Figure_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..b840aa5b2862be15a71968435433efc147086318
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_1.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_2_1.png b/assignment-1/submission/18307130116/img/Figure_2_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..5e2b73e556a36aa5db294e9c2c42fc039728279d
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_2_1.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_2_2.png
b/assignment-1/submission/18307130116/img/Figure_2_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..3c6ec2fa9693474116ae15a76359f69b442d99b1
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_2_2.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_2_3.png b/assignment-1/submission/18307130116/img/Figure_2_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..a893f35d277af8c818a69f49cee5e2bbe06c2367
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_2_3.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_2_4.png b/assignment-1/submission/18307130116/img/Figure_2_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..34e3cb5e2c15ae4104a1f12fbd9ef62af24cb03e
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_2_4.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_5_1.png b/assignment-1/submission/18307130116/img/Figure_5_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..09921dca1bbeebae81d5b0f71eafe9ab0f0ce75a
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_5_1.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_5_2.png b/assignment-1/submission/18307130116/img/Figure_5_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..18ed90b7cd1ec5f2c91a863393b21b655b040eb6
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_5_2.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_6_1.png b/assignment-1/submission/18307130116/img/Figure_6_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..6fdc07c00f7cfdcead4f8cf98880ce1cd76f9526
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_6_1.png differ
diff --git a/assignment-1/submission/18307130116/img/Figure_6_2.png
b/assignment-1/submission/18307130116/img/Figure_6_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..72685efbfd9bc42f675811e5f92bf88c6bbc3851
Binary files /dev/null and b/assignment-1/submission/18307130116/img/Figure_6_2.png differ
diff --git a/assignment-1/submission/18307130116/img/k1.png b/assignment-1/submission/18307130116/img/k1.png
new file mode 100644
index 0000000000000000000000000000000000000000..8a81a8e624428a86d14851ca1a9848cf11c61be0
Binary files /dev/null and b/assignment-1/submission/18307130116/img/k1.png differ
diff --git a/assignment-1/submission/18307130116/source.py b/assignment-1/submission/18307130116/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..4daa13c95a45ed7371bb33f20bdd2f4d821894ae
--- /dev/null
+++ b/assignment-1/submission/18307130116/source.py
@@ -0,0 +1,154 @@
+import numpy as np
+import matplotlib.pyplot as plt
+import matplotlib.cm as cm
+
+def distance(point1, point2, method="Euclid"):
+    """
+    Suppose the dimension of the points is m * 1.
+    """
+    # Normalize both inputs to column vectors of shape (m, 1).
+    if point1.ndim == 1:
+        point1 = np.expand_dims(point1, axis=1)
+    if point2.ndim == 1:
+        point2 = np.expand_dims(point2, axis=1)
+    if point1.shape[0] == 1:
+        point1 = point1.reshape(-1, 1)
+    if point2.shape[0] == 1:
+        point2 = point2.reshape(-1, 1)
+    dimension_num = point1.shape[0]
+    result = 0
+    if method == "Euclid":
+        if dimension_num != point1.size:
+            print("error")
+            return -1
+        for iter in range(dimension_num):
+            result += (point1[iter, 0] - point2[iter, 0]) ** 2
+        return pow(result, 0.5)
+    if method == "Manhattan":
+        if dimension_num != point1.size:
+            print("error")
+            return -1
+        for iter in range(dimension_num):
+            result += abs(point1[iter, 0] - point2[iter, 0])
+        return result
+
+def dis(dis_label):
+    # Sort key: extract the distance from a (distance, label) pair.
+    return dis_label[0]
+
+def nearest_k_label_max(point, point_arr, label_arr, k):
+    distance_arr = []
+    for iter in range(len(point_arr)):
+        distance_arr.append((distance(point, point_arr[iter]), label_arr[iter]))
+    distance_arr.sort(key=dis)
+    result = []
+    for iter in range(k):
+        result.append(distance_arr[iter][1])
+    # Return the most frequent label among the k nearest points.
+    return max(result, key=result.count)
+
+class KNN:
+
+    def __init__(self):
+        pass
+
+    def fit(self, train_data, train_label):
+        num = train_data.shape[0]
+        dimension_num = train_data.shape[1]
+        self.train_data = train_data
+        self.train_label = train_label
+        # Hold out 10% of the training set as a development split.
+        dev_num = int(num * 0.1)
+        dev_data = train_data[:dev_num]
+        dev_label = train_label[:dev_num]
+        train_data = train_data[dev_num:]
+        train_label = train_label[dev_num:]
+        correct_count_max = 0
+        k_max = 1  # fall back to k = 1 if no candidate improves on zero accuracy
+        accu = []
+        if dev_num == 0:
+            print("points number too few, so we choose k = 1")
+            self.k = 1
+            return
+
+        # Find the best k in 1..10 (capped by the reduced training-set size).
+        for iter in range(1, min(num - dev_num, 10) + 1):
+            correct_count = 0
+            for j in range(len(dev_data)):
+                predict_label = nearest_k_label_max(dev_data[j], train_data, train_label, iter)
+                if predict_label == dev_label[j]:
+                    correct_count += 1
+            if correct_count > correct_count_max:
+                correct_count_max = correct_count
+                k_max = iter
+            accu.append(correct_count / dev_num)
+        x = range(1, min(num - dev_num, 10) + 1)
+        # This part is only for the experiments, so it is commented out for the auto test.
+        # plt.plot(x, accu)
+        # plt.show()
+        self.k = k_max
+        print("choose k=", k_max)
+
+    def predict(self, test_data):
+        result = []
+        for iter in range(len(test_data)):
+            result.append(nearest_k_label_max(test_data[iter, :], self.train_data, self.train_label, self.k))
+        return np.array(result)
+
+# Helper utilities.
+def data_generate_and_save(class_num, mean_list, cov_list, num_list, save_path=""):
+    """
+    class_num: the number of classes
+    mean_list: mean_list[i] stands for the mean of class[i]
+    cov_list: similar to mean_list, stands for the covariance
+    num_list: similar to mean_list, stands for the number of points in class[i]
+    save_path: the data storage path, ending with a slash
+    """
+    data = np.random.multivariate_normal(mean_list[0], cov_list[0], (num_list[0],))
+    label = np.zeros((num_list[0],), dtype=int)
+    total = num_list[0]
+
+    for iter in range(1, class_num):
+        temp = np.random.multivariate_normal(mean_list[iter], cov_list[iter], (num_list[iter],))
+        label_temp = np.ones((num_list[iter],), dtype=int) * iter
+        data = np.concatenate([data, temp])
+        label = np.concatenate([label, label_temp])
+        total += num_list[iter]
+
+    # Shuffle, then split 80% train / 20% test.
+    idx = np.arange(total)
+    np.random.shuffle(idx)
+    data = data[idx]
+    label = label[idx]
+    train_num = int(total * 0.8)
+    train_data = data[:train_num]
+    test_data = data[train_num:]
+    train_label = label[:train_num]
+    test_label = label[train_num:]
+    np.save(save_path + "data.npy", ((train_data, train_label), (test_data, test_label)))
+
+def data_load(path=""):
+    (train_data, train_label), (test_data, test_label) = np.load(path + "data.npy", allow_pickle=True)
+    return (train_data, train_label), (test_data, test_label)
+
+def visualize(data, label, class_num=1, test_data=[]):
+    # Only the first two dimensions are plotted.
+    data_x = {}
+    data_y = {}
+    for iter in range(class_num):
+        data_x[iter] = []
+        data_y[iter] = []
+    for iter in range(len(label)):
+        data_x[label[iter]].append(data[iter, 0])
+        data_y[label[iter]].append(data[iter, 1])
+    colors = cm.rainbow(np.linspace(0, 1, class_num))
+
+    for class_idx, c in zip(range(class_num), colors):
+        plt.scatter(data_x[class_idx], data_y[class_idx], color=c)
+    if len(test_data) != 0:
+        plt.scatter(test_data[:, 0], test_data[:, 1], marker='+')
+    plt.show()
+
+# Experiment entry point.
+if __name__ == "__main__":
+    mean_list = [(1, 4), (2, 3), (2, 3)]
+    cov_list = [np.array([[10, 0], [0, 2]]), np.array([[7, 0], [0, 1]]), np.array([[1, 0], [0, 1]])]
+    num_list = [200, 200, 200]
+    save_path = ""
+    data_generate_and_save(3, mean_list, cov_list, num_list, save_path)
+    # (train_data, train_label), (test_data, test_label) = data_load()
+    # visualize(train_data, train_label, 3)
\ No newline at end of file
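As an appendix to the metric comparison in Experiment 2 of the README above, the Euclidean-vs-Manhattan behaviour of a plain majority-vote KNN can be reproduced in miniature. This sketch is standard-library only and uses illustrative isotropic Gaussians rather than the report's covariance matrices; `knn_predict`, `accuracy`, and `cloud` are names invented here, not part of the submission.

```python
import math
import random
from collections import Counter

def knn_predict(point, data, labels, k, method="Euclid"):
    # Same two metrics as the submission's distance(): Euclidean and Manhattan.
    if method == "Euclid":
        dist = lambda a, b: math.dist(a, b)
    else:
        dist = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
    nearest = sorted(zip(data, labels), key=lambda dl: dist(point, dl[0]))[:k]
    return Counter(l for _, l in nearest).most_common(1)[0][0]

def accuracy(train, train_l, test, test_l, k, method):
    hits = sum(knn_predict(p, train, train_l, k, method) == l
               for p, l in zip(test, test_l))
    return hits / len(test)

random.seed(0)

def cloud(mx, my, n):
    # Isotropic unit-variance Gaussian cloud around (mx, my).
    return [(random.gauss(mx, 1.0), random.gauss(my, 1.0)) for _ in range(n)]

data = cloud(0, 0, 100) + cloud(3, 3, 100)
labels = [0] * 100 + [1] * 100
pairs = list(zip(data, labels))
random.shuffle(pairs)
train, test = pairs[:160], pairs[160:]  # 80/20 split, as in the report
tr_d, tr_l = zip(*train)
te_d, te_l = zip(*test)

acc_euclid = accuracy(tr_d, tr_l, te_d, te_l, 5, "Euclid")
acc_manhattan = accuracy(tr_d, tr_l, te_d, te_l, 5, "Manhattan")
print(acc_euclid, acc_manhattan)
```

With well-separated clusters like these, both metrics should score similarly high, consistent with the report's observation that the metric choice matters mainly when the classes overlap.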