diff --git a/assignment-1/submission/17307130331/README.md b/assignment-1/submission/17307130331/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..12fe357e4a1270b23221cab14d6967790723ce97
--- /dev/null
+++ b/assignment-1/submission/17307130331/README.md
@@ -0,0 +1,173 @@
+# Course Report
+## 1 KNN Class Implementation
+
+The KNN class follows the `sklearn` interface: it implements the `fit`, `predict`, and `score` methods, together with helper functions that compute the distance between two vectors and preprocess the data.
+The parameters of the KNN class are:
+- `k`: the hyperparameter k.
+- `metric`: which distance function to use; valid values are `'euclidean'`, `'manhattan'`, and `'cosine'`.
+- `norm`: which data normalization scheme to use; valid values are `'None'`, `'zscore'`, and `'minmax'`.
+
+These parameters are supplied when the class is instantiated (or left at their default values) and are stored by the `__init__` method.
+
+### 1.1 fit()
+
+Interface: `fit(self, train_data, train_label)`.
+The inputs are two `numpy` arrays: the training samples and the training labels.
+
+`fit` does the following:
+- If `norm` is not `'None'`, apply the corresponding preprocessing to the input data.
+- Store the training labels.
+- Store the number of classes in the training set.
+
+### 1.2 predict()
+
+Interface: `predict(self, test_data)`.
+
+`predict` first applies to the test data the same preprocessing that was applied to the training samples (or none at all). For z-score normalization, the mean and standard deviation used are the previously stored statistics of the training set; for min-max normalization, the minimum and maximum are likewise those of the training set.
+
+It then computes the pairwise distances between the test and training samples as a `numpy` matrix of shape (test_shape[0], train_shape[0]). `np.argpartition` partially sorts each row of the distance matrix so that its first k entries are the indices of the row's k smallest elements (not necessarily in order). Keeping the first k columns of the index matrix therefore records the k nearest neighbors of every test sample.
+
+Finally, the neighbors' labels are looked up in `train_label` and a majority vote is taken, which can be done with `np.bincount` combined with `np.argmax`, yielding `test_label`.
+
+### 1.3 Other Functions
+
+Distance computation:
+- `euclidean_(self, x, y)`: the Euclidean distance between vectors x and y.
+- `manhattan_(self, x, y)`: the Manhattan distance between vectors x and y.
+- `cosine(self, x, y)`: the cosine distance between vectors x and y.
+
+Normalization:
+- `zscore_norm_(self, data)`: z-score normalization of the input data; if a mean and standard deviation have already been stored, they are reused rather than recomputed.
+- `minmax_norm_(self, data)`: min-max normalization of the input data, with the same caching behavior.
+
+`score(self, test_data, test_label)`: follows the `sklearn` interface and directly computes the accuracy on the test data.
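+The following is a minimal, self-contained sketch of the neighbor search and voting step described in Sections 1.2 and 1.3 (not the class itself; `knn_predict` is an illustrative name, and Euclidean distance with non-negative integer labels is assumed):
+
+```python
+import numpy as np
+
+def knn_predict(train_data, train_label, test_data, k=3):
+    # Pairwise Euclidean distances, shape (n_test, n_train).
+    diff = test_data[:, None, :] - train_data[None, :, :]
+    dist = np.sqrt((diff ** 2).sum(axis=-1))
+
+    # Partial sort: the first k columns are the indices of the k
+    # smallest distances in each row, in arbitrary order.
+    # Assumes k < the number of training samples.
+    neighbors = np.argpartition(dist, k, axis=1)[:, :k]
+
+    # Majority vote over the neighbors' labels.
+    return np.array([np.argmax(np.bincount(train_label[row]))
+                     for row in neighbors])
+```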
+## 2 Experiments
+
+### 2.1 Experiment 1: Choice of k
+
+#### 2.1.1 Dataset 1: distant class centers, clear boundaries
+
+This dataset contains three classes, each consisting of 500 points sampled from a two-dimensional Gaussian distribution. The means and (diagonal) covariances of the three classes are:
+
+$$ \mu_{1} = [0,0] \quad \mu_{2} = [20,20] \quad \mu_{3} = [-20,20] $$
+$$ \Sigma_{1} = [85,77] \quad \Sigma_{2} = [63,78] \quad \Sigma_{3} = [68,19] $$
+
+Dataset distribution:
+
+![img](img/datadistribution1.png)
+
+After splitting the data 4:1 into training and test sets, 5-fold cross-validation is run on the **training set** (a sketch of this procedure is given at the end of Section 2.1). With the distance function fixed to 'euclidean' and the normalization scheme fixed to zscore, the mean cross-validation accuracy is computed for every k in [1,20]; the results are:
+
+![img2](img/crossvalidationofk1.png)
+
+On this dataset, KNN accuracy is high overall (above 92% throughout). For small k, accuracy rises quickly as k grows; for large k, it levels off and fluctuates around a stable value. The highest accuracy, 95.17%, is reached at k=16.
+
+With k=16, the test accuracy is 94.33%.
+The distribution of the test set and the model's predictions are shown below:
+
+![img3](img/testdistribution1.png)
+
+The misclassified points cluster near the boundaries between the three classes.
+
+#### 2.1.2 Dataset 2: close class centers, strongly overlapping classes
+
+Dataset 2 likewise contains three classes of 500 samples each. The means and covariances are:
+
+$$ \mu_{1} = [12,13] \quad \mu_{2} = [-1,14] \quad \mu_{3} = [6,15] $$
+$$ \Sigma_{1} = [71,53] \quad \Sigma_{2} = [8,47] \quad \Sigma_{3} = [50,35] $$
+
+![img4](img/datadistribution2.png)
+
+With the same 4:1 train/test split, 5-fold cross-validation over k gives:
+
+![img5](img/crossvalidationofk2.png)
+
+On this dataset, KNN accuracy is markedly lower than on dataset 1, falling between 52% and 60%. Accuracy still rises quickly for small k and levels off for large k. The highest accuracy, 60%, is reached at k=14.
+
+With k=14, the accuracy on the test set is 65%.
+
+![img6](img/testdistribution2.png)
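+As referenced above, here is a minimal sketch of the 5-fold cross-validation loop used to select k (it assumes the KNN class described in Section 1; `cv_accuracy` is an illustrative helper, not part of the submitted code):
+
+```python
+import numpy as np
+
+def cv_accuracy(train_data, train_label, k, folds=5):
+    # Mean validation accuracy of one k over 5 folds of the training set.
+    idx = np.random.permutation(len(train_data))
+    parts = np.array_split(idx, folds)
+    scores = []
+    for i in range(folds):
+        val = parts[i]
+        trn = np.concatenate([parts[j] for j in range(folds) if j != i])
+        model = KNN(k=k, metric='euclidean', norm='zscore')
+        model.fit(train_data[trn], train_label[trn])
+        scores.append(model.score(train_data[val], train_label[val]))
+    return np.mean(scores)
+
+# Pick the k in [1, 20] with the best mean cross-validation accuracy.
+best_k = max(range(1, 21),
+             key=lambda k: cv_accuracy(train_data, train_label, k))
+```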
+### 2.2 Experiment 2: Choice of Distance Function
+
+Three distance functions are implemented: the Manhattan distance, the Euclidean distance, and the cosine distance, where the cosine distance is 1 minus the cosine of the angle between the two vectors. 5-fold cross-validation on datasets 1 and 2 gives the following accuracies:
+
+| metric | k | dataset 1 | dataset 2 |
+|:-|:-|:--|:--|
+| manhattan | 16 | 0.948333 | **0.599167** |
+| cosine | 16 | 0.935833 | 0.565833 |
+| euclidean | 16 | **0.951667** | 0.597500 |
+| manhattan | 14 | 0.949167 | 0.592500 |
+| cosine | 14 | 0.934167 | 0.556667 |
+| euclidean | 14 | **0.950000** | **0.600833** |
+| manhattan | 12 | 0.945833 | 0.594167 |
+| cosine | 12 | 0.934167 | 0.552500 |
+| euclidean | 12 | **0.946667** | **0.596667** |
+
+On both datasets, the euclidean and manhattan distances differ little in accuracy, with euclidean slightly ahead overall. The cosine distance is clearly less accurate on both datasets, indicating that it does not suit data of this kind. For these two datasets, the euclidean distance is therefore the better choice.
+
+### 2.3 Experiment 3: Data Normalization
+
+First, taking dataset 1 as an example, the different normalization schemes are tested for several values of k, with the distance function fixed to euclidean. The two dimensions of dataset 1 are similarly distributed, with ranges of about -45 to 43 and -26 to 39.
+
+| k | normalization | accuracy |
+|:-|:-|:-|
+| 16 | zscore | **0.951667** |
+| 16 | minmax | 0.948333 |
+| 16 | None | 0.948333 |
+| 10 | zscore | **0.945833** |
+| 10 | minmax | **0.945833** |
+| 10 | None | 0.945000 |
+
+Next, the difference between the distributions of the two dimensions is reduced. Dataset 4 is generated with means and covariances:
+
+$$ \mu_{1} = [0,0] \quad \mu_{2} = [6,12] \quad \mu_{3} = [5,7] $$
+$$ \Sigma_{1} = [7,5] \quad \Sigma_{2} = [1,7] \quad \Sigma_{3} = [5,7] $$
+
+![imgsss](img/datadistribution4.png)
+
+| k | normalization | accuracy |
+|:-|:-|:-|
+| 16 | zscore | **0.890000** |
+| 16 | minmax | 0.887500 |
+| 16 | None | 0.886667 |
+| 10 | zscore | **0.881667** |
+| 10 | minmax | 0.879167 |
+| 10 | None | 0.880000 |
+
+Then the difference between the two dimensions is made much larger. Dataset 3 is generated with means and covariances:
+
+$$ \mu_{1} = [0,1500] \quad \mu_{2} = [-20,1240] \quad \mu_{3} = [55,788] $$
+$$ \Sigma_{1} = [77,10105] \quad \Sigma_{2} = [220,9567] \quad \Sigma_{3} = [330,77500] $$
+
+![imgsds](img/datadistribution3.png)
+
+Test results:
+
+| k | normalization | accuracy |
+|:-|:-|:-|
+| 16 | zscore | **0.960833** |
+| 16 | minmax | **0.960833** |
+| 16 | None | 0.954167 |
+| 10 | zscore | **0.961667** |
+| 10 | minmax | **0.961667** |
+| 10 | None | 0.952500 |
+
+In all of the experiments above, z-score normalization achieves the highest accuracy. Comparing min-max normalization with no normalization at all, no normalization sometimes does better. Normalization gives a modest accuracy improvement overall, and in these experiments the improvement becomes more pronounced as the variance of the data grows and as the ranges of the feature dimensions diverge.
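+For reference, a minimal sketch of the two normalization schemes as used here, where the statistics always come from the training set and are reused for the test set (the function names are illustrative, not the class methods):
+
+```python
+import numpy as np
+
+def zscore_fit_transform(train_data, test_data):
+    mean = train_data.mean(axis=0)
+    std = train_data.std(axis=0)
+    std = np.where(std == 0, 1, std)  # guard against constant features
+    return (train_data - mean) / std, (test_data - mean) / std
+
+def minmax_fit_transform(train_data, test_data):
+    lo = train_data.min(axis=0)
+    span = train_data.max(axis=0) - lo
+    span = np.where(span == 0, 1, span)  # guard against constant features
+    return (train_data - lo) / span, (test_data - lo) / span
+```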
+## 3 Summary
+
+KNN is a classic model. It classifies well (above 90% accuracy here) when the data are compact within each class, the class boundaries are clear, and the classes are linearly separable. Besides the data distribution, the choice of k also has a significant effect: in general, accuracy rises quickly as k grows while k is small; once k reaches a certain size, the gains slow down and the accuracy plateaus, fluctuating slightly.
+
+Another factor affecting KNN is the choice of distance function, which should match the characteristics of the dataset; for these experiments, the Euclidean distance is the most suitable. Note, however, that both the Euclidean and the Manhattan distance weight differences in every feature dimension equally, which may not reflect reality. For special kinds of samples, such as text embedding vectors, the cosine distance may be a better fit than the Euclidean or Manhattan distance.
+
+Finally, certain preprocessing steps help many machine learning models, KNN included. Data normalization, which rescales all feature dimensions to a common range, prevents models such as neural networks and Logistic Regression from having their learning dominated by a dimension with large values; for KNN, it keeps any single dimension from carrying too much weight in the distance computation and makes the result less sensitive to outliers.
+
+KNN is a lazy-learning model: training merely stores the data and labels, and almost all computation happens at test time. This not only requires extra storage but also means the model never learns anything task-specific; it learns no per-class feature representations. When the data are not linearly separable in the original feature space, KNN performs poorly (e.g., only about 55% accuracy on dataset 2). It also struggles with more complex tasks such as image classification: comparing two images by pixel-wise similarity is clearly unreasonable, since converting an image to grayscale or darkening it changes the pixel values drastically, and such a comparison cannot account for translation, rotation, or scaling of the same object within an image. Moreover, with a large training set, inference is computationally expensive and slow, which conflicts with the usual pattern of training offline for a long time and predicting online in real time. The practical applications of KNN are therefore limited.
\ No newline at end of file
diff --git a/assignment-1/submission/17307130331/img/crossvalidationofk1.png b/assignment-1/submission/17307130331/img/crossvalidationofk1.png
new file mode 100644
index 0000000000000000000000000000000000000000..8f1f2bd0b2e60ddb4f0e74e7168313e9508ecf78
Binary files /dev/null and b/assignment-1/submission/17307130331/img/crossvalidationofk1.png differ
diff --git a/assignment-1/submission/17307130331/img/crossvalidationofk2.png b/assignment-1/submission/17307130331/img/crossvalidationofk2.png
new file mode 100644
index 0000000000000000000000000000000000000000..72540b76e3e63a372c7c84c1d2c81067457c3a8a
Binary files /dev/null and b/assignment-1/submission/17307130331/img/crossvalidationofk2.png differ
diff --git a/assignment-1/submission/17307130331/img/datadistribution1.png b/assignment-1/submission/17307130331/img/datadistribution1.png
new file mode 100644
index 0000000000000000000000000000000000000000..50be18f29f914f96590b8d6f7b402722ca20e2c8
Binary files /dev/null and b/assignment-1/submission/17307130331/img/datadistribution1.png differ
diff --git a/assignment-1/submission/17307130331/img/datadistribution2.png b/assignment-1/submission/17307130331/img/datadistribution2.png
new file mode 100644
index 0000000000000000000000000000000000000000..054b8162d2aae5523671b2329e07552aa0bffd23
Binary files /dev/null and b/assignment-1/submission/17307130331/img/datadistribution2.png differ
diff --git a/assignment-1/submission/17307130331/img/datadistribution3.png b/assignment-1/submission/17307130331/img/datadistribution3.png
new file mode 100644
index 0000000000000000000000000000000000000000..c62a155d865a298069d6f86c281e0406a3987bf9
Binary files /dev/null and b/assignment-1/submission/17307130331/img/datadistribution3.png differ
diff --git a/assignment-1/submission/17307130331/img/datadistribution4.png b/assignment-1/submission/17307130331/img/datadistribution4.png
new file mode 100644
index 0000000000000000000000000000000000000000..1a22d7fa4a23460987729f2e94368c93e860a4cf
Binary files /dev/null and b/assignment-1/submission/17307130331/img/datadistribution4.png differ
diff --git a/assignment-1/submission/17307130331/img/test.png b/assignment-1/submission/17307130331/img/test.png
new file mode 100644
index 0000000000000000000000000000000000000000..845e9c7bda12cdf6a2bcc06f77a96c00023b94fe
Binary files /dev/null and b/assignment-1/submission/17307130331/img/test.png differ
diff --git a/assignment-1/submission/17307130331/img/testdistribution1.png b/assignment-1/submission/17307130331/img/testdistribution1.png
new file mode 100644
index 0000000000000000000000000000000000000000..a2092a31a57b306f7717ace73d748710d9e61c8b
Binary files /dev/null and b/assignment-1/submission/17307130331/img/testdistribution1.png differ
diff --git a/assignment-1/submission/17307130331/img/testdistribution2.png b/assignment-1/submission/17307130331/img/testdistribution2.png
new file mode 100644
index 0000000000000000000000000000000000000000..228c90210262bdebcca32aa440e6a6cc7dd7a77c
Binary files /dev/null and b/assignment-1/submission/17307130331/img/testdistribution2.png differ
diff --git a/assignment-1/submission/17307130331/img/train.png b/assignment-1/submission/17307130331/img/train.png
new file mode 100644
index 0000000000000000000000000000000000000000..99f609e925d49742993be736f9c2e86abd15cb80
Binary files /dev/null and b/assignment-1/submission/17307130331/img/train.png differ
diff --git a/assignment-1/submission/17307130331/source.py b/assignment-1/submission/17307130331/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e73072b0806ee8ad64c14a1fc2e240fa4644d63
--- /dev/null
+++ b/assignment-1/submission/17307130331/source.py
@@ -0,0 +1,161 @@
+import sys
+
+import matplotlib.pyplot as plt
+import numpy as np
+
+
+class KNN:
+
+    def __init__(self, norm='zscore', metric='euclidean', k=3):
+        self.train_data = None
+        self.train_label = None
+        self.num_of_class = None
+        self.k = k
+        self.norm = norm
+        self.metric = metric
+        # Normalization statistics, computed once on the training data
+        # and reused for the test data.
+        self.min = None
+        self.max = None
+        self.mean = None
+        self.std = None
+
+    def get_params(self, deep=True):
+        return {'norm': self.norm, 'metric': self.metric, 'k': self.k}
+
+    def manhattan_(self, x, y):
+        return np.sum(np.abs(x - y))
+
+    def euclidean_(self, x, y):
+        return np.sqrt(np.sum((x - y) ** 2))
+
+    def cosine(self, x, y):
+        return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
+
+    def zscore_norm_(self, data):
+        # Compute the statistics only on the first call (the training data).
+        if self.mean is None:
+            self.mean = np.mean(data, axis=0)
+        if self.std is None:
+            self.std = np.std(data, axis=0)
+        # Avoid division by zero for constant features.
+        std = np.where(self.std == 0, 1, self.std)
+        return (data - self.mean) / std
+
+    def minmax_norm_(self, data):
+        if self.min is None:
+            self.min = np.min(data, axis=0)
+        if self.max is None:
+            self.max = np.max(data, axis=0)
+        # Avoid division by zero for constant features.
+        span = np.where(self.max - self.min == 0, 1, self.max - self.min)
+        return (data - self.min) / span
+
+    def fit(self, train_data, train_label):
+        # Normalize (if requested) and store the training data and labels.
+        if self.norm == 'zscore':
+            self.train_data = self.zscore_norm_(train_data)
+        elif self.norm == 'minmax':
+            self.train_data = self.minmax_norm_(train_data)
+        else:
+            self.train_data = train_data
+
+        self.train_label = train_label
+        self.num_of_class = len(set(train_label.tolist()))
+
+    def predict(self, test_data):
+        # Apply the same preprocessing as in fit(), reusing the stored
+        # training-set statistics.
+        if self.norm == 'zscore':
+            self.test_data = self.zscore_norm_(test_data)
+        elif self.norm == 'minmax':
+            self.test_data = self.minmax_norm_(test_data)
+        else:
+            self.test_data = test_data
+
+        # Pairwise distance matrix of shape (n_test, n_train);
+        # unknown metric values fall back to the Euclidean distance.
+        if self.metric == 'manhattan':
+            dist = self.manhattan_
+        elif self.metric == 'cosine':
+            dist = self.cosine
+        else:
+            dist = self.euclidean_
+        self.distance_matrix_ = np.array(
+            [[dist(x, y) for y in self.train_data] for x in self.test_data])
+
+        # Indices of the k nearest training samples for each test sample
+        # (requires self.k < number of training samples).
+        k_ = np.argpartition(self.distance_matrix_, self.k)[:, 0:self.k]
+
+        # Majority vote over the neighbors' labels.
+        self.test_label = np.argmax(
+            np.array([np.bincount(self.train_label[k_][i], minlength=self.num_of_class)
+                      for i in range(k_.shape[0])]),
+            axis=1)
+
+        return self.test_label
+
+    def score(self, test_data, test_label):
+        y_pred = self.predict(test_data)
+        acc = np.mean(np.equal(y_pred, test_label))
+        return acc
+
+
+def generate():
+    # Sample three 2-D Gaussian classes with random diagonal covariances.
+    mean1 = [0, 0]
+    mean2 = [20, 20]
+    mean3 = [-20, 20]
+    cov1 = np.diag(np.random.randint(0, 100, 2))
+    cov2 = np.diag(np.random.randint(0, 100, 2))
+    cov3 = np.diag(np.random.randint(0, 100, 2))
+    c1_x = np.random.multivariate_normal(mean=mean1, cov=cov1, size=600)
+    c2_x = np.random.multivariate_normal(mean=mean2, cov=cov2, size=400)
+    c3_x = np.random.multivariate_normal(mean=mean3, cov=cov3, size=500)
+    x_data = np.concatenate([c1_x, c2_x, c3_x], axis=0)
+    y_data = np.concatenate(
+        [[0] * c1_x.shape[0], [1] * c2_x.shape[0], [2] * c3_x.shape[0]], axis=0)
+
+    # Shuffle, then split 4:1 into training and test sets.
+    idx = np.arange(len(x_data))
+    np.random.shuffle(idx)
+    x_data = x_data[idx]
+    y_data = y_data[idx]
+
+    split = int(0.8 * x_data.shape[0])
+    train_data, test_data = x_data[:split], x_data[split:]
+    train_label, test_label = y_data[:split], y_data[split:]
+    np.save(
+        'data.npy',
+        (
+            (train_data, train_label),
+            (test_data, test_label)
+        )
+    )
+
+
+def read():
+    (train_data, train_label), (test_data, test_label) = np.load(
+        'data.npy', allow_pickle=True)
+    return (train_data, train_label), (test_data, test_label)
+
+
+def display(data, label, name):
+    # Scatter-plot each class in its own color and save the figure.
+    for i in set(label.tolist()):
+        points = data[label == i, :]
+        plt.scatter(points[:, 0], points[:, 1], label=str(i))
+    plt.savefig(f'img/{name}')
+    plt.show()
+
+
+if __name__ == "__main__":
+    if len(sys.argv) > 1 and sys.argv[1] == "g":
+        generate()
+    if len(sys.argv) > 1 and sys.argv[1] == "d":
+        (train_data, train_label), (test_data, test_label) = read()
+        display(train_data, train_label, 'train')
+        display(test_data, test_label, 'test')
+    else:
+        (train_data, train_label), (test_data, test_label) = read()
+        # Choose k, the distance metric, and the normalization scheme.
+        model = KNN(k=9, metric='euclidean', norm='zscore')
+        model.fit(train_data, train_label)
+        res = model.predict(test_data)
+        print("acc =", np.mean(np.equal(res, test_label)))
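+# Usage (per the argv handling above):
+#   python source.py g   # regenerate data.npy (the else branch then also trains on it)
+#   python source.py d   # plot the training and test set distributions
+#   python source.py     # train KNN on data.npy and print the test accuracy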