diff --git a/assignment-1/submission/17307110367/README.md b/assignment-1/submission/17307110367/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f34b168681d1fdb43378914439a666a038baab1d
--- /dev/null
+++ b/assignment-1/submission/17307110367/README.md
@@ -0,0 +1,163 @@
# Course Report

My KNN model itself **uses only the numpy package**, so the code should pass the automated test that restricts the allowed dependencies.

## 1. Generating and splitting the dataset

I first generated a dataset consisting of three classes of points drawn from 2-D Gaussian distributions with the following parameters.

Class 1, 800 points, labelled 0:

$$
\begin{array}{l}
\Sigma=\left[\begin{array}{cc}
73 & 0 \\\\
0 & 22
\end{array}\right] \\\\
\mu=\left[\begin{array}{ll}
1 & 2
\end{array}\right]
\end{array}
$$

Class 2, 200 points, labelled 1:

$$
\begin{array}{l}
\Sigma=\left[\begin{array}{cc}
21.2 & 0 \\\\
0 & 32.1
\end{array}\right] \\\\
\mu=\left[\begin{array}{ll}
16 & -5
\end{array}\right]
\end{array}
$$

Class 3, 1000 points, labelled 2:

$$
\begin{array}{l}
\Sigma=\left[\begin{array}{cc}
10 & 5 \\\\
5 & 10
\end{array}\right] \\\\
\mu=\left[\begin{array}{ll}
10 & 22
\end{array}\right]
\end{array}
$$

Mixing these points with their labels and shuffling the order gives the full dataset. 80% of it (1600 pairs) is taken as the training set.

This is the training set I generated:

![train_1](./img/train_1.png)

The remaining 20% (400 pairs) form the test set.

This is the test set I generated:

![test_1](./img/test_1.png)

## 2. Building the KNN model

My KNN model has two parts. The `fit` function splits the given training set into a training part and a validation part, and automatically selects the best value of K according to the model's performance on the validation part. The `predict` function then uses the selected K to predict the labels of the test set.

### 2.1 The fit function

`fit` consists of the following steps:

(1) Shuffle the given training set and split off 20% as the validation set; the remaining 80% is kept as the training set.

(2) Iterate over the candidate values of K; with K temporarily fixed, evaluate the model on the validation set. If the training set has fewer than 20 points, the candidate values are `range(1, training set size, 2)`; otherwise they are `range(1, 20, 2)`.

(3) Take the K whose model achieves the highest validation accuracy; this value is then used in `predict`.

### 2.2 The predict function

`predict` consists of the following steps:

(1) Iterate over the test points. For each test point, compute its distance to every point in the training set.

(2) Sort the distances in ascending order and take the first K points.

(3) Count how many of these K neighbours belong to each class.

(4) The class with the largest count is the prediction.

## 3. Results and analysis

Running `python source.py g` from the command line generates the dataset and prints the accuracy. Because the randomly generated dataset differs slightly from run to run, the selected K and the accuracy also vary slightly. Repeating the experiment 10 times gives the following results:

Run | K | Accuracy
---|---|---
1 | 11 | 0.96
2 | 9 | 0.9675
3 | 15 | 0.955
4 | 11 | 0.9475
5 | 5 | 0.94
6 | 5 | 0.94
7 | 11 | 0.96
8 | 7 | 0.945
9 | 7 | 0.95
10 | 19 | 0.955

Averaging over these 10 runs, the final accuracy of the model is 0.952, which is fairly high. The accuracy is below 1 because the blue and orange points overlap to some extent, and even for us it is hard to tell which class a point in the overlap region belongs to. This kind of error seems unavoidable for KNN.

## 4. Experiments with modified datasets

### 4.1 Changing the distance between the Gaussians

Our expectation: with everything else fixed, the farther apart the Gaussians are, the better separated the data and the easier it is for KNN to predict correctly; the closer they are, the lower the accuracy.

#### (1) Parameters that separate the three classes more

Change the mean of class 2 so that it lies farther from the other two classes:

$$
\mu_2 = [\begin{matrix}30 & -20\end{matrix}]
$$

The test set then looks like this:

![test_2](./img/test_2.png)

Over several runs the accuracy is 0.99, which matches our expectation.

#### (2) Parameters that bring the three classes closer together

Change the means so that the classes lie closer together:

$$
\mu_1 = [\begin{matrix}1 & 2\end{matrix}]
$$

$$
\mu_2 = [\begin{matrix}15 & 0\end{matrix}]
$$

$$
\mu_3 = [\begin{matrix}10 & 10\end{matrix}]
$$

The test set then looks like this:

![test_3](./img/test_3.png)

The mean accuracy drops to 0.84, which matches our expectation.

### 4.2 Changing the variance of the Gaussians

Our expectation: with everything else fixed, the larger the variance, the more the classes mix and the worse KNN performs; the smaller the variance, the more concentrated the data and the better KNN performs.

#### Parameters that give classes 2 and 3 a larger covariance

Change the covariance matrices of classes 2 and 3 both to:

$$
\begin{array}{l}
\Sigma=\left[\begin{array}{cc}
73 & 0 \\\\
0 & 22
\end{array}\right]
\end{array}
$$

The test set then looks like this:

![test_4](./img/test_4.png)

The mean accuracy is only 0.74, which matches our expectation; KNN clearly struggles in this setting.

### 4.3 Changing the amount of data

#### (1) Doubling every class

Double the training and test data of classes 1, 2 and 3.

The test set then looks like this:

![test_5](./img/test_5.png)

Averaging over several runs, the accuracy is about 0.956, only marginally higher than before. A likely reason is that doubling the data also doubles the number of points in the overlap regions, and the model can hardly classify those overlapping points correctly, so the accuracy barely changes.

#### (2) Doubling only class 1

Double only the training and test data of class 1.

The test set then looks like this:

![test_6](./img/test_6.png)

Averaging over several runs, the accuracy is about 0.96, a small improvement. With class 1 doubled, points in the overlap regions are more likely to be voted into class 1, so the overall accuracy is bound to rise somewhat.
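
One way to check this explanation (and the earlier remarks about the overlap region) is to look at per-class accuracy instead of only the overall mean. Below is a minimal sketch, assuming `res` and `test_label` as produced in the `__main__` block of `source.py`; it is an illustration, not part of the submitted code:

```python
import numpy as np

# fraction of test points of each true class that are predicted correctly;
# `res` and `test_label` are assumed to come from source.py's __main__ block
for c in range(3):
    mask = (test_label == c)
    print(f"class {c}: n = {mask.sum()}, acc = {np.mean(res[mask] == test_label[mask]):.3f}")
```

If the explanation is right, doubling the first class (label 0) should raise its per-class accuracy while slightly lowering that of the classes it overlaps with.
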
diff --git a/assignment-1/submission/17307110367/img/.keep b/assignment-1/submission/17307110367/img/.keep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assignment-1/submission/17307110367/img/test_1.png b/assignment-1/submission/17307110367/img/test_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..eec47aac6c3a4a91fbb79b4e46c19b4dcaf56ad7
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_1.png differ
diff --git a/assignment-1/submission/17307110367/img/test_2.png b/assignment-1/submission/17307110367/img/test_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..b7273b4820997fc7ccf6f1238e0f6d5dc53776f1
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_2.png differ
diff --git a/assignment-1/submission/17307110367/img/test_3.png b/assignment-1/submission/17307110367/img/test_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..1389dfe352cc414a710dab394b0a5cbd5173c5e9
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_3.png differ
diff --git a/assignment-1/submission/17307110367/img/test_4.png b/assignment-1/submission/17307110367/img/test_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..c33f255f7f5c059f8b3fa50fd5f02ebdf81a8204
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_4.png differ
diff --git a/assignment-1/submission/17307110367/img/test_5.png b/assignment-1/submission/17307110367/img/test_5.png
new file mode 100644
index 0000000000000000000000000000000000000000..f15a0935c5fc56cadf8f259b40b0b960831d256f
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_5.png differ
diff --git a/assignment-1/submission/17307110367/img/test_6.png b/assignment-1/submission/17307110367/img/test_6.png
new file mode 100644
index 0000000000000000000000000000000000000000..528bd1180529c99a903a7affc9c08cbb6f0af35b
Binary files /dev/null and b/assignment-1/submission/17307110367/img/test_6.png differ
diff --git a/assignment-1/submission/17307110367/img/train_1.png b/assignment-1/submission/17307110367/img/train_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..580335fd8c4753ea64df8189940b63039eb02168
Binary files /dev/null and b/assignment-1/submission/17307110367/img/train_1.png differ
diff --git a/assignment-1/submission/17307110367/source.py b/assignment-1/submission/17307110367/source.py
new file mode 100644
index 0000000000000000000000000000000000000000..bc70f0c0f1ea938f09f35eee810449c19758ef39
--- /dev/null
+++ b/assignment-1/submission/17307110367/source.py
@@ -0,0 +1,172 @@
import sys
import numpy as np
import matplotlib.pyplot as plt


class KNN:

    def __init__(self):
        self.train_data = None
        self.train_label = None
        self.k = None

    def fit(self, train_data, train_label):
        self.train_data = train_data
        self.train_label = train_label
        # shuffle the given training set
        data_size = self.train_data.shape[0]
        shuffled_indices = np.random.permutation(data_size)
        shuffled_data = train_data[shuffled_indices]
        shuffled_label = train_label[shuffled_indices]
        # valid_ratio is the fraction held out as the validation set;
        # split the data into a training part and a validation part
        valid_ratio = 0.2
        valid_set_size = int(data_size * valid_ratio)
        valid_data = shuffled_data[:valid_set_size]
        valid_label = shuffled_label[:valid_set_size]
        training_data = shuffled_data[valid_set_size:]
        training_label = shuffled_label[valid_set_size:]
        # evaluate the candidate values of K on the validation set
        record = {}
        if training_data.shape[0] < 20:
            k_number = training_data.shape[0]
        else:
            k_number = 20
        for k in range(1, k_number, 2):
            data_size = training_data.shape[0]
            predict_result = np.array([])
            for i in range(valid_data.shape[0]):
                # distances from the current validation point to every training point
                diff = np.tile(valid_data[i], (data_size, 1)) - training_data
                sqdiff = diff ** 2
                squareDist = np.sum(sqdiff, axis=1)
                dist = squareDist ** 0.5
                # indices that sort the distances in ascending order
                sorteddiffdist = np.argsort(dist)
                # count the class labels among the K nearest neighbours
                classCount = {}
                for j in range(k):
                    Label = training_label[sorteddiffdist[j]]
                    classCount[Label] = classCount.get(Label, 0) + 1
                # the most frequent class is the prediction for this point
                maxCount = 0
                for key, value in classCount.items():
                    if value > maxCount:
                        maxCount = value
                        result = key
                predict_result = np.append(predict_result, result)
            acc = np.mean(np.equal(predict_result, valid_label))
            record[k] = acc
        # keep the K with the highest validation accuracy
        maxCount = 0
        for key, value in record.items():
            if value > maxCount:
                maxCount = value
                k_result = key
        print("k =", k_result)
        self.k = k_result

    def predict(self, test_data):
        data_size = self.train_data.shape[0]
        predict_result = np.array([])
        for i in range(test_data.shape[0]):
            # distances from the current test point to every training point
            diff = np.tile(test_data[i], (data_size, 1)) - self.train_data
            sqdiff = diff ** 2
            squareDist = np.sum(sqdiff, axis=1)
            dist = squareDist ** 0.5
            # indices that sort the distances in ascending order
            sorteddiffdist = np.argsort(dist)
            # count the class labels among the K nearest neighbours
            classCount = {}
            for j in range(self.k):
                Label = self.train_label[sorteddiffdist[j]]
                classCount[Label] = classCount.get(Label, 0) + 1
            # the most frequent class is the prediction
            maxCount = 0
            for key, value in classCount.items():
                if value > maxCount:
                    maxCount = value
                    result = key
            predict_result = np.append(predict_result, result)
        return predict_result


def generate():
    # the commented-out alternatives below correspond to the modified datasets
    # used in section 4 of the README
    # class 1 (label 0): 800 points
    mean = (1, 2)
    cov = np.array([[73, 0], [0, 22]])
    x = np.random.multivariate_normal(mean, cov, (800,))
    # x = np.random.multivariate_normal(mean, cov, (1600,))

    # class 2 (label 1): 200 points
    mean = (16, -5)
    # mean = (30, -20)
    # mean = (15, 0)
    cov = np.array([[21.2, 0], [0, 32.1]])
    # cov = np.array([[73, 0], [0, 22]])
    y = np.random.multivariate_normal(mean, cov, (200,))
    # y = np.random.multivariate_normal(mean, cov, (400,))

    # class 3 (label 2): 1000 points
    mean = (10, 22)
    # mean = (10, 10)
    cov = np.array([[10, 5], [5, 10]])
    # cov = np.array([[73, 0], [0, 22]])
    z = np.random.multivariate_normal(mean, cov, (1000,))
    # z = np.random.multivariate_normal(mean, cov, (2000,))

    idx = np.arange(2000)
    # idx = np.arange(2800)
    np.random.shuffle(idx)
    data = np.concatenate([x, y, z])
    label = np.concatenate([
        np.zeros((800,), dtype=int),
        np.ones((200,), dtype=int),
        np.ones((1000,), dtype=int) * 2
    ])
    # label = np.concatenate([
    #     np.zeros((1600,), dtype=int),
    #     np.ones((200,), dtype=int),
    #     np.ones((1000,), dtype=int) * 2
    # ])
    # shuffle points and labels together before the 80% / 20% split,
    # as described in the README
    data = data[idx]
    label = label[idx]

    train_data, test_data = data[:1600, ], data[1600:, ]
    train_label, test_label = label[:1600, ], label[1600:, ]
    # train_data, test_data = data[:2240, ], data[2240:, ]
    # train_label, test_label = label[:2240, ], label[2240:, ]
    np.save("data.npy", (
        (train_data, train_label), (test_data, test_label)
    ))


def read():
    (train_data, train_label), (test_data, test_label) = np.load("data.npy", allow_pickle=True)
    return (train_data, train_label), (test_data, test_label)


def display(data, label, name):
    # group the points by label and draw one scatter per class
    datas = [[], [], []]
    for i in range(len(data)):
        datas[label[i]].append(data[i])

    for each in datas:
        each = np.array(each)
        plt.scatter(each[:, 0], each[:, 1])
    plt.savefig(f'./{name}')
    plt.show()

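
# --- Optional sketch (added for illustration; not used by the KNN class above) ---
# The per-point distance loops in fit()/predict() could be replaced by a single
# broadcasted computation. Assuming `queries` of shape (M, 2) and training data
# of shape (N, 2), this returns predicted labels for all M query points at once.
def knn_predict_vectorized(queries, train_data, train_label, k):
    # (M, N) matrix of Euclidean distances via broadcasting
    dists = np.sqrt(((queries[:, None, :] - train_data[None, :, :]) ** 2).sum(axis=2))
    # indices of the k nearest training points for every query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # majority vote over the neighbours' labels, one query at a time
    preds = np.empty(queries.shape[0], dtype=train_label.dtype)
    for i, idx in enumerate(nearest):
        labels, counts = np.unique(train_label[idx], return_counts=True)
        preds[i] = labels[np.argmax(counts)]
    return preds
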
"__main__": + if len(sys.argv) > 1 and sys.argv[1] == "g": + generate() + if len(sys.argv) > 1 and sys.argv[1] == "d": + (train_data, train_label), (test_data, test_label) = read() + display(train_data, train_label, 'train') + display(test_data, test_label, 'test') + else: + (train_data, train_label), (test_data, test_label) = read() + + model = KNN() + model.fit(train_data, train_label) + res = model.predict(test_data) + print("acc =", np.mean(np.equal(res, test_label))) + + +