diff --git a/assignment-2/submission/18307130116/README.md b/assignment-2/submission/18307130116/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..60d6a7aaf412e4f028a1124ff7cc63b243e2c2d7
--- /dev/null
+++ b/assignment-2/submission/18307130116/README.md
@@ -0,0 +1,160 @@
+# FNN Implementation
+
+[toc]
+
+## Model Implementation
+
+The implementation of each operator follows the [operator derivative derivations](##算子导数推导); the network structure is shown in the figure below (`img/model.png`).
+
+Following that structure, the operators are chained together in the order they are constructed, and during backpropagation the gradients are passed back layer by layer starting from the loss. There is nothing particularly difficult here; the resulting model computes the function
+
+$\log(\mathrm{softmax}(W_3\,\sigma(W_2\,\sigma(W_1X))))$
+
+## Model Training
+
+Running the provided `numpy_mnist.py` for three epochs gives the following accuracy, with the loss evolving as shown in the loss curve:
+
+| epoch | Accuracy |
+| ----- | -------- |
+| 0     | 94.49%   |
+| 1     | 96.47%   |
+| 2     | 96.58%   |
+
+### Effect of learning rate and number of epochs
+
+The loss drops to a certain range and then starts to oscillate, presumably because the learning rate is too large near the optimum. To improve performance I lowered the learning rate and increased the number of epochs, and also ran a comparison that only increases the number of epochs without changing the learning rate; the results are below, followed by a sketch of the schedule. The row labelled $5i$ reports the median accuracy over epochs $[5(i-1),\ 5i)$, and the row labelled 20 is the final result.
+
+| epoch | Accuracy (learning_rate = 0.1) | Accuracy (learning_rate = 0.05) | Accuracy (learning_rate = 0.1+0.05) |
+| ----- | ------------------------------ | ------------------------------- | ----------------------------------- |
+| 0     | 97.27%                         | 95.85%                          | 96.59%                              |
+| 5     | 97.93%                         | 97.85%                          | 97.91%                              |
+| 10    | 98.03%                         | 98.03%                          | 98.18%                              |
+| 15    | 98.12%                         | 98.09%                          | 98.18%                              |
+| 20    | 98.12%                         | 98.19%                          | 98.18%                              |
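+
+The `0.1+0.05` column uses a learning rate of 0.1 for the first ten epochs and 0.05 afterwards. A minimal sketch of this schedule in the training loop (the switch at epoch 10 matches the commented-out lines in `numpy_mnist.py`, and `model.optimize` is the plain SGD update in `numpy_fnn.py`):
+
+```python
+for epoch in range(20):
+    # lower the learning rate once the loss starts to oscillate
+    learning_rate = 0.1 if epoch < 10 else 0.05
+    for x, y in mini_batch(train_dataset):
+        y = one_hot(y)
+        y_pred = model.forward(x)
+        loss = numpy_loss.get_loss(y_pred, y)
+        model.backward(numpy_loss.backward())
+        model.optimize(learning_rate)
+```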
+
+
+
+
+
+
+## Optimizers
+
+### How Adam works
+
+Similar to the manual learning-rate adjustment in the experiments above, the Adam optimizer, one of the most commonly used optimizers, stands out for adapting the learning rate automatically and has essentially become the default optimizer for many optimization problems. That said, the choice of the initial learning rate still affects the optimization process.
+
+Adam's basic update rule is $\theta_t = \theta_{t-1}-\alpha\,\hat m_t/(\sqrt{\hat v_t}+\epsilon)$, where $\hat m_t$ estimates the first moment of the gradient with an exponential moving average, and the division by $1-\beta_1^t$ corrects the bias caused by initializing the moment to zero. With $g_t$ denoting the gradient, the formulas are
+
+$m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t,\qquad \hat m_t = m_t/(1-\beta_1^t)$
+
+The second moment is computed analogously: $v_t = \beta_2 v_{t-1}+(1-\beta_2)g_t^2,\qquad \hat v_t = v_t/(1-\beta_2^t)$
+
+$\epsilon$ is there to prevent division by zero.
+
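+The `Adam` class added in `numpy_mnist.py` implements exactly these formulas; as a minimal sketch, one update step for a single parameter looks like this (`adam_step` is just an illustrative helper, the actual class keeps `m`, `v`, `t` as member state):
+
+```python
+import numpy as np
+
+def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
+    # illustrative helper: one Adam step for a single parameter array
+    t += 1
+    m = beta1 * m + (1 - beta1) * grad            # first-moment EMA
+    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment EMA
+    m_hat = m / (1 - beta1 ** t)                  # bias correction
+    v_hat = v / (1 - beta2 ** t)
+    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
+    return theta, m, v, t
+```
+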
+### How Momentum works
+
+Momentum follows a similar idea to Adam, but it does not scale the learning rate by the second-moment (standard-deviation) estimate. It likewise keeps an exponentially weighted average of the gradients: the current gradient receives a relatively small weight, which smooths out the oscillation of the gradient near an optimum and lets the parameters settle closer to it.
+
+Its update rule is
+
+$v_t = \beta v_{t-1}+(1-\beta)dW$
+
+$W = W - \alpha v_t$
+
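+A corresponding sketch of one Momentum step (again just an illustrative helper; the `Momentum` class in `numpy_mnist.py` keeps `v` as member state):
+
+```python
+def momentum_step(W, dW, v, lr=0.1, beta=0.9):
+    # illustrative helper: exponentially weighted average of past gradients
+    v = beta * v + (1 - beta) * dW
+    W = W - lr * v
+    return W, v
+```
+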
+### Implementation
+
+With the formulas above, I implemented an `Adam` class and a `Momentum` class in `numpy_mnist.py`. Since `numpy_fnn.py` may not be modified, the overall approach is to create one optimizer instance per parameter: each instance stores the state from the previous iteration in its own member variables and returns the updated parameter. For example, `Momentum` is used like this:
+
+`model.W1 = W1_opt.optimize(model.W1, model.W1_grad)`
+
+That is, the new weights are computed and then assigned back to the model.
+
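+In the training loop of `numpy_mnist.py` this pattern becomes (slightly condensed):
+
+```python
+W1_opt, W2_opt, W3_opt = Momentum(), Momentum(), Momentum()
+
+# inside the mini-batch loop, after model.backward(...):
+model.W1 = W1_opt.optimize(model.W1, model.W1_grad)
+model.W2 = W2_opt.optimize(model.W2, model.W2_grad)
+model.W3 = W3_opt.optimize(model.W3, model.W3_grad)
+```
+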
+### Experimental comparison
+
+I compare the two optimizers against the best result obtained earlier, the `lr = 0.1+0.05` schedule; the loss and accuracy develop as follows:
+
+| epoch | Accuracy(learning_rate = 0.1+0.05) | Accuracy(Adam, $\alpha = 0.001$) | Accuracy(Momentum,$\alpha = 0.1$) |
+| ----- | ---------------------------------- | ---------------------------------- | --------------------------------- |
+| 0 | 96.59% | 97.46% | 97.01% |
+| 5 | 97.91% | 97.69% | 97.95% |
+| 10 | 98.18% | 97.80% | 98.07% |
+| 15 | 98.18% | 97.98% | 98.22% |
+| 20 | 98.18% | 98.04% | 98.36% |
+
+
+
+### Analysis
+
+Judging from the table and the loss curves, Momentum clearly outperforms the manual learning-rate schedule, while Adam does not even match a constant learning rate. After checking the algorithm against the original paper I ruled out an implementation error, searched for related material, and found the following passage:
+
+Source: [简单认识Adam](https://www.jianshu.com/p/aebcaf8af76e) ("Limitations and improvements of Adam")
+
+> Although Adam is currently a mainstream optimization algorithm, in many areas (e.g. object recognition in computer vision, machine translation in NLP) the best results are still obtained with SGD with momentum. The results of Wilson et al. show that on object recognition, character-level language modeling, and constituency parsing, adaptive learning-rate methods (AdaGrad, AdaDelta, RMSProp, Adam, etc.) generally perform worse than Momentum.
+
+According to this, the handwritten-digit recognition task in this experiment falls under object recognition, where adaptive learning-rate methods do indeed perform worse. Adam's strength is that it works well on non-stationary objectives, so the choice of optimizer should be weighed against the actual type of problem.
\ No newline at end of file
diff --git a/assignment-2/submission/18307130116/img/Adam.png b/assignment-2/submission/18307130116/img/Adam.png
new file mode 100644
index 0000000000000000000000000000000000000000..76c571e3ea0c18e00faf75a5f078350cb86a1159
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Adam.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_1.png b/assignment-2/submission/18307130116/img/Figure_1.png
new file mode 100644
index 0000000000000000000000000000000000000000..683414e2e126545f2a851da9a05be74eb5261b13
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_1.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_2.png b/assignment-2/submission/18307130116/img/Figure_2.png
new file mode 100644
index 0000000000000000000000000000000000000000..bef71ab36ae8d83504f84243e3d64082b8fcab5d
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_2.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_3.png b/assignment-2/submission/18307130116/img/Figure_3.png
new file mode 100644
index 0000000000000000000000000000000000000000..639051608449345a12b51083243e78dcfa6a4f70
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_3.png differ
diff --git a/assignment-2/submission/18307130116/img/Figure_4.png b/assignment-2/submission/18307130116/img/Figure_4.png
new file mode 100644
index 0000000000000000000000000000000000000000..fe141456a1e96e256569cdcb37a87e2d4b6f0e6b
Binary files /dev/null and b/assignment-2/submission/18307130116/img/Figure_4.png differ
diff --git a/assignment-2/submission/18307130116/img/matmul.png b/assignment-2/submission/18307130116/img/matmul.png
new file mode 100644
index 0000000000000000000000000000000000000000..e3e6d769ef44203d80817a2928a5b1ea2a533e06
Binary files /dev/null and b/assignment-2/submission/18307130116/img/matmul.png differ
diff --git a/assignment-2/submission/18307130116/img/model.png b/assignment-2/submission/18307130116/img/model.png
new file mode 100644
index 0000000000000000000000000000000000000000..72c73828f7d70be8ea8d3f010b27bc7ada0a4139
Binary files /dev/null and b/assignment-2/submission/18307130116/img/model.png differ
diff --git a/assignment-2/submission/18307130116/img/momentum.png b/assignment-2/submission/18307130116/img/momentum.png
new file mode 100644
index 0000000000000000000000000000000000000000..b9b0b145e362898c6a6cf5f379fe0459abb9fa28
Binary files /dev/null and b/assignment-2/submission/18307130116/img/momentum.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax1.png b/assignment-2/submission/18307130116/img/softmax1.png
new file mode 100644
index 0000000000000000000000000000000000000000..56c1a6c77141e66a1970dc8d7d66d00c891a74d2
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax1.png differ
diff --git a/assignment-2/submission/18307130116/img/softmax2.png b/assignment-2/submission/18307130116/img/softmax2.png
new file mode 100644
index 0000000000000000000000000000000000000000..277f06da303ed92389cc7620e89ee25bf5b1c7e1
Binary files /dev/null and b/assignment-2/submission/18307130116/img/softmax2.png differ
diff --git a/assignment-2/submission/18307130116/numpy_fnn.py b/assignment-2/submission/18307130116/numpy_fnn.py
new file mode 100644
index 0000000000000000000000000000000000000000..13397e1977d0b8bf530900861e08a2176816f780
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_fnn.py
@@ -0,0 +1,185 @@
+import numpy as np
+
+
+class NumpyOp:
+
+ def __init__(self):
+ self.memory = {}
+ self.epsilon = 1e-12
+
+
+class Matmul(NumpyOp):
+
+ def forward(self, x, W):
+ """
+ x: shape(N, d)
+ w: shape(d, d')
+ """
+ self.memory['x'] = x
+ self.memory['W'] = W
+ h = np.matmul(x, W)
+ return h
+
+ def backward(self, grad_y):
+ """
+ grad_y: shape(N, d')
+ """
+
+ ####################
+ # code 1 #
+ ####################
+        grad_x = np.matmul(grad_y, self.memory['W'].T)
+ grad_W = np.matmul(self.memory['x'].T, grad_y)
+
+ return grad_x, grad_W
+
+
+class Relu(NumpyOp):
+
+ def forward(self, x):
+ self.memory['x'] = x
+ return np.where(x > 0, x, np.zeros_like(x))
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 2 #
+ ####################
+ grad_x = np.where(self.memory['x'] > 0, grad_y, np.zeros_like(self.memory['x']))
+ return grad_x
+
+
+class Log(NumpyOp):
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+
+ out = np.log(x + self.epsilon)
+ self.memory['x'] = x
+
+ return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 3 #
+ ####################
+        grad_x = (1 / (self.memory['x'] + self.epsilon)) * grad_y
+
+ return grad_x
+
+class Softmax(NumpyOp):
+ """
+ softmax over last dimension
+ """
+
+ def forward(self, x):
+ """
+ x: shape(N, c)
+ """
+ self.memory['x'] = x
+ ####################
+ # code 4 #
+ ####################
+        exp = np.exp(self.memory['x'])
+        # multiplying by a (c, c) ones matrix puts the row sum of exp in every column
+        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
+        h = 1. / np.matmul(exp, one)
+        out = h * exp
+        return out
+
+ def backward(self, grad_y):
+ """
+ grad_y: same shape as x
+ """
+
+ ####################
+ # code 5 #
+ ####################
+        # with s = softmax(x): ds_j/dx_i = s_j * (delta_ij - s_i), so row-wise
+        # grad_x = s * grad_y - s * sum_j(grad_y_j * s_j)
+        exp = np.exp(self.memory['x'])
+        one = np.ones((self.memory['x'].shape[1], self.memory['x'].shape[1]))
+        h = 1. / np.matmul(exp, one)   # 1 / row sums of exp
+        h_grad = -h * h                # d(1/S)/dS = -1/S^2
+        grad_x = grad_y * exp * h + np.matmul(grad_y * exp * h_grad, one) * exp
+        return grad_x
+
+
+class NumpyLoss:
+
+ def __init__(self):
+ self.target = None
+
+ def get_loss(self, pred, target):
+ self.target = target
+ return (-pred * target).sum(axis=1).mean()
+
+ def backward(self):
+ return -self.target / self.target.shape[0]
+
+
+class NumpyModel:
+ def __init__(self):
+ self.W1 = np.random.normal(size=(28 * 28, 256))
+ self.W2 = np.random.normal(size=(256, 64))
+ self.W3 = np.random.normal(size=(64, 10))
+
+
+ # 以下算子会在 forward 和 backward 中使用
+ self.matmul_1 = Matmul()
+ self.relu_1 = Relu()
+ self.matmul_2 = Matmul()
+ self.relu_2 = Relu()
+ self.matmul_3 = Matmul()
+ self.softmax = Softmax()
+ self.log = Log()
+
+ # 以下变量需要在 backward 中更新
+ self.x1_grad, self.W1_grad = None, None
+ self.relu_1_grad = None
+ self.x2_grad, self.W2_grad = None, None
+ self.relu_2_grad = None
+ self.x3_grad, self.W3_grad = None, None
+ self.softmax_grad = None
+ self.log_grad = None
+
+
+ def forward(self, x):
+ x = x.reshape(-1, 28 * 28)
+
+ ####################
+ # code 6 #
+ ####################
+ x = self.matmul_1.forward(x, self.W1)
+ x = self.relu_1.forward(x)
+ x = self.matmul_2.forward(x, self.W2)
+ x = self.relu_2.forward(x)
+ x = self.matmul_3.forward(x, self.W3)
+ x = self.softmax.forward(x)
+ x = self.log.forward(x)
+ return x
+
+ def backward(self, y):
+ ####################
+ # code 7 #
+ ####################
+ self.log_grad = self.log.backward(y)
+ self.softmax_grad = self.softmax.backward(self.log_grad)
+ self.x3_grad, self.W3_grad = self.matmul_3.backward(self.softmax_grad)
+ self.relu_2_grad = self.relu_2.backward(self.x3_grad)
+ self.x2_grad, self.W2_grad = self.matmul_2.backward(self.relu_2_grad)
+ self.relu_1_grad = self.relu_1.backward(self.x2_grad)
+ self.x1_grad, self.W1_grad = self.matmul_1.backward(self.relu_1_grad)
+
+
+
+ def optimize(self, learning_rate):
+ self.W1 -= learning_rate * self.W1_grad
+ self.W2 -= learning_rate * self.W2_grad
+ self.W3 -= learning_rate * self.W3_grad
diff --git a/assignment-2/submission/18307130116/numpy_mnist.py b/assignment-2/submission/18307130116/numpy_mnist.py
new file mode 100644
index 0000000000000000000000000000000000000000..dc5fdaa3b169f4a5ec77458993318b1b875ac400
--- /dev/null
+++ b/assignment-2/submission/18307130116/numpy_mnist.py
@@ -0,0 +1,97 @@
+import numpy as np
+from numpy_fnn import NumpyModel, NumpyLoss
+from utils import download_mnist, batch, get_torch_initialization, plot_curve, one_hot
+
+def mini_batch(dataset, batch_size=128, numpy=False, drop_last=False):
+ data = []
+ label = []
+ dataset_num = dataset.__len__()
+ idx = np.arange(dataset_num)
+ np.random.shuffle(idx)
+ for each in dataset:
+ data.append(each[0].numpy())
+ label.append(each[1])
+ label_numpy = np.array(label)[idx]
+ data_numpy = np.array(data)[idx]
+
+ result = []
+    for i in range(dataset_num // batch_size):
+        result.append((data_numpy[i * batch_size:(i + 1) * batch_size], label_numpy[i * batch_size:(i + 1) * batch_size]))
+    if not drop_last and dataset_num % batch_size != 0:
+        # append the remainder only when it is non-empty
+        result.append((data_numpy[(dataset_num // batch_size) * batch_size:], label_numpy[(dataset_num // batch_size) * batch_size:]))
+ return result
+
+class Adam:
+ def __init__(self, weight, lr=0.0015, beta1=0.9, beta2=0.999, epsilon=1e-8):
+ self.theta = weight
+ self.lr = lr
+ self.beta1 = beta1
+ self.beta2 = beta2
+        self.epsilon = epsilon
+ self.m = 0
+ self.v = 0
+ self.t = 0
+
+ def optimize(self, grad):
+ self.t += 1
+ self.m = self.beta1 * self.m + (1 - self.beta1) * grad
+ self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
+ self.m_hat = self.m / (1 - self.beta1 ** self.t)
+ self.v_hat = self.v / (1 - self.beta2 ** self.t)
+        self.theta -= self.lr * self.m_hat / (self.v_hat ** 0.5 + self.epsilon)
+ return self.theta
+
+class Momentum:
+ def __init__(self, lr=0.1, beta=0.9):
+ self.lr = lr
+ self.beta = beta
+ self.v = 0
+
+ def optimize(self, weight, grad):
+ self.v = self.beta*self.v + (1-self.beta)*grad
+ weight -= self.lr * self.v
+ return weight
+
+def numpy_run():
+ train_dataset, test_dataset = download_mnist()
+
+ model = NumpyModel()
+ numpy_loss = NumpyLoss()
+ model.W1, model.W2, model.W3 = get_torch_initialization()
+ W1_opt = Momentum()
+ W2_opt = Momentum()
+ W3_opt = Momentum()
+
+
+ train_loss = []
+
+ epoch_number = 20
+
+ for epoch in range(epoch_number):
+ for x, y in mini_batch(train_dataset):
+ y = one_hot(y)
+
+ y_pred = model.forward(x)
+ loss = numpy_loss.get_loss(y_pred, y)
+
+ model.backward(numpy_loss.backward())
+ # if epoch >= 10:
+ # learning_rate = 0.05
+ # else:
+ # learning_rate = 0.1
+ # model.optimize(learning_rate)
+ model.W1 = W1_opt.optimize(model.W1, model.W1_grad)
+ model.W2 = W2_opt.optimize(model.W2, model.W2_grad)
+ model.W3 = W3_opt.optimize(model.W3, model.W3_grad)
+
+ train_loss.append(loss.item())
+
+ x, y = batch(test_dataset)[0]
+ accuracy = np.mean((model.forward(x).argmax(axis=1) == y))
+ print('[{}] Accuracy: {:.4f}'.format(epoch, accuracy))
+
+ plot_curve(train_loss)
+
+
+if __name__ == "__main__":
+ numpy_run()