diff --git a/assignment-3/submission/18300110042/README.md b/assignment-3/submission/18300110042/README.md new file mode 100644 index 0000000000000000000000000000000000000000..7ace1830b714f8e68d4000bd00a9ea18464a28dd --- /dev/null +++ b/assignment-3/submission/18300110042/README.md @@ -0,0 +1,619 @@ +# 课程报告 +这是`prml-21-spring/assignment-3` 的课程报告,我的代码在 [source.py](source.py) 中。 + +这次报告的主要内容如下: + +1. 聚类算法 + - k-means + - GMM (Gaussian Mixture Model) + +2. 基础实验(给定聚簇数量) + - 实验设计 + - 实验结果 + +3. 自动选择聚簇数量的实验 + - 算法 + - 实验设计 + - 实验结果 + + +## 聚类算法 +本次报告主要围绕两种聚类算法展开: `k-means` 和 `GMM (Gaussian Mixture Model)` 。 + +以下分别介绍实验使用的算法。 + +### k-means +#### 算法简介 +`k-means` 是一种无监督的聚类算法,每一类(簇,cluster)以中心点为典型成员,通过迭代的方法,寻找 `cost function` 的局部最小值并收敛(后文 `自动选择聚簇数量-算法-Gap Statistic` 处有进一步说明)。一般使用欧氏距离(`Euclidean Distance`)作为衡量差异的标准,即不同的点到簇中心的距离。 + + +k_means_diagram + +如图为 `k-means` 算法的流程图。 + +记样本数据量为 $N$ , `k-means` 算法的步骤为: +1. 给定 $K$ 个聚簇 $\mathcal{C_k}$,随机初始化对应的簇中心 $c_k, \space k = 1, \cdots, K$ ; + +2. 分别计算每个簇中心和每个样本点 $x_i, \space i = 1, \cdots, N$ 的欧式距离的平方,即 $$D = d^2 = ||x_i - c_j||^2;$$ + +3. 把每个样本 $x_i$ 分配到距离最近的中心点 $c_k$ 所属的簇 $\mathcal{C_k}$ ; + +4. 通过以下公式更新簇中心(记 $N_k = |\mathcal{C_k}|$ 为 $\mathcal{C_k}$ 中元素个数): $$c_k = \frac{\sum_{x_i \in\mathcal{C_k}} x_i}{N_k} ;$$ + +5. 重复以上 `3` 、 `4` 步,直到满足收敛准则,即经过一次迭代后,聚簇中心的变化小于特定的 `threshold` (一个很小的正数 $\epsilon$ )。 + +记 $$t = [\sum_{j, j+1} abs(c_{j, j+1})^2]^{\frac{1}{2}} ,$$ + +其中 $j$ 指前一次迭代的次数,当 $t < \epsilon$ 时,判断模型收敛。 + +`k-means` 算法相对比较简单,复杂度较低;但执行的是硬分类算法(hard clustering algorithm),灵活度较低,在实行过程中由于数据分布、数据量等原因,可能出现空簇(empty cluster)的情况,需要解决;对随机初始化的簇中心敏感,不同的初始值会导致最终收敛的结果有较大差异;适合分类的数据分布有限;对噪声、离群点敏感。 + +#### 优化 +为了部分解决以上 `k-means` 的一些问题,我们对算法进行了一定的优化: + +- 首先,对于第 `1` 步,初始化簇中心,除了 `random` 的方法外,增加了 `k-means++` 的方法,改善簇中心的初始化(参考 [ [Arthur & Vassilvitskii, 2007](https://courses.cs.duke.edu/cps296.2/spring07/papers/kMeansPlusPlus.pdf "k-means++: The Advantages of Careful Seeding") ] )。 + +`k-means++` 的步骤为:给定聚簇数量 $K$ , +1)随机选取一个簇中心 $c_1$ ; + +2)根据样本点到已确定的簇中心的距离(欧式距离的平方)计算概率分布,而后根据概率分布抽取下一个簇中心,直到抽取完 $K$ 个簇中心; + +具体的,记已确定的簇中心为 $k$ 个, + +当 $k=1$ 时,分别计算所有样本点 $x_i$ 到 $c_k$ 的欧式距离的平方($k=1$),即 $d_{i}^2 = ||x_i - c_1||^2, \space i = 1, \cdots, N$ ,得到不同样本点成为下一个簇中心的概率,即 $p(c_{k+1} = x_i) = \frac{d_{i}^2}{\sum_{i=1}^{N}{d_{i}^2}}$ ,根据这个概率分布抽取下一个簇中心; + +当 $k > 1$ 时,分别计算所有样本到已确定的 $k$ 个簇中心的欧式距离的平方,而后令 $d_{i}^2 = \min\{d_{ik}^2\} = \min\{||x_i - c_k||^2\}, \space k = 1,2, \cdots$ ,得到不同样本点成为下一个簇中心的概率,即 $p(c_{k+1} = x_i) = \frac{d_{i}^2}{\sum_{i=1}^{N}{d_{i}^2}}$ ,根据该概率分布抽取下一个簇中心。 + +当 $k = K$ 时,簇中心的初始化完成。 + + +- 其次,对于 `k-means` 运算过程中出现空簇的问题,在样本量最大的簇中挑选离中心点距离最大的点,作为新的簇中心。 + + +### GMM (Gaussian Mixture Model) +#### 算法简介 +相比 `k-means` 算法, `GMM` 是一种软分类算法(soft clustering algorithm),混合模型的每一个子模型可以看作一个有两个参数(均值 `mean` 和(协)方差 `(co)variance`)的高斯生成模型。 + +`GMM` 模型通过分别估计概率分布参数(均值、(协)方差)进行分类。模型学习参数的方法是 `EM算法` 。 + +GMM_diagram + +如图为 `GMM` 算法的流程图。 + +关于 `GMM` 的算法步骤,首先进行如下定义: + +记样本数据量为 $N$ , $x_i, \space i = 1, \cdots, N$ 为不同的样本点; + +$K$ 为混合模型中子高斯模型的数量,不同的子模型分别记为 $\mathcal{C_k}, \space k = 1, \cdots, K$ ; + +对于整个模型,其参数可以记为 $\theta = (\boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\alpha})$ ,其中 $\boldsymbol{\mu}, \boldsymbol{\Sigma}$ 包含了每个子模型的两个参数(均值和协方差); $\boldsymbol{\alpha}$ 指每个子模型在混合模型中发生的概率。 + +对于子模型 $\mathcal{C_k}$ ,其参数 $\theta_k = (\mu_k, \Sigma_k)$ , + +其中, $\mu_k, \Sigma_k$ 即对应高斯分布 $\mathcal{N}(\mu_k, \Sigma_k)$ 的两个参数,当分类数据为一维时, $\Sigma_k$ 即 $\sigma_k^2$ ; + +记 $\alpha_k$ 为子模型 $\mathcal{C_k}$ 发生的概率,即样本数据属于子模型 $\mathcal{C_k}$ 的(先验)概率, $\alpha_k \ge 0, \space \sum_{k=1}^{K} {\alpha_k} = 1$ ; + + +记 
$\phi(x_i|\theta_k)$ 为 $\mathcal{C_k}$ 的高斯概率密度函数(Probability Density Function, `PDF` ), + +当样本数据为一维数据时, +$$\phi(x_i|\theta_k) = \frac{1}{\sqrt{2\pi} \sigma} \exp{ \{- \frac{(x_i - \mu)^2}{2 \sigma^2} \} }.$$ + +当样本数据为多维数据时,记数据维度为 $D$ , + +$$\phi(x_i|\theta_k) = \frac{1}{{(2\pi)}^{\frac{D}{2}} |\Sigma|^{\frac{1}{2}}} \exp{\{- \frac{(x_i - \mu)^T \Sigma^{-1} (x_i - \mu)}{2} \}} ;$$ + + +`GMM` 算法的步骤为: +1. 初始化 $K$ 个子模型 $\mathcal{C_k}$ 的参数, $k = 1, \cdots, K$ ,即均值 $\mu_k$ ,方差 $\sigma_k^2$ (协方差矩阵 $\Sigma_k$ ),以及先验概率密度函数 $\alpha_k$ ; + +其中, + +1)对 $\mu_k$ 的初始化方法分为两种,(a)随机取 $K$ 个样本作为均值,具体的,将样本数据顺序随机 `shuffle` ,然后取前 $K$ 个;(b)通过 `k-means` 计算得出 $K$ 个簇中心分别作为 $\mu_k, \space k = 1, \cdots, K$ ; + +2)对 $\Sigma_k$ 的初始化,由初始化的 $\mu_k$ ,按 `k-means` 相同的方法,将样本分入距离最近的 $\mu_k$ 所在的簇 $\mathcal{C_k}$ ,而后根据该簇的样本估计 $\Sigma_k$ ; + +3)对 $\alpha_k$ 的初始化, $\alpha_k = \frac{N_k}{N}$ ,其中 $N_k$ 即属于 $\mathcal{C_k}$ 的样本数量。 + + +2. 对于每个样本 $x_i$ 以及每个子模型 $\mathcal{C_k}$,计算后验分布 $\gamma_{ik} = p(\mathcal{C_k}|x_i)$ ,即 $x_i \in \mathcal{C_k}$ 的概率 (`E-step`): + +$$\begin{align} \gamma_{ik} &= p(\mathcal{C_k}|x_i) \\\\ &= \frac{p(x_i|\mathcal{C_k}) \cdot p(\mathcal{C_k})}{p(x_i)} \\\\ &= \frac{\phi(x_i|\theta_k) \cdot \alpha_k}{\sum_{k=1}^{K}{\phi(x_i|\theta_k) \cdot \alpha_k }} ; \end{align}$$ + +3. 通过最大似然估计更新模型参数 $\theta_k, \space k = 1, \cdots, K$ (`M-step`): +$$\begin{align} \alpha_k &= \frac{\sum_{i=1}^{N}{\gamma_{ik}}}{N}; \\\\ \mu_k &= \frac{\sum_{i=1}^{N}{\gamma_{ik}} \cdot x_i}{\sum_{i=1}^{N}{\gamma_{ik}}}; \\\\\Sigma_k &= \frac{\sum_{i=1}^{N}{\gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}}{\sum_{i=1}^{N}{\gamma_{ik}}}; \end{align}$$ + +4. 重复以上 `3` 、 `4` (即 `E-step` 和 `M-step` ),直到满足收敛条件,即定义一个很小的正数 $\epsilon$ 作为 `threshold` ,当 $||\theta_{j+1} - \theta_{j}|| < \epsilon$ 时(其中 $j$ 指前一次迭代的次数),即经过一次迭代时参数变化非常小,判断模型收敛。 + +`GMM` 是一种软分类方法,相比 `k-means` 更加灵活,能够适应更多不同的数据分布,进行分类;受噪声、离群点影响较小;对初始值相对不敏感;但算法复杂度较高,分类较慢。 + +#### 优化 +`GMM` 算法相对稳定,没有进行很多调整。但当(某一个簇的)数据量不足时,迭代过程中可能出现协方差矩阵不满秩,无法求逆的情况,实际上,这也和模型过拟合有关。针对这一点需要有进一步的处理。 + +由 $\Sigma_k = \frac{\sum_{i=1}^{N}{\gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T}}{\sum_{i=1}^{N}{\gamma_{ik}}}$ 的公式可以看出, $\Sigma_k$ 的秩不会超过 $N_k$ ,即属于第 $k$ 个簇的样本数据量。记 $\Sigma_k$ 的维度,即数据维度为 $D$ ,当 $N_k < D$ ,即数据维度大于数据量时, $\Sigma_k$ 就一定会变成奇异矩阵;当然,只要 $N_k$ 相对较小, $D$ 相对较大时,就有可能出现 $\Sigma_k$ 不满秩的情况。 + +从抽象的角度看,对于对称阵 $\Sigma_k$ ,需要估计的参数有 $\frac{D(D+1)}{2}$ 个,当 $D$ 比较大的时候,需要估计的参数量也会很大,也就是说,模型复杂度很高。当训练数据不足时,无法确定最优的模型,即出现了过拟合现象。在没有更多数据的情况下,解决过拟合的一个常见方法就是通过正则化降低模型复杂度,或者说,加入某种先验分布。 + +我们实现的算法中,用 $\hat{\Sigma}_k = \Sigma_k + \lambda I$ 代替了最大似然估计的 $\Sigma_k$ 。 + +因为 $\Sigma_k$ 一定是半正定矩阵,对于 $\forall v \ne 0$ , +$$v^T \hat{\Sigma}_k v =v^T (\Sigma_k + \lambda I) v = v^T \Sigma_k v + \lambda v^T I v \ge 0 + \lambda ||v||^2 > 0,$$ + +即 $\hat{\Sigma}_k$ 一定是正定(满秩)的。 + +这个方法的合理性可以从贝叶斯推断的角度得出,通过最大后验( `MAP` , Maximum a posteriori) 的方法估计协方差矩阵 $\Sigma$ ,即对 $\Sigma$ 加入先验的影响(参考 [Regularized Gaussian Covariance Estimation](https://freemind.pluskid.org/machine-learning/regularized-gaussian-covariance-estimation/ "Regularized Gaussian Covariance Estimation") )。 + +另外, `GMM` 算法初始化 $\Sigma_k$ 时,可能会出现某一个子模型的样本只有一个的情况,此时无法计算 $\Sigma_k$ (都为 `Nan` ),对此,我们先复制该样本,而后根据以上方法对 $\Sigma_k$ 进行修正。 + +最后,`k-means` 和 `GMM` 算法都无法避免维度灾难(Curse of Dimensionality);都需要手动设置聚簇(子模型)的数量,除非和别的算法结合。以及, `EM算法` 虽然保证收敛,但得到的依然是局部最小值。 + +## 基础实验 + +### 实验设计 + +之前提到 `k-means` 虽然保证收敛,但有一些缺陷,一是算法本身依赖初始簇中心设置,不同的簇中心会导致最终收敛的结果不同,缺乏一致性;二是关于适用的数据方面,对大小不同/密度不同;不同分布的数据适应情况不同;对离群点敏感;以及维度灾难问题。 + +在实验中,我们希望探究 `k-means` 的这些缺点,可能的解决方案,如 `k-means++` ,以及 `GMM` 相对 `k-means` 的优势。 + +- 对于初始簇中心的设置,探究随机设置的初始值和 
`k-means++` 最终得到的分类结果有什么不同;以及不同的初始参数对 `GMM` 算法的影响(主要是随机设置的均值 $\boldsymbol{\mu}$ 和通过 `k-means` 设置的均值 $\boldsymbol{\mu}$ )。 + +- 关于数据大小/密度的影响,探究 `k-means` 和 `GMM` 算法在类别之间平衡/不平衡的数据集上的表现。 + +- 关于数据分布的影响,设置不同的数据分布,探究 `k-means` 和 `GMM` 算法对于不同分布的数据的分类能力。包括不同组数据服从相同/不同的分布;协方差矩阵为对角阵(对角阵的值相同/不同);协方差矩阵为一般的(半)正定矩阵的高斯分布;均匀分布等。 + +- 关于离群点敏感,由于如何判断某个样本是离群点还是新的类别本身就依赖领域知识的判断;对离群点的处理方式一般包括去除(`remove`)或裁剪(`clip`),也需要领域知识,在本次实验中不进行处理。 + +- 关于维度灾难问题,一般采用 `PCA` 等降维方式或通过别的方法提高分类算法效率,实验中不进行处理。 + +关于聚类的评价, + +- 对于 `k-means` 算法,采用 +$$W_K = \sum_{k=1}^{K} {\sum_{x_i \in \mathcal{C_k}} ||x_i - \mu_k||^2}$$ +作为 `loss function` ,评价结果。 + +- 对于 `GMM` 算法,采用 `negative log likelihood` ,即 $- \log \hat{L}$ 作为 `loss function` ,评价结果。 + +### 实验结果 + +#### 关于初始簇中心(均值 $\boldsymbol{\mu}$ )的设置 + +为了避免其他因素的影响,选择样本量平衡的数据集,通过高斯分布生成数据进行实验。 + +a. 根据以下高斯分布生成三组数据: + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} 1 & 2 \end{bmatrix}$ | $\begin{bmatrix} 70 & 0 \\\\ 0 & 22 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 16 & -5 \end{bmatrix}$ | $\begin{bmatrix} 21 & 0 \\\\ 0 & 32 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 22 \end{bmatrix}$ | $\begin{bmatrix} 10 & 5 \\\\ 5 & 10 \end{bmatrix}$ | $500$ | + +以下为随机设置初始簇中心的 `k-means` 算法的结果: + +lab1_1 + +lab1_2 + +> lab_number = 1 + + +以下为使用 `k-means++` 设置初始簇中心的 `k-means` 算法结果: + +lab2_1 + +lab2_2 + +> lab_number = 2 + + + +以下为随机设置初始均值的 `GMM` 算法的结果: + +lab3_1 + +lab3_2 + +> lab_number = 3 + + +以下为根据 `k-means` 算法设置初始均值的 `GMM` 算法的结果: + +lab4_1 + +lab4_2 + +> lab_number = 4 + + +对于实验中的高斯分布, `k-means` 和 `GMM` 算法最终聚类结果受初始值的影响并不是很大,但更好的初始值,即 `k-means` 中的 `k-means++` ,以及 `GMM` 中利用 `k-means` 设置初始均值 $\boldsymbol{\mu}$ 可以加快算法的收敛。 + +以后的实验中,对于 `k-means` 使用 `k-means++` 进行初始化; `GMM` 利用 `k-means` 算法初始化。 + +#### 关于数据大小/密度的影响 + +对平衡和不平衡的数据集的分类结果进行比较。 + +a. 根据以下高斯分布生成三组数据,控制类别数量。 + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} 1 & 5 \end{bmatrix}$ | $\begin{bmatrix} 70 & 0 \\\\ 0 & 70 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 15 & 10\end{bmatrix}$ | $\begin{bmatrix} 30 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 50 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | + +以下为样本数据平衡时 `k-means` 算法的结果: + +lab5_1 + +lab5_2 + +> lab_number = 5 + +以下为样本数据平衡时 `GMM` 算法的结果: + +lab6_1 + +lab6_2 + +> lab_number = 6 + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} -10 & 50 \end{bmatrix}$ | $\begin{bmatrix} 50 & 25 \\\\ 25 & 40 \end{bmatrix}$ | $1000$ | +| $2$ | $\begin{bmatrix} 30 & 10\end{bmatrix}$ | $\begin{bmatrix} 30 & 20 \\\\ 20 & 60 \end{bmatrix}$ |$400$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $100$ | + +以下为样本数据不平衡时 `k-means` 算法的结果: + +lab7_1 + +lab7_2 + +> lab_number = 7 + +以下为样本数据不平衡时 `GMM` 算法的结果: + +lab8_1 + +lab8_2 + +> lab_number = 8 + +可以看出,`k-means` 算法对于不平衡的数据集聚类效果较差,而 `GMM` 可以更好的解决这个问题。 + +#### 关于数据分布的影响 + +##### 高斯分布 + +首先,探究两种算法对于高斯分布的聚类效果。 + +高斯分布产生的数据按协方差矩阵可以分为三类: `spherical` , `diagonal` , `full` 。 + +a. `spherical` ,即不同簇类有自己的方差。 + +根据以下高斯分布生成三组数据,每组 $500$ 个样本,共 $1500$ 个: + +| Cluster No. 
| $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} 1 & 30 \end{bmatrix}$ | $\begin{bmatrix} 50 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 30 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 50 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | + + +lab9_1 + +lab9_2 + +> lab_number = 9 + + +lab10_1 + +lab10_2 + +> lab_number = 10 + + +b. `diagonal` ,即不同簇类的协方差均为对角阵。 + +根据以下高斯分布生成三组数据: + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} 1 & 30 \end{bmatrix}$ | $\begin{bmatrix} 30 & 0 \\\\ 0 & 20 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 15 & 10 \end{bmatrix}$ | $\begin{bmatrix} 60 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 70 & 0 \\\\ 0 & 30 \end{bmatrix}$ | $500$ | + + +lab11_1 + +lab11_2 + +> lab_number = 11 + +lab12_1 + +lab12_2 + +> lab_number = 12 + + + +c. `full` ,即不同簇类的协方差矩阵中的( $\frac{D(D+1)}{2}$ 个)参数都为自由设定的( $D$ 为样本数据维度)。 + +根据以下高斯分布生成三组数据: + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} 1 & 30 \end{bmatrix}$ | $\begin{bmatrix} 50 & 25 \\\\ 25 & 40 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 15 & 10\end{bmatrix}$ | $\begin{bmatrix} 30 & -20 \\\\ -20 & 60 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & -15 \\\\ -15 & 50 \end{bmatrix}$ | $500$ | + +lab13_1 + +lab13_2 + +> lab_number = 13 + +lab14_1 + +lab14_2 + +> lab_number = 14 + + +##### 其他分布 + +均匀分布 + +| Cluster No. | Distribution | Number of samples | +| :------: | :---------------------------------: | :-------: | +| $1$ | $f(x,y) = \begin{cases} \frac{1}{20}, && 0 \le x \le 20, \space y = 5 \\\\ 0, && \mathrm{else} \end{cases}$ | $500$ | +| $2$ | $f(x,y) = \begin{cases} \frac{1}{25}, && 3 \le x \le 28, \space y = 2 \\\\ 0, && \mathrm{else} \end{cases}$ | $500$ | + + +lab15_1 + +lab15_2 + +> lab_number = 15 + + +lab16_1 + +lab16_2 + +> lab_number = 16 + + + +| Cluster No. 
| Distribution | Number of samples | +| :------: | :---------------------------------: | :-------: | +| $1$ | $X \sim U(0, \pi) \\\\ Y = \sin(X) + 3$ | $500$ | +| $2$ | $X \sim U(0, \pi) \\\\ Y = 3 \sin(X) - 2$ | $500$ | + + + +lab17_1 + +lab17_2 + +> lab_number = 17 + + +lab18_1 + +lab18_2 + +> lab_number = 18 + + +对于以上不服从高斯分布的,不同数据类别之间的“距离”对于聚类的准确性影响更大,当该“距离”较大时, `k-means` 算法用 `k-means++` 初始化后都不需要进一步迭代; `GMM` 算法的迭代次数也很少。 + +### 小结 +- `k-means` 算法的优点是算法复杂度较低,效率较高;缺点是难以处理不平衡数据集;适应的数据分布没有 `GMM` 算法多;以及初始值的设置对 `k-means` 比较重要。 + +- `GMM` 算法复杂度较高;解决了部分 `k-means` 的问题,包括处理不平衡数据集、适应更多不同的数据分布;但是在实验中发现,(相比原始类别数据,) `GMM` 的分类结果并不一定比 `k-means` 好。 + +- `k-means` 聚类边界明显(硬分类);`GMM` 的聚类边界较为平滑(软分类)。 + +- 对于重合度较高的分布,`k-means` 和 `GMM` 基本都不会有很好的聚类效果,在实际使用中,需要根据具体情况,考虑增大不同组之间的距离(如优化特征提取)。 + +- 在实验中设置的数据分布没有体现出不同的初始值对聚类算法结果的很大影响,但更好的初始值 (如 `k-means++` 和 `GMM` 中利用 `k-means` 设置初始均值)能够提高收敛的效率。 + + +## 自动选择聚簇数量的实验 +以上部分的实验都事先给定了聚簇数量,而实际情况中可能不能手动确定聚簇数量,需要自动选择聚簇数量的方法。这可以看作对以上聚类算法的进一步优化。 + +### 算法 + +#### Gap statistic +直觉上,聚类算法的目标是最大化类内的相似度,同时使得类间的相似度最低(最大化类间的 dissimilarity )。因而最优的聚簇是有一定主观性的,依赖于衡量“相似度”的方法,以及需要的分类的精细度。一种常用的衡量“相似度”的方法就是欧式距离。 + +根据欧氏距离,可以通过 + +$$D_k = \sum_{x_i \in \mathcal{C_k}} \sum_{x_j \in \mathcal{C_k}} ||x_i - x_j||^2 = 2 N_k \sum_{x_i \in \mathcal{C_k}} ||x_i - \mu_k||^2$$ + +衡量某一聚簇 $\mathcal{C_k}$ 内部的总距离,其中, $N_k = |\mathcal{C_k}|$ 为 $\mathcal{C_k}$ 内部元素总个数, $\mu_k$ 为簇均值, $x_i, \space x_j$ 为簇内样本。 + +令 +$$W_K = \sum_{k=1}^{K} {\frac{1}{2 N_k} D_k} ,$$ + +通过加入标准项 $\frac{1}{2 N_k}$ ,而后计算所有聚簇内部总距离之和,可以用 $W_K$ 衡量聚类的 `compactness` , $W_K$ 越小,聚簇结果越好。( `k-means` 算法中,给定聚簇数量 $K$ ,可以把 $W_K$ 看作最小化的目标,即 `cost function`。) + +由于 +$$\begin{align} W_K &= \sum_{k=1}^{K} {\frac{1}{2 N_k} D_k} \\\\ &= \sum_{k=1}^{K} {\frac{1}{2 N_k} \cdot (2 N_k \sum_{x_i \in \mathcal{C_k}} ||x_i - \mu_k||^2)} \\\\ &= \sum_{k=1}^{K} {\sum_{x_i \in \mathcal{C_k}} ||x_i - \mu_k||^2} , \end{align}$$ +具体实现中,直接通过 $W_K = \sum_{k=1}^{K} {\sum_{x_i \in \mathcal{C_k}} ||x_i - \mu_k||^2}$ 计算 $W_K$ 。 + +`Gap statistic` 方法通过将一个合适的零分布作为参考(`an appropriate reference null distribution`),比较聚类算法得出的结果,选择最优的聚簇数量 `K` ,即相对的 $W_K$ 最小。 + +具体的,构造 `gap statstic` ,选择使之最大的 `K` 作为聚簇数量。 + +`gap statistic` 的定义如下: + +$$\text{Gap}_{N}(K) = E\_{N}^{*}\\{\log{(W_K)}\\} - \log{(W_K)};$$ + +其中 $E_{N}^{*}$ 是指零分布下的,样本数为 $N$ 的期望。(可以看出,论文中提到的方法并没有对聚类算法以及零分布的选择有任何限制。) + +记最佳聚簇数量为 $K_b$ ,当选择的聚簇数量 $K \le K_b$ 时, $\log{(W_K)}$ 的下降速率会大于其期望的下降速率;当 $K > K_b$ 时,实际上是增加了一个不必要的簇中心, $\log{(W_K)}$ 的下降速率会小于其期望,因而,当 $K = K_b$ 时, `gap statistic` 会达到最大值。 + +具体实现的过程中, + +- 关于作为参考的零分布(`reference distribution`),选择了论文中较为简单的方法,即根据样本数据每个特征维度范围内的均匀分布产生数据,即 $f\_{d}^* \sim U(\min(f\_{d}), \max(f\_d))$ ,其中 $f\_d$ 指 $d$ 维度上的特征值; $f\_{d}^{*}$ 为根据参考分布产生的模拟数据。 + +- 关于对 $E\_{N}^{*} \\{\log(W\_K)\\}$ 的估计,采用 `Monte Carlo` 算法,根据参考分布产生 $N$ 个样本数据 $X\_{1}^{\*}, \cdots, X_{N}^{\*}$ ,计算 $\log(W\_{K}^{\*})$ ,重复 $B$ 次,取平均值(具体的,在实验中令 $B = 10$ )。 + +- 另外,还要考虑模拟产生的误差,记 $B$ 次 `Mote Carlo` 模拟产生的 $\log(W\_{K}^{*})$ 的标准差为 $\text{sd}(K)$ ,最后考虑的误差则为 +$$s_K = \text{sd}(K) \sqrt{1 + \frac{1}{B}} .$$ + + +算法步骤: + +1. 根据给定算法以及聚簇数量 $K$ 分别对样本数据进行聚类($K = 1, 2, \cdots, K_{max}$),并且计算 $W_K$ ; + +2. 根据参考分布生成 $B$ 组数据,根据每组数据计算 $W_{Kb}^{*}, \space b = 1,2,\cdots, B, \space K = 1,2,\cdots,K_{max}.$ 而后计算 `gap statistic` : + +$$\text{Gap}(K) = \frac{1}{B} \sum_{b=1}^{B} {\log{(W_{Kb}^{*})}} - \log{(W_K)} .$$ + +3. 
令 $\bar{l} = \frac{1}{B} \sum_{b=1}^{B} {\log{(W_{Kb}^{*})}}$ ,计算标准差 + +$$\text{sd}_K = [\frac{1}{B} \sum\_{b=1}^{B} \\{\log(W\_{Kb}^{*}) - \bar{l}\\}^2]^{\frac{1}{2}} ,$$ + +令 $s_K = \text{sd}(K) \sqrt{1 + \frac{1}{B}}$ ,选择 + +$$\hat{K} = \text{smallest } K \text{ such that } \text{Gap}(K) \ge \text{Gap}(K+1) - s_{K+1}.$$ +作为聚簇数量。 + +(参考 [ [Tibshirani et al., 2001](https://statweb.stanford.edu/~gwalther/gap "Estimating the number of clusters in a data set via the gap statistic") ] ) + +由于 `GMM` 算法本身复杂度较高,而 `gap statistic` 的方法又需要多次模拟、计算聚类结果,因而这两者的结合会导致算法运行很慢。后文将 `k-means` 和 `gap statistic` 结合进行实验。 + +#### Information Criterions (ICs) +`ICs` 是一种选择模型的准则,可以很容易的和基于最大似然的模型结合起来。 + +一般,随着模型复杂度的提升,模型对数据的拟合能力( `maximum likelihood` )也会提升,但会出现过拟合的问题。 `ICs` 通过引入对模型复杂度的惩罚项,希望得到最合适的模型,即在模型对数据的拟合能力与较小的模型复杂度之间达到平衡。 + +##### Akaike Information Criterion (AIC) +`AIC` 的定义是 + +$$\text{AIC} = 2K - 2 \log(\hat{L}) ,$$ + +其中, $K$ 为模型的自由参数量(`number of free parameters`); $\hat{L}$ 是模型估计的最大化的似然函数值。 + +一般选择 `AIC` 值更小的模型。 + +`AIC` 通过引入对模型复杂度(模型参数量)的惩罚项,平衡模型的拟合能力与(较小的)模型复杂度。 + +一般 `AIC` 在聚类算法中的效果不是很好 [ [Nylund et al., 2007](https://www.semanticscholar.org/paper/Deciding-on-the-Number-of-Classes-in-Latent-Class-A-Nylund-Asparouhov/439672067967d3501b0d201bdb50a60886459b00 "Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study") ] ,由于惩罚力度不够,倾向选择过拟合的模型。 + +##### Bayesian Information Criterion (BIC) +`BIC` 选择的模型是使得 + +$$\log(\hat{L}) - \frac{1}{2} K \log{(N)} $$ + +最大的模型。其中, $K$ 为模型的自由参数量(`number of free parameters`); $N$ 为样本数据量; $\hat{L}$ 是模型估计的最大化的似然函数值。 [ [Schwarz, 1978](http://www.jstor.org/stable/2958889 "Estimating the Dimension of a Model") ] + + +相比 `Akaike Information Criterion (AIC)` , `BIC` 对模型复杂度(模型参数量)的惩罚力度更大,且考虑了样本数据量,当数据量较大时对模型参数惩罚更多,倾向于参数较少的简单模型。 + +在实际计算中,采用 +$$\text{BIC} = K \ln (N)-2 \ln (\hat{L}) ,$$ +选择 `BIC` 值更小的模型。 + +在后文中将 `BIC` 与 `GMM` 结合进行实验。 + +##### 算法实现 +在具体的实现过程中, + +记样本数据量为 $N$ ,样本数据维度为 $D$ ,聚簇(子模型)数量为 $K$ , + +模型的自由参数量即为各聚簇中心(子模型的均值) $\boldsymbol{\mu}$ 、协方差矩阵 $\boldsymbol{\Sigma}$ 以及各子模型概率 $\boldsymbol{\alpha}$ 中的自由参数, + +- 聚簇中心的自由参数量为 $K \cdot D$ ; + +- 协方差矩阵的自由参数量为 $K \cdot \frac{D(D+1)}{2}$ ; + +- 子模型概率的自由参数量为 $K - 1$ ( $\sum_{k=1}^{K} \alpha_k = 1$ )。 + +对数最大似然 $\log (\hat{L})$ 为 + +$$\log (\hat{L}) = \sum_{i=1}^{N} \log P(x_i|\theta) = \sum_{i=1}^{N} \log[\sum_{k=1}^{K} \alpha_k \phi (x_i|\theta_k)] .$$ + +在分类过程中参考 [ [Pelleg & Moore, 2000](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.3377 "X-means: Extending K-means with Efficient Estimation of the Number of Clusters") ] , + +通过循环的方式不断进行分类:先将所有样本看作一个聚簇,而后根据 `k-means` 设置两个中心点,计算分成两簇后的 `IC` 值是否比一簇的小,如果 `IC` 值没有下降,则不分为两簇;若 `IC` 值下降,则分成两簇,而后对每一簇内部继续进行循环。直到 $K$ 达到设定的 $K_{max}$ 或 `IC` 值不再下降,结束循环。 + + +### 实验 + +实验探究以上两种选择聚簇数量的方法: + +1. `k-means` + `gap statistic` ; +2. `GMM` + `BIC` ; + + +| Cluster No. | $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} -10 & 50 \end{bmatrix}$ | $\begin{bmatrix} 50 & 25 \\\\ 25 & 40 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 30 & 10\end{bmatrix}$ | $\begin{bmatrix} 30 & 20 \\\\ 20 & 60 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | + +以下为 `k-means + gap statistic` 的表现: + +lab19_1 + +lab19_2 + + +> lab_number = 19 + +以下为 `GMM + BIC` 的表现: + +lab20_1 + +lab20_2 + + +> lab_number = 20 + + +| Cluster No. 
| $\boldsymbol{\mu}$ | $\boldsymbol{\Sigma}$ | Number of samples | +| :----: | :------------: | :------------: | :-------: | +| $1$ | $\begin{bmatrix} -100 & 50 \end{bmatrix}$ | $\begin{bmatrix} 50 & 25 \\\\ 25 & 40 \end{bmatrix}$ | $500$ | +| $2$ | $\begin{bmatrix} 30 & 10\end{bmatrix}$ | $\begin{bmatrix} 30 & 20 \\\\ 20 & 60 \end{bmatrix}$ | $500$ | +| $3$ | $\begin{bmatrix} 10 & 20 \end{bmatrix}$ | $\begin{bmatrix} 20 & 0 \\\\ 0 & 50 \end{bmatrix}$ | $500$ | + +修改第 `1` 组均值 $\boldsymbol{\mu}$ 后, `GMM + BIC` 的表现有所提升: + +lab21_1 + +lab21_2 + +## 总结 +- `GMM` 算法理论上聚类效果优于 `k-means` 算法,但效率较低;面对重合度较高的数据,以及非高斯分布的其他数据,按照给定的类别划分,效果没有特别好。不同数据类别之间的距离起到的影响更大。在实际应用中可以考虑更好的提取特征,拉大不同数据分布之间的“距离”。 + +- 更好的初始参数设置能够提高算法收敛的速度;其对模型聚类结果的影响有待进一步的探究。 + +- 自动寻找聚簇数量的算法表现与数据分布的关联较大,当数据分布较远时效果较好。具体情况可以进一步探究。可能直接采用 `x-means` [ [Pelleg & Moore, 2000](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.3377 "X-means: Extending K-means with Efficient Estimation of the Number of Clusters") ] 的算法会达到较好的结果。 + + +## Reference + +[1]. [Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA ’07.](https://courses.cs.duke.edu/cps296.2/spring07/papers/kMeansPlusPlus.pdf "k-means++: The Advantages of Careful Seeding") + +[2]. [Nylund, K. L., Asparouhov, T., & Muthén, B. O. (2007). Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study. Structural Equation Modeling: A Multidisciplinary Journal, 14(4), 535–569.](https://doi.org/10.1080/10705510701575396 "Deciding on the Number of Classes in Latent Class Analysis and Growth Mixture Modeling: A Monte Carlo Simulation Study") + +[3]. [Pelleg, D., & Moore, A. (2000). X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In Proceedings of the 17th International Conf. on Machine Learning, 727–734.](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.3377 "X-means: Extending K-means with Efficient Estimation of the Number of Clusters") + +[4]. [Schwarz, G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461-464.](http://www.jstor.org/stable/2958889 "Estimating the Dimension of a Model") + +[5]. [Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(2), 411–423.](https://statweb.stanford.edu/~gwalther/gap "Estimating the number of clusters in a data set via the gap statistic") + +[6]. 
[Regularized Gaussian Covariance Estimation](https://freemind.pluskid.org/machine-learning/regularized-gaussian-covariance-estimation/ "Regularized Gaussian Covariance Estimation") + + +## 代码使用方法 +``` +python source.py lab_number #lab_number = 1, 2, ..., 20 + e.g., python source.py 1 +``` \ No newline at end of file diff --git a/assignment-3/submission/18300110042/img/GMM_diagram.png b/assignment-3/submission/18300110042/img/GMM_diagram.png new file mode 100644 index 0000000000000000000000000000000000000000..7aca170b7a91c7f7be98b84b8c03881eaf9a74a6 Binary files /dev/null and b/assignment-3/submission/18300110042/img/GMM_diagram.png differ diff --git a/assignment-3/submission/18300110042/img/k_means_diagram.png b/assignment-3/submission/18300110042/img/k_means_diagram.png new file mode 100644 index 0000000000000000000000000000000000000000..94f49b8b28ac66170e99dc78c80889678c383707 Binary files /dev/null and b/assignment-3/submission/18300110042/img/k_means_diagram.png differ diff --git a/assignment-3/submission/18300110042/img/lab10_1.png b/assignment-3/submission/18300110042/img/lab10_1.png new file mode 100755 index 0000000000000000000000000000000000000000..a66cacbdc9f05323ff9499b5d15a3495cb26c8a1 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab10_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab10_2.png b/assignment-3/submission/18300110042/img/lab10_2.png new file mode 100755 index 0000000000000000000000000000000000000000..4e5eb4082e0b8f08c1d5cdee0816a70bb736b8fe Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab10_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab11_1.png b/assignment-3/submission/18300110042/img/lab11_1.png new file mode 100755 index 0000000000000000000000000000000000000000..f2edca237dcf748ee59ed10b5dcf997708091db5 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab11_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab11_2.png b/assignment-3/submission/18300110042/img/lab11_2.png new file mode 100755 index 0000000000000000000000000000000000000000..0a6a0209363ff6c7cdcfa105a76b6e4eb87f486f Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab11_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab12_1.png b/assignment-3/submission/18300110042/img/lab12_1.png new file mode 100755 index 0000000000000000000000000000000000000000..4072e0cb688325e3d6a5cf8e561470a903107c6b Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab12_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab12_2.png b/assignment-3/submission/18300110042/img/lab12_2.png new file mode 100755 index 0000000000000000000000000000000000000000..3632e8e6e6d87d079517d844a1ffcae2195a6b62 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab12_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab13_1.png b/assignment-3/submission/18300110042/img/lab13_1.png new file mode 100755 index 0000000000000000000000000000000000000000..cb15e456d11e95fa5169f44a76265f828599b642 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab13_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab13_2.png b/assignment-3/submission/18300110042/img/lab13_2.png new file mode 100755 index 0000000000000000000000000000000000000000..9e1847354e67a4960b346c88df69c4a2f898aade Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab13_2.png differ diff --git 
a/assignment-3/submission/18300110042/img/lab14_1.png b/assignment-3/submission/18300110042/img/lab14_1.png new file mode 100755 index 0000000000000000000000000000000000000000..aad4bba82a43e924b0a026d278ed4dc4ff6a745f Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab14_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab14_2.png b/assignment-3/submission/18300110042/img/lab14_2.png new file mode 100755 index 0000000000000000000000000000000000000000..7e120c5da8798dc416088dda79e5cfbb0ecea82b Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab14_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab15_1.png b/assignment-3/submission/18300110042/img/lab15_1.png new file mode 100755 index 0000000000000000000000000000000000000000..d45a9656e2abb3a13580f1cd20edeed2bdb32d5d Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab15_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab15_2.png b/assignment-3/submission/18300110042/img/lab15_2.png new file mode 100755 index 0000000000000000000000000000000000000000..2ef4fb35df22f4cb2128e8788b51ce64fa3fc00f Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab15_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab16_1.png b/assignment-3/submission/18300110042/img/lab16_1.png new file mode 100755 index 0000000000000000000000000000000000000000..50a202a78585385f3fa466a7d2cd417059a10691 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab16_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab16_2.png b/assignment-3/submission/18300110042/img/lab16_2.png new file mode 100755 index 0000000000000000000000000000000000000000..033e1c208a2d6a03051696ffe0fe0d7d5031d2da Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab16_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab17_1.png b/assignment-3/submission/18300110042/img/lab17_1.png new file mode 100755 index 0000000000000000000000000000000000000000..721094c98cb2606620fb1a904a656420ed49c2cf Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab17_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab17_2.png b/assignment-3/submission/18300110042/img/lab17_2.png new file mode 100755 index 0000000000000000000000000000000000000000..ba3981665dbdc88f653e012b408200e64076a8ec Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab17_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab18_1.png b/assignment-3/submission/18300110042/img/lab18_1.png new file mode 100755 index 0000000000000000000000000000000000000000..5d16ed4cbcb00de3958f8c611e4f4d10fc6f78be Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab18_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab18_2.png b/assignment-3/submission/18300110042/img/lab18_2.png new file mode 100755 index 0000000000000000000000000000000000000000..a5e6a0c91a1e258c9b4b1c5f6373165caa46986a Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab18_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab19_1.png b/assignment-3/submission/18300110042/img/lab19_1.png new file mode 100755 index 0000000000000000000000000000000000000000..1c16f63b9462f612be484e3766f58774ab355c58 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab19_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab19_2.png 
b/assignment-3/submission/18300110042/img/lab19_2.png new file mode 100755 index 0000000000000000000000000000000000000000..1cb5d6bba5e239da79f1b53ca38a26511eff2437 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab19_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab1_1.png b/assignment-3/submission/18300110042/img/lab1_1.png new file mode 100755 index 0000000000000000000000000000000000000000..0d065d79783d3f0dc5790ba702c2078691be64c3 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab1_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab1_2.png b/assignment-3/submission/18300110042/img/lab1_2.png new file mode 100755 index 0000000000000000000000000000000000000000..663a4423e20215025634f2522d0ad733e60b42f5 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab1_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab20_1.png b/assignment-3/submission/18300110042/img/lab20_1.png new file mode 100755 index 0000000000000000000000000000000000000000..6065d78f2eedad23a9a0f08ce2821c9a7d6b34af Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab20_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab20_2.png b/assignment-3/submission/18300110042/img/lab20_2.png new file mode 100755 index 0000000000000000000000000000000000000000..e3a30f7a1d3295f475640d5359ab491bae9c6c70 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab20_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab21_1.png b/assignment-3/submission/18300110042/img/lab21_1.png new file mode 100755 index 0000000000000000000000000000000000000000..beabc57f2c7304b1c3d642454b1d4015195ad321 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab21_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab21_2.png b/assignment-3/submission/18300110042/img/lab21_2.png new file mode 100755 index 0000000000000000000000000000000000000000..350efc5571f36d264cf746b76a7ed6f267ddb07c Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab21_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab2_1.png b/assignment-3/submission/18300110042/img/lab2_1.png new file mode 100755 index 0000000000000000000000000000000000000000..19e98580685178a44eac846477ccfc66652c87ef Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab2_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab2_2.png b/assignment-3/submission/18300110042/img/lab2_2.png new file mode 100755 index 0000000000000000000000000000000000000000..42180bc7b50e146c5220a00e2d1f8307d9366fa5 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab2_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab3_1.png b/assignment-3/submission/18300110042/img/lab3_1.png new file mode 100755 index 0000000000000000000000000000000000000000..6fdc12a170d6137f8607d4a8ccb62aba4ddc2c59 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab3_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab3_2.png b/assignment-3/submission/18300110042/img/lab3_2.png new file mode 100755 index 0000000000000000000000000000000000000000..8259b5113db886835f95dc5cc89ec6698517103e Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab3_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab4_1.png b/assignment-3/submission/18300110042/img/lab4_1.png new file mode 100755 index 
0000000000000000000000000000000000000000..b49001662f08daf078ff3ecb63d6faa91c56749b Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab4_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab4_2.png b/assignment-3/submission/18300110042/img/lab4_2.png new file mode 100755 index 0000000000000000000000000000000000000000..47e1b78fe46428bc27f81d41daa9661a637245e7 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab4_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab5_1.png b/assignment-3/submission/18300110042/img/lab5_1.png new file mode 100755 index 0000000000000000000000000000000000000000..31462903cbc638bb1da1c0f1686e3782ba9f45f4 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab5_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab5_2.png b/assignment-3/submission/18300110042/img/lab5_2.png new file mode 100755 index 0000000000000000000000000000000000000000..7225f89db276c9dac89fd47db732d260aeef7dcd Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab5_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab6_1.png b/assignment-3/submission/18300110042/img/lab6_1.png new file mode 100755 index 0000000000000000000000000000000000000000..09758724d97f872a8fffec115d3d435f234ca92a Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab6_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab6_2.png b/assignment-3/submission/18300110042/img/lab6_2.png new file mode 100755 index 0000000000000000000000000000000000000000..d87c3c5ddfb97b4d102f6fabc4613550d834034f Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab6_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab7_1.png b/assignment-3/submission/18300110042/img/lab7_1.png new file mode 100755 index 0000000000000000000000000000000000000000..7c65ad6dbfcc46010824f058a8736054854dcd4f Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab7_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab7_2.png b/assignment-3/submission/18300110042/img/lab7_2.png new file mode 100755 index 0000000000000000000000000000000000000000..a4ce12894ab59103ddc646002938802f8678c702 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab7_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab8_1.png b/assignment-3/submission/18300110042/img/lab8_1.png new file mode 100755 index 0000000000000000000000000000000000000000..4e2334e3a492ce916c9ba4912c9c5c02d9aa1273 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab8_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab8_2.png b/assignment-3/submission/18300110042/img/lab8_2.png new file mode 100755 index 0000000000000000000000000000000000000000..d046db41d4ca511eb51a1b6d143c77884a0b5d4a Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab8_2.png differ diff --git a/assignment-3/submission/18300110042/img/lab9_1.png b/assignment-3/submission/18300110042/img/lab9_1.png new file mode 100755 index 0000000000000000000000000000000000000000..47a601356e617caecd75aa60acd7332d2c738872 Binary files /dev/null and b/assignment-3/submission/18300110042/img/lab9_1.png differ diff --git a/assignment-3/submission/18300110042/img/lab9_2.png b/assignment-3/submission/18300110042/img/lab9_2.png new file mode 100755 index 0000000000000000000000000000000000000000..04f395072ad540a60c71b2b84d1a0808d3fdd805 Binary files 
/dev/null and b/assignment-3/submission/18300110042/img/lab9_2.png differ diff --git a/assignment-3/submission/18300110042/source.py b/assignment-3/submission/18300110042/source.py new file mode 100644 index 0000000000000000000000000000000000000000..7408b8bfff7ba3358fc09eadf14bf0fcb00f85bc --- /dev/null +++ b/assignment-3/submission/18300110042/source.py @@ -0,0 +1,1137 @@ +from os import error +import sys +import numpy as np +import matplotlib.pyplot as plt +from numpy.core.einsumfunc import _optimal_path +from numpy.core.fromnumeric import mean, std +from numpy.core.numeric import identity +from numpy.random import random_sample + +class KMeans: + ''' Implementing K-Means clustering algorithm. ''' + + def __init__(self, n_clusters=1, init='random', max_iter=100, tol=1e-6, random_state=825): + ''' + Initializ KMeans class. + Args: + n_clusters: int, default=1 + Number of clusters to form as well as the number of centroids to generate. + init: 'random' or 'kmeans++', default='random' + Method used to initialize centorids. Must be 'random' or 'kmeans++' or 'input', + - 'random': choose initial centroids randomly + - 'kmeans++' : choose initial centroids in a smart way to speed up convergence. + max_iter: int, default=100 + Maximum number of iterations to run. + tol: float, default=1e-6 + The convergence threshold. Iterations will stop when the difference between the + centroids of two consecutive iterations is below this threshold. + random_state: int, default=825 + RandomState used to determine random number generation for centroids initialization. + ''' + assert init == 'random' or init == 'kmeans++', 'Unknow initializing method.' + self.n_clusters = n_clusters + self.init = init + self.max_iter = max_iter + self.tol = tol + self.random_state = random_state + self.labels = None # data labels + self.centroids = None # center of clusters + self.errors_log = None # sum of squares error + + + def _select_centroids(self, X, K): + ''' + Select K centroids from dataset X. + Args: + X: numpy array (n_samples, n_featuers) + The dataset among which the centroids are selected. + K: int + The number of centroids to be selected. + Return: + centroids: numpy array (K, n_features) + The selected K centroids. + ''' + # check the number of train dataset + n_samples, n_features = X.shape + assert n_samples >= K, "The number of data must be great than clusters." 
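+        # Seeding: either a uniform random draw of K samples, or k-means++-style
+        # D^2-weighted sampling (Arthur & Vassilvitskii, 2007), handled below.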
+ + rng = np.random.RandomState(self.random_state) + + centroids = np.zeros((K, n_features)) + if self.init == 'random': # select centroids ramdomly + rand_idx = rng.permutation(n_samples) + centroids = X[rand_idx[:K]] + elif self.init == 'kmeans++': # select centroids using kmeans++ algorithm + # randomly select the first centroid + rand_idx = rng.choice(n_samples) + centroids[0] = X[rand_idx] + + # loop to select the rest K-1 centroids + distance_square = calculate_distance_square(X, centroids[0:1]).flatten() + for i in range(1, K): + # calculate the distance between all points and the current centroid + new_distance_square = calculate_distance_square(X, centroids[i-1:i]).flatten() + # compare and keep the shortest distance for all points + distance_square = np.min(np.vstack((distance_square, new_distance_square)), axis=0) + # calculate the probability and select the next centroid accordingly + best_idx = rng.choice(n_samples, size=1, p=distance_square/np.sum(distance_square)) + centroids[i] = X[best_idx] + return centroids + + + def _recalculate_model(self, X, K, labels, distance): + ''' + Recalculate and update model based on the labels of all data in dataset X. + Notes there may be empty cluster sometimes, and it will cause error. My solution + is to move the furthest point in the biggest cluster to the empty cluster. + Args: + X: numpy array (n_samples, n_featuers) + The dataset. + K: int + The number of cluster to be formed. + labels: numpy array (n_samples, ) + The labels identifying the cluster that each sample belongs to. + label[i] is the index of the cluster for the i-th observation. The clustering + includes n_clusters clusters. + distance: numpy array (n_samples, K) + Distance between points and centroids. + distance[i, j] is the distance from point(i) to centroid(j) + ''' + # check wheter there is an empty cluster. move a point to it if empty cluster exist. + for label in range(K): + l_keys, l_counts = np.unique(labels, return_counts=True) + if not label in l_keys: # empty cluster + # find the biggest cluster + l_key = l_keys[l_counts.argmax()] + # find the furthest point in the cluster + d_copy = np.copy(distance[:, l_key]) + d_copy[labels!=l_key] = 0 + p_idx = np.argmax(d_copy) + # move the point to the empty cluster + labels[p_idx] = label + + # recalculate centroids + centroids = np.zeros((K, X.shape[1])) + for i in range(K): + centroids[i, :] = np.mean(X[labels==i, :], axis=0) + + return labels, centroids + + + def _statistic_params(self, X): + ''' + Calculate the means, covariances and mixing probabilities for each clusters. + Args: + X: numpy array (n_samples, n_featuers) + The dataset. + Returns: + means: numpy array (K, n_features) + Means of the clusters. + covariances: numpy array (K, n_feature, n_feature) + Covariances of the clusters. + mixing_probs: numpy array (K, ) + Mixing probabilities of mixture model. 
+ ''' + # set means to self.centroids + means = self.centroids + K, n_features = means.shape + covariances = np.zeros((K, n_features, n_features)) + mixing_probs = np.zeros(K) + + # calculate covariances and mixing probabilities + distances_square = calculate_distance_square(X, means) + labels = np.argmin(distances_square, axis=1) + for i in range(K): + Xi = X[labels==i] + if len(Xi) == 1: # if there is only one point in the cluster + # duplicate the point and calculate the covariances + covariances[i] = np.cov(np.vstack((Xi, Xi)).T) + else: + covariances[i] = np.cov(Xi.T) + mixing_probs[i] = 1.0 * len(Xi) / len(X) + + return means, covariances, mixing_probs + + + def fit(self, train_data, K=None, centroids=None, ic=False): + ''' + Fit the KMeans model in training data. + Args: + train_data: nump array (n_samples, n_featuers) + The input training dataset. + K: int, default=None + The input K. Enable change n_clusters when fit. + centroids: numpy array (K, n_features), default=None + The input centroids. Enable assign centroids when fit. + ic: bool, default=False + Whether to calculate and return means, covariances, mixing probabilities. + ''' + if K is not None: self.n_clusters = K + + if centroids is not None: + self.centroids = centroids + self.n_clusters = centroids.shape[0] + else: + self.centroids = self._select_centroids(train_data, self.n_clusters) + + # loop to fit kmeans model, break if iteration limits reached or the difference + # between the centroids of two consecutive iterations is lower than threshold + errors = [] + for i in range(self.max_iter): + # calculate distances between all points and all centroids + all_distance_square = calculate_distance_square(train_data, self.centroids) + + # calculate and update labels, centroids + labels = np.argmin(all_distance_square, axis=1) + old_centroids = self.centroids + self.labels, self.centroids = self._recalculate_model(train_data, + self.n_clusters, + labels, + all_distance_square) + + # calculate errors, update errors_log + inner_distance_square = np.zeros(train_data.shape[0]) + for i in range(self.n_clusters): + inner_distance_square[labels==i] = calculate_distance_square( + train_data[labels==i], + self.centroids[i:i+1]).flatten() + errors.append(np.sum(inner_distance_square)) + + # check convergence, break if centroids' diffirence is below threshold + if np.linalg.norm(self.centroids - old_centroids) < self.tol: + break + self.errors_log = np.array(errors) + + # if fit for information criterion, then calculate and return statistic parameters + if ic: + centroids, covariances, mixing_probs = self._statistic_params(train_data) + return centroids, self.labels, covariances, mixing_probs + else: + return self.n_clusters, self.centroids, self.labels, self.errors_log + + + def predict(self, test_data): + ''' + Predict to identify the labels for test dataset. + Args: + test_data: numpy array (n_samples, n_featuers) + The test dataset. + Returns: + The labels of test data. + ''' + distance_square = calculate_distance_square(test_data, self.centroids) + return np.argmin(distance_square, axis=1) + + + def optimize_k(self, X, optimizer): + ''' + Optimize K for the model. + Args: + X: numpy array (n_samples, n_featuers) + The dataset. + algorithm: must be GapsOptimizer or XICOptimizer + The optimizer implementing Gap Statistic or AIC/BIC algorithm. + Returns: + Optimized model. + ''' + K = optimizer.get_optimal_k(X) + self.__init__(n_clusters=K) + return self + + def get_sse(self, X): + ''' + Return sum of square error. 
+        '''
+        inner_distance_square = np.zeros(X.shape[0])
+        for i in range(self.n_clusters):
+            inner_distance_square[self.labels==i] = calculate_distance_square(
+                X[self.labels==i], self.centroids[i:i+1]).flatten()
+        return np.sum(inner_distance_square)
+
+
+class GaussianMixture:
+    ''' Implementing the Gaussian Mixture clustering algorithm. '''
+
+    def __init__(self, n_clusters=1, init='random', max_iter=100, tol=1e-4, random_state=825):
+        '''
+        Initialize the GaussianMixture class.
+        Args:
+            n_clusters: int, default=1
+                Number of clusters to form as well as the number of centroids to generate.
+            init: 'random' or 'kmeans', default='random'
+                Method used to initialize the parameters.
+                - 'random': initialize parameters randomly.
+                - 'kmeans': initialize parameters using the kmeans algorithm.
+            max_iter: int, default=100
+                Maximum number of iterations to run.
+            tol: float, default=1e-4
+                The convergence threshold. EM iterations will stop when the lower bound average
+                gain is below this threshold.
+            random_state: int, default=825
+                RandomState used to determine random number generation for parameters initialization.
+        '''
+        assert init == 'random' or init == 'kmeans', 'Unknown initializing method.'
+        self.n_clusters = n_clusters
+        self.init = init
+        self.max_iter = max_iter
+        self.tol = tol
+        self.random_state = random_state
+        self.labels = None
+        self.centroids = None
+        self.covariances = None
+        self.mixing_probs = None
+        self.errors_log = None
+
+    def _initialize_parameters(self, X, K, centroids):
+        '''
+        Initialize parameters including means, covariances and mixing probabilities
+        based on dataset X.
+        Args:
+            X: numpy array (n_samples, n_features)
+                The dataset.
+            K: int
+                The number of clusters.
+            centroids: numpy array (K, n_features)
+                The centroids used to initialize the model.
+        Returns:
+            means: numpy array (K, n_features)
+                The initial K centroids.
+            covariances: numpy array (K, n_features, n_features)
+                The initial K covariances.
+            mixing_probs: numpy array (K, )
+                The mixing probabilities.
+        '''
+        n_samples, n_features = X.shape
+        assert n_samples >= K, "The number of samples must be at least the number of clusters."
+
+        rng = np.random.RandomState(self.random_state)
+
+        if centroids is not None:
+            means = centroids
+            self.n_clusters = centroids.shape[0]
+            K = centroids.shape[0]
+        else:
+            # initialize means randomly
+            if self.init == 'random':
+                rand_idx = rng.permutation(n_samples)
+                X_shuffled = X[rand_idx]
+                means = X_shuffled[:K]
+            # initialize means using the KMeans algorithm
+            elif self.init == 'kmeans':
+                kmeans = KMeans(K)
+                kmeans.fit(X)
+                means = kmeans.centroids
+
+        # calculate covariances and mixing probabilities
+        covariances = np.zeros((K, n_features, n_features))
+        mixing_probs = np.zeros(K)
+        distances_square = calculate_distance_square(X, means)
+        labels = np.argmin(distances_square, axis=1)
+        for i in range(K):
+            Xi = X[labels==i]
+            if len(Xi) == 0:
+                covariances[i] = np.zeros((n_features, n_features))
+            elif len(Xi) == 1:  # if there is only one point in the cluster
+                # duplicate the point and calculate the covariances
+                covariances[i] = np.cov(np.vstack((Xi, Xi)).T)
+            else:
+                covariances[i] = np.cov(Xi.T)
+            mixing_probs[i] = 1.0 * len(Xi) / n_samples
+        return means, covariances, mixing_probs
+
+
+    def _statistic_params(self, X):
+        return self.centroids, self.covariances, self.mixing_probs
+
+
+    def fit(self, train_data, K=None, centroids=None, ic=False):
+        '''
+        Fit the model on the training dataset.
+        Args:
+            train_data: np.array (n_samples, n_features)
+                The input training dataset.
+ K: int, default=None + The input K. Enable change n_clusters when fit. + centroids: numpy array (K, n_features), default=None + The input centroids. Enable assign centroids when fit. + ic: bool, default=False + Whether to calculate and return means, covariances, mixing probabilities. + ''' + if K is not None: self.n_clusters = K + + self.centroids, self.covariances, self.mixing_probs = \ + self._initialize_parameters(train_data, self.n_clusters, centroids) + + n_samples = train_data.shape[0] + log_likelihood = 0 + errors = [] + for iter in range(self.max_iter): + # 1) calculate normal distribution probabilities density matrix + norm_densities = normal_pdf(train_data, self.centroids, self.covariances) + + # 2) check log likelihood for convergence, break if threshold reached + pxs = [np.dot(self.mixing_probs.T, norm_densities[i]) for i in range(train_data.shape[0])] + new_log_likelihood = np.sum(np.log(np.array(pxs))) + errors.append(-new_log_likelihood) + if np.abs(new_log_likelihood - log_likelihood) < self.tol: + break + log_likelihood = new_log_likelihood + + # 3) calculate posterior probability of each point for each cluster + responsibilities = np.empty((n_samples, self.n_clusters), np.float64) + for i in range(n_samples): + denominator = np.dot(self.mixing_probs.T, norm_densities[i]) + for j in range(self.n_clusters): + responsibilities[i][j] = self.mixing_probs[j] * norm_densities[i][j] / denominator + + # 4) update parameters means, covariances and mixing probabilities + for i in range(self.n_clusters): + responsibility = responsibilities[:, i] + denominator = np.sum(responsibility) + # re-calculate means + self.centroids[i] = np.dot(responsibility.T, train_data) / denominator + # re-calculate covariances + difference = train_data - np.tile(self.centroids[i], (n_samples, 1)) + _r = responsibility.reshape(n_samples, 1) + self.covariances[i] = np.dot(np.multiply(_r, difference).T, difference) / denominator + # re-calculate mixing probabilities + self.mixing_probs[i] = denominator / n_samples + + # calcuate and update model + self.labels = self.predict(train_data) + self.errors_log = np.array(errors) + + # if fit for information criterion, then calculate and return statistic parameters + if ic: + centroids, covariances, mixing_probs = self._statistic_params(train_data) + return centroids, self.labels, covariances, mixing_probs + else: + return self.n_clusters, self.centroids, self.labels, self.errors_log + + + def predict(self, test_data): + ''' + Predict the labels of test_data. + Args: + test_data: numpy array (n_samples, n_featuers) + The test dataset. + Returns: + labels: numpy array (n_samples, ) + The labels of test data. + ''' + n_row, _ = test_data.shape + labels = np.empty(n_row, dtype=np.int32) + probs = normal_pdf(test_data, self.centroids, self.covariances) + for i in range(n_row): + labels[i] = np.argmax(probs[i]) + return labels + + + def optimize_k(self, X, optimizer): + ''' + Optimize K for the model. + Args: + X: numpy array (n_samples, n_featuers) + The dataset. + algorithm: must be GapsOptimizer or XICOptimizer + The optimizer implementing Gap Statistic or AIC/BIC algorithm. + Returns: + Optimized model. + ''' + K = optimizer.get_optimal_k(X) + self.__init__(n_clusters=K) + return self + + + def get_sse(self, X): + ''' + Return sum of square error. 
+        '''
+        inner_distance_square = np.zeros(X.shape[0])
+        for i in range(self.n_clusters):
+            inner_distance_square[self.labels==i] = calculate_distance_square(
+                X[self.labels==i], self.centroids[i:i+1]).flatten()
+        return np.sum(inner_distance_square)
+
+
+class GapsOptimizer:
+    '''
+    An optimizer that uses the gap statistic to estimate the optimal number of clusters.
+    '''
+    def __init__(self, base_clusterer=None, B=10, min_k=1, max_k=10, random_state=825):
+        '''
+        Initialize the GapStatistic optimizer.
+        Args:
+            base_clusterer: object or None, default=None
+                The base clusterer to use to cluster the data. If None, then the base
+                clusterer is K-Means.
+            B: int, default=10
+                The number of reference data sets that are generated and clustered in
+                order to estimate the optimal number of clusters for the data set.
+            min_k: int, default=1
+                The minimal number of clusters to consider when estimating the optimal
+                number of clusters for the data set.
+            max_k: int, default=10
+                The maximum number of clusters to consider when estimating the optimal
+                number of clusters for the data set.
+        '''
+        self.base_clusterer = self._check_clusterer(base_clusterer)
+        self.B = B
+        self.min_k = min_k
+        self.max_k = max_k
+        self.random_state = random_state
+
+
+    def _check_clusterer(self, clusterer=None):
+        '''
+        Check that the clusterer is a valid clusterer with the required methods. If no clusterer
+        is provided, create a default KMeans clusterer.
+        Args:
+            clusterer: object or None, default=None
+                The clusterer to use to cluster the data sets. If None, then the clusterer
+                is K-Means.
+        Returns:
+            clusterer: object
+                The supplied clusterer or a default clusterer if none was provided.
+        '''
+        if clusterer is None:
+            clusterer = KMeans()
+        else:
+            # make sure the base clusterer implements fit()
+            getattr(clusterer, 'fit')
+            # make sure the base clusterer has an n_clusters attribute
+            getattr(clusterer, 'n_clusters')
+        return clusterer
+
+
+    def _gap_statistic(self, X):
+        '''
+        Gap statistic algorithm. Uses the gap statistic method to estimate the
+        optimal number of clusters.
+        Args:
+            X: numpy array (n_samples, n_features)
+                The dataset to cluster.
+        Returns:
+            K: int
+                The estimate of the optimal number of clusters.
+            gaps: numpy array, (K, )
+                The gap statistic calculated as log(W*) - log(W).
+                - W: The mean pooled within-cluster sum of squares around the cluster means
+                  for the input data set.
+                - log(W): the logarithm of W.
+                - log(W*): The expectation of log(W) under a reference distribution of the data.
+ std_errs: numpy array, (K, ) + The standard error for the means of the B sample reference datasets + ''' + # randomly generate the B reference datasets based on the input dataset X + rdg = np.random.RandomState(self.random_state) + n_row, n_col = X.shape + maxs, mins = X.max(axis=0), X.min(axis=0) + ranges = np.diag(maxs - mins) + X_ref = rdg.random_sample(size=(self.B, n_row, n_col)) + for i in range(self.B): + X_ref[i, :, :] = X_ref[i, :, :] @ ranges + mins + # range of K from min_k to max_k + K = range(self.min_k, self.max_k + 1) + # calculate W and W* + W = np.zeros(len(K)) + W_star = np.zeros((len(K), self.B)) + for idx, k in enumerate(K): + # calculate sum of square error using self.clusterer + self.base_clusterer.fit(X, k) + W[idx] = self.base_clusterer.get_sse(X) + for i in range(self.B): + self.base_clusterer.fit(X_ref[i], k) + W_star[idx, i] = self.base_clusterer.get_sse(X) + # calculate gaps, and standard deviation + gaps = np.mean(np.log(W_star), axis=1) - np.log(W) + sd_ks = np.std(np.log(W_star), axis=1) + std_errs = sd_ks * np.sqrt(1 + 1.0 / self.B) + return gaps, std_errs + + + def get_optimal_k(self, X): + ''' + Estimate and return the optimal number of clusters. + Args: + X: numpy array (n_samples, n_features) + The dataset to cluster. + Returns: + n_clusters: int + The estimate of the optimal number of clusters. + ''' + # ensure the max_k is not greater than the samples of input dataset. + self.max_k = X.shape[0] if self.max_k > X.shape[0] else self.max_k + + # perform gap statistic to get the gaps and std_errs + gaps, std_errs = self._gap_statistic(X) + + # estimate optimal K + optimal_k = self.max_k + for k in range(self.max_k - self.min_k): + if std_errs[k+1] - (gaps[k+1] - gaps[k]) > 0: + optimal_k = self.min_k + k + break + + return optimal_k + + +class XICOptimizer: + ''' + An optimizer that uses AIC (Akaike information criterion) or Bayesian Information Criterion (BIC) + to approximate the correct number of clusters. + ''' + def __init__(self, base_clusterer=None, ic='BIC', min_k=1, max_k=10, random_state=825): + ''' + Initialize the BIC optimizer. + Args: + base_clusterer: object or None, default=None + The base clusterer to use to cluster the data. If None, then the base + clusterer is K-Means. + min_k: int, default=1 + The minimal number of clusters to consider when estimating the optimal + number of clusters for the data set. + max_k: int, default=10 + The maximum number of clusters to consider when estimating the optimal + number of clusters for the data set. + ''' + self.base_clusterer = self._check_clusterer(base_clusterer) + self.min_k = min_k + self.max_k = max_k + self.random_state = random_state + + + def _check_clusterer(self, clusterer=None): + ''' + Check that the clusterer is a valid clusterer with required methods. If no cluster + is provided, create a default KMeans clusterer. + Args: + clusterer: object or None, default=None + The clusterer to use to cluster the data sets. If None, then the clusterer + is K-Means. + Returns: + clusterer: object + The supplied clusterer or a default clusterer if none was provided. 
+class XICOptimizer:
+    '''
+    An optimizer that uses the AIC (Akaike Information Criterion) or BIC (Bayesian Information
+    Criterion) to approximate the correct number of clusters.
+    '''
+    def __init__(self, base_clusterer=None, ic='BIC', min_k=1, max_k=10, random_state=825):
+        '''
+        Initialize the AIC/BIC optimizer.
+        Args:
+            base_clusterer: object or None, default=None
+                The base clusterer to use to cluster the data. If None, then the base
+                clusterer is K-Means.
+            ic: str, default='BIC'
+                The information criterion to use, either 'AIC' or 'BIC'.
+            min_k: int, default=1
+                The minimum number of clusters to consider when estimating the optimal
+                number of clusters for the data set.
+            max_k: int, default=10
+                The maximum number of clusters to consider when estimating the optimal
+                number of clusters for the data set.
+        '''
+        self.base_clusterer = self._check_clusterer(base_clusterer)
+        self.ic = ic
+        self.min_k = min_k
+        self.max_k = max_k
+        self.random_state = random_state
+
+
+    def _check_clusterer(self, clusterer=None):
+        '''
+        Check that the clusterer is a valid clusterer with the required methods. If no
+        clusterer is provided, create a default KMeans clusterer.
+        Args:
+            clusterer: object or None, default=None
+                The clusterer to use to cluster the data sets. If None, then the clusterer
+                is K-Means.
+        Returns:
+            clusterer: object
+                The supplied clusterer or a default clusterer if none was provided.
+        '''
+        if clusterer is None:
+            clusterer = KMeans()
+        else:
+            # make sure base clusterer implements fit()
+            getattr(clusterer, 'fit')
+            # make sure base clusterer has n_clusters attribute
+            getattr(clusterer, 'n_clusters')
+        return clusterer
+
+
+    def _xic(self, X, means, covariances, mixing_probs, ic_type=None):
+        '''
+        Calculate the AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion).
+        The AIC is defined as: AIC = 2 * k - 2 * ln(L)
+        The BIC is defined as: BIC = k * ln(n) - 2 * ln(L)
+        where:
+            * ln(L) is the log-likelihood of the estimated model
+            * k is the number of parameters
+            * n is the number of samples
+        Args:
+            X: numpy array (n_samples, n_features)
+                The dataset to cluster.
+            means, covariances, mixing_probs:
+                The fitted mixture parameters used to evaluate the log-likelihood.
+            ic_type: 'AIC', 'BIC' or None, default=None
+                The information criterion type. If None, use the type passed to __init__.
+        Returns:
+            ic: float
+                AIC or BIC value.
+        '''
+        ic_type = self.ic if ic_type is None else ic_type
+        assert ic_type == 'AIC' or ic_type == 'BIC', 'Unknown information criterion type.'
+        # calculate the log-likelihood of the parameters supplied by the caller
+        ll = calculate_log_likelihood(X, means, covariances, mixing_probs)
+        # calculate the number of parameters:
+        #   K - 1                                   (mixing probabilities)
+        #   K * n_features                          (means)
+        #   K * n_features * (n_features + 1) / 2   (covariances)
+        n_features = X.shape[1]
+        k = means.shape[0] * (0.5*n_features**2 + 1.5*n_features + 1) - 1
+        # calculate the number of samples
+        n = X.shape[0]
+        if ic_type == 'AIC':
+            return 2 * k - 2 * ll
+        elif ic_type == 'BIC':
+            return k * np.log(n) - 2 * ll
+
+
+    def get_optimal_k(self, X):
+        '''
+        Estimate and return the optimal number of clusters.
+        Args:
+            X: numpy array (n_samples, n_features)
+                The dataset to cluster.
+        Returns:
+            n_clusters: int
+                The estimate of the optimal number of clusters.
+        '''
+        # ensure that max_k is not greater than the number of samples in the input dataset
+        self.max_k = X.shape[0] if self.max_k > X.shape[0] else self.max_k
+        # estimate the optimal K by repeatedly trying to split each cluster in two and
+        # keeping the split only when it improves the information criterion
+        k = self.min_k
+        _, centroids, labels, _ = self.base_clusterer.fit(X, k)
+        while k <= self.max_k:
+            new_centroids = []
+            for i, centroid in enumerate(centroids):
+                Xi = X[labels==i]
+                if len(Xi) == 1:
+                    new_centroids.append(centroid)
+                else:
+                    # score the cluster as a single component (parent) ...
+                    means_1, _, covs_1, probs_1 = self.base_clusterer.fit(Xi, 1, np.array([centroid]), ic=True)
+                    ic_1 = self._xic(Xi, means_1, covs_1, probs_1)
+                    # ... and as two components (children)
+                    means_2, _, covs_2, probs_2 = self.base_clusterer.fit(Xi, 2, ic=True)
+                    ic_2 = self._xic(Xi, means_2, covs_2, probs_2)
+                    if ic_1 <= ic_2:    # keep parent
+                        new_centroids.append(means_1[0])
+                    else:               # keep children
+                        new_centroids.append(means_2[0])
+                        new_centroids.append(means_2[1])
+            if k == len(new_centroids):
+                break
+            k = len(new_centroids)
+            new_centroids = np.array(new_centroids)
+            _, centroids, labels, _ = self.base_clusterer.fit(X, k, new_centroids)
+        return k
+
+
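+# Worked illustration of the parameter count used in _xic above (documentation only): for a
+# full-covariance GMM with K components in d dimensions the number of free parameters is
+#   (K - 1) mixing weights + K * d means + K * d * (d + 1) / 2 covariance entries,
+# e.g. K = 3, d = 2 gives 2 + 6 + 9 = 17, matching K * (0.5*d**2 + 1.5*d + 1) - 1.
+def _demo_xic_param_count(K=3, d=2):
+    n_params = (K - 1) + K * d + K * d * (d + 1) // 2
+    assert n_params == K * (0.5 * d ** 2 + 1.5 * d + 1) - 1
+    return n_params
+
+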
+class ClusteringAlgorithm:
+    def __init__(self, clusterer='KMeans', optimizer='GapsOptimizer'):
+        assert clusterer == 'KMeans' or clusterer == 'GaussianMixture', 'Unknown clustering model.'
+        assert optimizer == 'GapsOptimizer' or optimizer == 'XICOptimizer', 'Unknown clustering optimizer.'
+
+        if clusterer == 'KMeans':
+            self.clusterer = KMeans()
+        elif clusterer == 'GaussianMixture':
+            self.clusterer = GaussianMixture()
+
+        if optimizer == 'GapsOptimizer':
+            self.optimizer = GapsOptimizer(self.clusterer)
+        elif optimizer == 'XICOptimizer':
+            self.optimizer = XICOptimizer(self.clusterer)
+
+    def fit(self, train_data):
+        # let the optimizer pick the number of clusters, then fit the final model
+        self.clusterer = self.clusterer.optimize_k(train_data, self.optimizer)
+        K, centroids, labels, errors_log = self.clusterer.fit(train_data)
+        return K, centroids, labels, errors_log
+
+    def predict(self, test_data):
+        return self.clusterer.predict(test_data)
+
+
+def calculate_distance_square(X, centroids):
+    '''
+    Helper function to calculate the squared distance from each point in dataset X to each
+    centroid.
+    Args:
+        X: numpy array (n_samples, n_features)
+            The dataset.
+        centroids: numpy array (n, n_features)
+            The centroids.
+    Returns:
+        distance: numpy array (n_samples, n)
+            Distance between points and centroids.
+            distance[i, j] is the distance from point(i) to centroid(j)
+    '''
+    epsilon = 1e-8
+    n_row = centroids.shape[0]
+    distance_square = np.zeros((X.shape[0], n_row))
+    for i in range(n_row):
+        # distance between each point in X and centroids[i]
+        row_norm = np.linalg.norm(X - centroids[i, :], axis=1)
+        # duplicate points get a distance of 0, which can cause numerical problems later on;
+        # clamp to a very small positive value to avoid them
+        row_norm[row_norm - epsilon < 0] = epsilon
+        distance_square[:, i] = np.square(row_norm)
+    return distance_square
+
+
+def normal_pdf(X, means, covariances):
+    '''
+    Helper function to calculate the normal distribution probability density.
+    Args:
+        X: numpy array (n_samples, n_features)
+            The dataset.
+        means: numpy array (K, n_features)
+            Means of the clusters.
+        covariances: numpy array (K, n_feature, n_feature)
+            Covariances of the clusters.
+    Returns:
+        densities: numpy array (n_samples, K)
+            The normal distribution probability density matrix.
+    '''
+    # add a perturbation to the covariance matrix to avoid singular matrix errors
+    epsilon = 1e-6
+    n_samples = X.shape[0]
+    K = means.shape[0]
+    n_features = X.shape[1]
+    norm_densities = np.empty((n_samples, K), np.float64)
+    for i in range(n_samples):
+        x = X[i]
+        for j in range(K):
+            if len(x) == 1:    # if 1D, the covariance is the variance (sigma^2)
+                cov = covariances[j] + epsilon * epsilon
+                # calculate the inverse of the covariance
+                cov_inv = 1 / cov
+                # calculate the determinant of the covariance
+                cov_det = cov
+            else:    # if 2D or above
+                cov = covariances[j] + epsilon * np.identity(covariances[j].shape[0])
+                # calculate the inverse of the covariance
+                cov_inv = np.linalg.inv(cov)
+                # calculate the determinant of the covariance
+                cov_det = np.linalg.det(cov)
+            # calculate the exponent portion of the formula
+            exponent = -0.5 * np.dot(np.dot((x - means[j]).T, cov_inv), (x - means[j]))
+            # calculate the denominator portion of the formula: the normalizing constant of a
+            # d-dimensional Gaussian is (2*pi)^(d/2) * sqrt(|cov|), with d = n_features
+            denominator = np.power(2 * np.pi, n_features / 2) * np.sqrt(cov_det)
+            norm_densities[i][j] = np.exp(exponent) / denominator
+    return norm_densities
+
+
+def calculate_log_likelihood(X, means, covariances, mixing_probs):
+    '''
+    Helper function to calculate the log-likelihood of the model given the dataset and the
+    parameters (means, covariances, mixing probabilities).
+    Args:
+        X: numpy array (n_samples, n_features)
+            The dataset.
+        means: numpy array (K, n_features)
+            Means of the clusters.
+        covariances: numpy array (K, n_feature, n_feature)
+            Covariances of the clusters.
+        mixing_probs: numpy array (K, )
+            Mixing probabilities of the mixture model.
+    Returns:
+        log_likelihood: float
+            The log likelihood.
+ ''' + norm_densities = normal_pdf(X, means, covariances) + pxs = [np.dot(mixing_probs.T, norm_densities[i]) for i in range(X.shape[0])] + log_likelihood = np.sum(np.log(np.array(pxs))) + return log_likelihood + +#def plot_fit(C, K, X, labels, means, errors_log, labels_p, means_p): +def plot_fit(C, K, X, labels, errors_log, labels_p, means_p): + colours=['mediumpurple', 'gold', 'dodgerblue', 'royalblue', 'orange', + 'mediumaquamarine', 'coral', 'lightskyblue', 'cornflowerblue', 'lightcoral', + 'mediumturquoise', 'mediumslateblue', 'lightseagreen'] + fig, axs = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(11, 5)) + title = 'Algorithm: {0}, K: {1}'.format(C, K) + x_min, x_max = np.min(X[:, 0]) - 1 , np.max(X[:, 0]) * 1.05 + y_min, y_max = np.min(X[:, 1]) - 1, np.max(X[:, 1]) * 1.05 + # axs[0], original data + axs[0].grid() + axs[0].set_title('Original Data ({0})'.format(title)) + axs[0].set_xlim(x_min, x_max) + axs[0].set_ylim(y_min, y_max) + for i in range(len(np.unique(labels))): + axs[0].scatter(X[labels==i, 0], X[labels==i, 1], s=10, c=colours[i]) + #axs[0].scatter(means[i, 0], means[i, 1], c='k', s=60, marker='*', label='centroid') + # axs[1], clustering result + axs[1].grid() + axs[1].set_title('Clustering Result ({0})'.format(title)) + axs[1].set_xlim(x_min, x_max) + axs[1].set_ylim(y_min, y_max) + for i in range(means_p.shape[0]): + axs[1].scatter(X[labels_p==i, 0], X[labels_p==i, 1], s=10, c=colours[i]) + axs[1].scatter(means_p[i, 0], means_p[i, 1], c='k', s=60, marker='*', label='centroid') + plt.show() + # axs[2], loss + plt.grid() + plt.title('Loss ({0})'.format(title)) + k = list(range(len(errors_log))) + plt.plot(k, errors_log, '-o') + plt.show() + + +l_config = [ + { # 1 + 'means': [ [1, 2], [16, -5], [10, 22] ], + 'covs': [ [ [70, 0], [ 0, 22] ], + [ [21, 0], [ 0, 32] ], + [ [10, 5], [ 5, 10] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'random', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 2 + 'means': [ [1, 2], [16, -5], [10, 22] ], + 'covs': [ [ [70, 0], [ 0, 22] ], + [ [21, 0], [ 0, 32] ], + [ [10, 5], [ 5, 10] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 3 + 'means': [ [1, 2], [16, -5], [10, 22] ], + 'covs': [ [ [70, 0], [ 0, 22] ], + [ [21, 0], [ 0, 32] ], + [ [10, 5], [ 5, 10] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'random', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 4 + 'means': [ [1, 2], [16, -5], [10, 22] ], + 'covs': [ [ [70, 0], [ 0, 22] ], + [ [21, 0], [ 0, 32] ], + [ [10, 5], [ 5, 10] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 5 + 'means': [ [1, 5], [15, 10], [10, 20] ], + 'covs': [ [ [70, 0], [ 0, 70] ], + [ [30, 0], [ 0, 30] ], + [ [50, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 6 + 'means': [ [1, 5], [15, 10], [10, 20] ], + 'covs': [ [ [70, 0], [ 0, 70] ], + [ [30, 0], [ 0, 30] ], + [ [50, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 7 + 'means': [ [-10, 50], [30, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30, 20], [20, 60] ], + [ [20, 0], [ 0, 50] ] ], + 'n_data': [ 1000, 400, 100 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 8 + 'means': [ [-10, 50], [30, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30, 
20], [20, 60] ], + [ [20, 0], [ 0, 50] ] ], + 'n_data': [ 1000, 400, 100 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 9 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [50, 0], [ 0, 50] ], + [ [30, 0], [ 0, 30] ], + [ [50, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 10 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [50, 0], [ 0, 50] ], + [ [30, 0], [ 0, 30] ], + [ [50, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 11 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [30, 0], [ 0, 20] ], + [ [60, 0], [ 0, 50] ], + [ [70, 0], [ 0, 30] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 12 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [30, 0], [ 0, 20] ], + [ [60, 0], [ 0, 50] ], + [ [70, 0], [ 0, 30] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 13 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30,-20], [-20,60] ], + [ [20,-15], [-15, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 14 + 'means': [ [ 1, 30], [15, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30,-20], [-20,60] ], + [ [20,-15], [-15, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 15 + 'x_func': [ '20 * np.random.random((500,))', + '25 * np.random.random((500,)) + 3'], + 'y_func': [ '5 * np.ones(500)', + '2 * np.ones(500)'], + 'K': 2, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 16 + 'x_func': [ '20 * np.random.random((500,))', + '25 * np.random.random((500,)) + 3'], + 'y_func': [ '5 * np.ones(500)', + '2 * np.ones(500)'], + 'K': 2, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 17 + 'x_func': [ 'np.pi * np.random.random((500,))', + 'np.pi * np.random.random((500,))'], + 'y_func': [ 'np.sin(x) + 3', + '3 * np.sin(x + np.pi / 2) - 7'], + 'K': 2, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'None' + }, + { # 18 + 'x_func': [ 'np.pi * np.random.random((500,))', + 'np.pi * np.random.random((500,))'], + 'y_func': [ 'np.sin(x) + 3', + '3 * np.sin(x + np.pi / 2) - 7'], + 'K': 2, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'None' + }, + { # 19 + 'means': [ [-10, 50], [30, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30, 20], [20, 60] ], + [ [20, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans++', + 'cluster': 'KMeans', + 'optimizer':'GapsOptimizer' + }, + { # 20 + 'means': [ [-10, 50], [30, 10], [10, 20] ], + 'covs': [ [ [50, 25], [25, 40] ], + [ [30, 20], [20, 60] ], + [ [20, 0], [ 0, 50] ] ], + 'n_data': [ 500, 500, 500 ], + 'K': 3, + 'init': 'kmeans', + 'cluster': 'GMM', + 'optimizer':'GapsOptimizer' + }, +] + +def generate_data(means, covs, n_data): + ds, ls = [], [] + for i in range(len(means)): + d = np.random.multivariate_normal(means[i], covs[i], n_data[i]) + ds.append(d) + ls.append(np.ones((n_data[i],), dtype=int) * i) + data = np.concatenate(ds) + lable = np.concatenate(ls) + return data, lable + +def generate_data_by_func(x_func, y_func): + ds, ls = [], [] + for i in range(len(x_func)): + x = 
eval(x_func[i])
+        # note: the y_func expressions are evaluated with eval() and may reference the x
+        # values generated just above
+        y = eval(y_func[i])
+        ds.append(list(zip(x, y)))
+        ls.append(np.ones(len(x)) * i)
+    data = np.concatenate(ds)
+    label = np.concatenate(ls)
+    return data, label
+
+
+if __name__ == '__main__':
+    max_lab = len(l_config)
+
+    # parse user input
+    if len(sys.argv) > 1 and sys.argv[1].isnumeric():
+        lab_num = int(sys.argv[1])
+    else:
+        print('Usage: python source.py lab_num (1 ~ {0})'.format(max_lab))
+        sys.exit(-1)
+
+    if lab_num > max_lab or lab_num < 1:
+        print('Usage: python source.py lab_num (1 ~ {0})'.format(max_lab))
+        sys.exit(-1)
+
+    # prepare dataset
+    if lab_num < 15 or lab_num > 18:
+        n_data = np.array(l_config[lab_num - 1]['n_data'])
+        means = np.array(l_config[lab_num - 1]['means'])
+        covs = np.array(l_config[lab_num - 1]['covs'])
+        K = l_config[lab_num - 1]['K']
+        init = l_config[lab_num - 1]['init']
+        cluster = l_config[lab_num - 1]['cluster']
+        optimizer = l_config[lab_num - 1]['optimizer']
+        data, labels = generate_data(means, covs, n_data)
+    else:
+        x_func = l_config[lab_num - 1]['x_func']
+        y_func = l_config[lab_num - 1]['y_func']
+        K = l_config[lab_num - 1]['K']
+        init = l_config[lab_num - 1]['init']
+        cluster = l_config[lab_num - 1]['cluster']
+        optimizer = l_config[lab_num - 1]['optimizer']
+        data, labels = generate_data_by_func(x_func, y_func)
+
+    if optimizer == 'None':
+        model = KMeans(init=init) if cluster == 'KMeans' else GaussianMixture(init=init)
+        K, centroids_p, labels_p, errors_log = model.fit(data, K)
+    else:
+        if cluster == 'KMeans':
+            model = ClusteringAlgorithm('KMeans', 'GapsOptimizer')
+        else:
+            model = ClusteringAlgorithm('GaussianMixture', 'XICOptimizer')
+        K, centroids_p, labels_p, errors_log = model.fit(data)
+
+    plot_fit(cluster, K, data,
+             labels, #means,
+             errors_log,
+             labels_p, centroids_p,
+             )
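+
+# Example invocation (illustration added for documentation): experiment 19 runs KMeans with
+# the gap-statistic optimizer on the three-blob dataset defined in l_config[18]:
+#
+#   python source.py 19
+#
+# The script generates the configured dataset, fits the selected model, and shows the
+# scatter and loss plots produced by plot_fit().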