一維陣列的聚類

聚類分析 · 發表 2018-09-23 08:47:58

摘要：需求：分析訂單的價格分佈方案：按照100為梯度，分析不同價格區間的訂單量缺陷：現實生活中，定價存在一些自然的價格分隔，如果按照步距劃分可能存在一些偏差，比如airbnb的價格篩選顯示出的房價分佈：解決上述缺陷最好的方式是對價格進行聚類，找出做合適的價格區間。在...

需求：分析訂單的價格分佈

方案：按照100為梯度，分析不同價格區間的訂單量

缺陷：現實生活中，定價存在一些自然的價格分隔，如果按照步距劃分可能存在一些偏差，比如airbnb的價格篩選顯示出的房價分佈：

解決上述缺陷最好的方式是對價格進行聚類，找出做合適的價格區間。

在學習聚類演算法的過程中，學習到的聚類演算法大部分都是針對n維的，針對一維資料的聚類方式較少，今天就來學習下如何給一維的資料進行聚類。

方案一：採用 ofollow,noindex" target="_blank">K-Means 對一維資料聚類

Python程式碼如下：

from sklearn.cluster import KMeans
import numpy as np
x = np.random.random(10000)
y = x.reshape(-1,1)
km = KMeans()
km.fit(y)

核心的操作是y = x.reshape(-1,1)，含義為將一維資料變成只有1列，行數不知道多少（-1代表根據剩下的維度計算出陣列的另外一個shape屬性值）。

方案二：採用一維聚類方法Jenks Natural Breaks

Jenks Natural Breaks（自然斷點分類）。一般來說，分類的原則就是差不多的放在一起，分成若干類。統計上可以用方差來衡量，通過計算每類的方差，再計算這些方差之和，用方差和的大小來比較分類的好壞。因而需要計算各種分類的方差和，其值最小的就是最優的分類結果（但並不唯一）。這也是自然斷點分類法的原理。另外，當你去看資料的分佈時，可以比較明顯的發現斷裂之處，這些斷裂之處和Jenks Natural Breaks方法算出來也是一致的。因而這種分類法很“自然”。

Jenks Natural Breaks和K Means在一維資料時，完全等價。它們的目標函式一樣，但是演算法的步驟不完全相同。K Means是先設定好K個初始隨機點。而Jenks Breaks則是用遍歷的方法，一個點一個點地移動，直到達到最小值。

Natural Breaks演算法又有兩種：

Jenks-Caspall algorithm(1971)，是Jenks和Caspall發明的演算法。原理就如前所述，實現的時候要將每種分類情況都計算一遍，找到方差和最小的那一種，計算量極大。n個數分成k類，就要從n-1個數中找k-1個組合，這個數目是很驚人的。資料量較大時，如果分類又多，以當時的計算機水平根本不能窮舉各種可能性。
Fisher-Jenks algorithm(1977)，Fisher(1958)發明了一種演算法提高計算效率，不需要進行窮舉。Jenks將這種方法引入到資料分類中。但後來者幾乎只知道Jenks而不知Fisher了。

具體演算法實現：

Jenks-Caspall algorithm： https://github.com/domlysz/Jenks-Caspall.py
Fisher-Jenks algorithm： https://github.com/mthh/jenkspy

和K-Means一樣，使用Jenks Natural Breaks需要先確定聚類數量K值。常見的方法是：GVF（The Goodness of Variance Fit）。GVF，翻譯過來是“方差擬合優度”，公式如下：

其中，SDAM是the Sum of squared Deviations from the Array Mean，即原始資料的方差；SDCM是the Sum of squared Deviations about Class Mean，即每一類方差的和。顯然，SDAM是一個常數，而SDCM與分類數k有關。一定範圍內，GVF越大，分類效果越好。SDCM越小，GVF越大，越接近於1。而SDCM隨k的增大而大，當k等於n時，SDMC=0，GVF=1。

GVF用於判定不同分類數的分類效果好壞。以k和GVF做圖可得：

隨著k的增大，GVF曲線變得越來越平緩。特別是在紅線處（k=5），曲線變得基本平坦（之前起伏較大，之後起伏較小），k(5)也不是很大，所以可以分為5類。一般來說，GVF>0.7就可以接受了，當然越高越好，但一定要考慮k不能太大。顯然，這是一個經驗公式，但總比沒有好吧。

程式碼示例：

from jenkspy import jenks_breaks
import numpy as np


def goodness_of_variance_fit(array, classes):
# get the break points
classes = jenks_breaks(array, classes)

# do the actual classification
classified = np.array([classify(i, classes) for i in array])

# max value of zones
maxz = max(classified)

# nested list of zone indices
zone_indices = [[idx for idx, val in enumerate(classified) if zone + 1 == val] for zone in range(maxz)]

# sum of squared deviations from array mean
sdam = np.sum((array - array.mean()) ** 2)

# sorted polygon stats
array_sort = [np.array([array[index] for index in zone]) for zone in zone_indices]

# sum of squared deviations of class means
sdcm = sum([np.sum((classified - classified.mean()) ** 2) for classified in array_sort])

# goodness of variance fit
gvf = (sdam - sdcm) / sdam

return gvf


def classify(value, breaks):
for i in range(1, len(breaks)):
if value < breaks[i]:
return i
return len(breaks) - 1


if __name__ == '__main__':
gvf = 0.0
nclasses = 2
array = np.random.random(10000)
while gvf < .8:
gvf = goodness_of_variance_fit(array, nclasses)
print(nclasses, gvf)
nclasses += 1

參考連結：

方案三：核密度估計Kernel Density Estimation

所謂核密度估計，就是採用平滑的峰值函式(“核”)來擬合觀察到的資料點，從而對真實的概率分佈曲線進行模擬。核密度估計更多詳細內容，可以參考先前的 Mean Shift聚類中的相關說明。

使用示例：

import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
from sklearn.neighbors.kde import KernelDensity

a = np.array([10, 11, 9, 23, 21, 11, 45, 20, 11, 12]).reshape(-1, 1)
kde = KernelDensity(kernel='gaussian', bandwidth=3).fit(a)
s = np.linspace(0, 50)
e = kde.score_samples(s.reshape(-1, 1))
plt.plot(s, e)
plt.show()

mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
print("Minima:", s[mi])
print("Maxima:", s[ma])
print(a[a < mi[0]], a[(a >= mi[0]) * (a <= mi[1])], a[a >= mi[1]])

plt.plot(s[:mi[0] + 1], e[:mi[0] + 1], 'r',
s[mi[0]:mi[1] + 1], e[mi[0]:mi[1] + 1], 'g',
s[mi[1]:], e[mi[1]:], 'b',
s[ma], e[ma], 'go',
s[mi], e[mi], 'ro')
plt.show()

輸出內容：

Minima: [17.34693878 33.67346939]
Maxima: [10.20408163 21.42857143 44.89795918]
[10 119 11 11 12] [23 21 20] [45]

參考連結：

https://en.wikipedia.org/wiki/Kernel_density_estimation
scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html" rel="nofollow,noindex" target="_blank">http://scikit-learn.org/stable/auto_examples/neighbors/plot_kde_1d.html
https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

一維陣列的聚類

方案一：採用 ofollow,noindex" target="_blank">K-Means 對一維資料聚類

方案二：採用一維聚類方法Jenks Natural Breaks

方案三：核密度估計Kernel Density Estimation

您可能也會喜歡…