Kはエルボ法、BIC、分散の説明、およびシルエットを使用してKを選択するインコヒーレントな動作を意味します

K-meansで90個の特徴を持ついくつかのベクトルをクラスター化しようとしています。このアルゴリズムはクラスターの数を尋ねるので、いくつかの素晴らしい数学で選択を検証したいと思います。8〜10個のクラスターが必要です。機能はZスコアでスケーリングされています。

エルボ法と分散の説明

from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

K = range(1,50)
KM = [KMeans(n_clusters=k).fit(dt_trans) for k in K]
centroids = [k.cluster_centers_ for k in KM]

D_k = [cdist(dt_trans, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/dt_trans.shape[0] for d in dist]

# Total with-in sum of square
wcss = [sum(d**2) for d in dist]
tss = sum(pdist(dt_trans)**2)/dt_trans.shape[0]
bss = tss-wcss

kIdx = 10-1

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, bss/tss*100, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained')
plt.title('Elbow for KMeans clustering')

エルボ法

これらの2つの写真から、クラスターの数が止まることはないようです：D。奇妙な！肘はどこですか？Kを選択するにはどうすればよいですか？

ベイジアン情報基準

このメソッドはX-meansから直接提供され、BICを使用してクラスターの数を選択します。別の参照

    from sklearn.metrics import euclidean_distances
from sklearn.cluster import KMeans

def bic(clusters, centroids):
    num_points = sum(len(cluster) for cluster in clusters)
    num_dims = clusters[0][0].shape[0]
    log_likelihood = _loglikelihood(num_points, num_dims, clusters, centroids)
    num_params = _free_params(len(clusters), num_dims)
    return log_likelihood - num_params / 2.0 * np.log(num_points)


def _free_params(num_clusters, num_dims):
    return num_clusters * (num_dims + 1)


def _loglikelihood(num_points, num_dims, clusters, centroids):
    ll = 0
    for cluster in clusters:
        fRn = len(cluster)
        t1 = fRn * np.log(fRn)
        t2 = fRn * np.log(num_points)
        variance = _cluster_variance(num_points, clusters, centroids) or np.nextafter(0, 1)
        t3 = ((fRn * num_dims) / 2.0) * np.log((2.0 * np.pi) * variance)
        t4 = (fRn - 1.0) / 2.0
        ll += t1 - t2 - t3 - t4
    return ll

def _cluster_variance(num_points, clusters, centroids):
    s = 0
    denom = float(num_points - len(centroids))
    for cluster, centroid in zip(clusters, centroids):
        distances = euclidean_distances(cluster, centroid)
        s += (distances*distances).sum()
    return s / denom

from scipy.spatial import distance
def compute_bic(kmeans,X):
    """
    Computes the BIC metric for a given clusters

    Parameters:
    -----------------------------------------
    kmeans:  List of clustering object from scikit learn

    X     :  multidimension np array of data points

    Returns:
    -----------------------------------------
    BIC value
    """
    # assign centers and labels
    centers = [kmeans.cluster_centers_]
    labels  = kmeans.labels_
    #number of clusters
    m = kmeans.n_clusters
    # size of the clusters
    n = np.bincount(labels)
    #size of data set
    N, d = X.shape

    #compute variance for all clusters beforehand
    cl_var = (1.0 / (N - m) / d) * sum([sum(distance.cdist(X[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)])

    const_term = 0.5 * m * np.log(N) * (d+1)

    BIC = np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi*cl_var) -
             ((n[i] - 1) * d/ 2) for i in range(m)]) - const_term

    return(BIC)



sns.set_style("ticks")
sns.set_palette(sns.color_palette("Blues_r"))
bics = []
for n_clusters in range(2,50):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(dt_trans)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    clusters = {}
    for i,d in enumerate(kmeans.labels_):
        if d not in clusters:
            clusters[d] = []
        clusters[d].append(dt_trans[i])

    bics.append(compute_bic(kmeans,dt_trans))#-bic(clusters.values(), centroids))

plt.plot(bics)
plt.ylabel("BIC score")
plt.xlabel("k")
plt.title("BIC scoring for K-means cell's behaviour")
sns.despine()
#plt.savefig('figures/K-means-BIC.pdf', format='pdf', dpi=330,bbox_inches='tight')

ここに画像の説明を入力してください

ここで同じ問題... Kとは何ですか？

シルエット

    from sklearn.metrics import silhouette_score

s = []
for n_clusters in range(2,30):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(dt_trans)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    s.append(silhouette_score(dt_trans, labels, metric='euclidean'))

plt.plot(s)
plt.ylabel("Silouette")
plt.xlabel("k")
plt.title("Silouette for K-means cell's behaviour")
sns.despine()

ここに画像の説明を入力してください

アレルヤ！ここでは理にかなっているようで、これは私が期待することです。しかし、なぜこれが他のものと異なるのですか？

clustering k-means

— マルコデナ
ソース

分散の場合の膝についての質問に答えるために、それは約6または7のように見えます。曲線に対する2つの線形近似セグメント間のブレークポイントとして想像できます。グラフの形状は異常ではありません。分散はしばしば漸近的に100％に近づきます。私は5の周りに、少し低いとしてあなたBICグラフでKを入れたい

— image_doctor

しかし、すべての方法で（ほぼ）同じ結果が得られるはずですよね？

— -marcodena

私は言うほど十分に知っているとは思わない。3つの方法がすべてのデータと数学的に同等であるとは思えません。2つの方法は近いクラスター数を与え、3番目の方法はそれよりも高いですが、それほど大きくはありません。クラスターの実際の数に関する先験的な情報はありますか？

— image_doctor

100％確信はありませんが、

— 8〜10

あなたはすでに「次元の呪い」のブラックホールにいます。次元削減の前には何も機能しません。

— カスラマンシャエイ

回答:

この質問が「未回答の質問」から削除されるように、上記のコメントの要約といくつかの考えを投稿するだけです。

Image_doctorのコメントは、これらのグラフがk-meansに典型的であるということです。（ただし、「シルエット」の測定値には詳しくありません。）kの増加に伴い、クラスター内の分散は連続的に減少することが予想されます。肘は、曲線が最も曲がる場所です。（数学的なものが必要な場合は、「2次導関数」と考えてください。）

通常、最終タスクを使用してkを選択するのが最善です。クラスターの統計的指標を使用して決定を下すのではなく、システムのエンドツーエンドのパフォーマンスを使用して選択をガイドしてください。開始点としてのみ統計を使用します。

— ヨアヒム・ワグナー
ソース

連続するセグメント間の角度を計算することにより、肘を見つけるのがより簡単になります。

以下を交換してください：

kIdx = 10-1

で：

seg_threshold = 0.95 #Set this to your desired target

#The angle between three points
def segments_gain(p1, v, p2):
    vp1 = np.linalg.norm(p1 - v)
    vp2 = np.linalg.norm(p2 - v)
    p1p2 = np.linalg.norm(p1 - p2)
    return np.arccos((vp1**2 + vp2**2 - p1p2**2) / (2 * vp1 * vp2)) / np.pi

#Normalize the data
criterion = np.array(avgWithinSS)
criterion = (criterion - criterion.min()) / (criterion.max() - criterion.min())

#Compute the angles
seg_gains = np.array([0, ] + [segments_gain(*
        [np.array([K[j], criterion[j]]) for j in range(i-1, i+2)]
    ) for i in range(len(K) - 2)] + [np.nan, ])

#Get the first index satisfying the threshold
kIdx = np.argmax(seg_gains > seg_threshold)

次のようなものが表示されます。

seg_gainsを視覚化すると、次のようなものが表示されます。

私はあなたが今トリッキーな肘を見つけることができることを望みます:)

— サロール
ソース

このような関数で最大曲率のポイントを検出するために、Kneedleアルゴリズムを実装しようとするPython ライブラリを作成しました。でインストールできます。pip install kneed

4つの異なる形状の関数のコードと出力：

from kneed.data_generator import DataGenerator
from kneed.knee_locator import KneeLocator

import numpy as np

import matplotlib.pyplot as plt

# sample x and y
x = np.arange(0,10)
y_convex_inc = np.array([1,2,3,4,5,10,15,20,40,100])
y_convex_dec = y_convex_inc[::-1]
y_concave_dec = 100 - y_convex_inc
y_concave_inc = 100 - y_convex_dec

# find the knee points
kn = KneeLocator(x, y_convex_inc, curve='convex', direction='increasing')
knee_yconvinc = kn.knee

kn = KneeLocator(x, y_convex_dec, curve='convex', direction='decreasing')
knee_yconvdec = kn.knee

kn = KneeLocator(x, y_concave_inc, curve='concave', direction='increasing')
knee_yconcinc = kn.knee

kn = KneeLocator(x, y_concave_dec, curve='concave', direction='decreasing')
knee_yconcdec = kn.knee

# plot
f, axes = plt.subplots(2, 2, figsize=(10,10));
yconvinc = axes[0][0]
yconvdec = axes[0][1]
yconcinc = axes[1][0]
yconcdec = axes[1][1]

yconvinc.plot(x, y_convex_inc)
yconvinc.vlines(x=knee_yconvinc, ymin=0, ymax=100, linestyle='--')
yconvinc.set_title("curve='convex', direction='increasing'")

yconvdec.plot(x, y_convex_dec)
yconvdec.vlines(x=knee_yconvdec, ymin=0, ymax=100, linestyle='--')
yconvdec.set_title("curve='convex', direction='decreasing'")

yconcinc.plot(x, y_concave_inc)
yconcinc.vlines(x=knee_yconcinc, ymin=0, ymax=100, linestyle='--')
yconcinc.set_title("curve='concave', direction='increasing'")

yconcdec.plot(x, y_concave_dec)
yconcdec.vlines(x=knee_yconcdec, ymin=0, ymax=100, linestyle='--')
yconcdec.set_title("curve='concave', direction='decreasing'");

— ケビン
ソース