Choosing the right linkage method for hierarchical clustering

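I am performing hierarchical clustering on data I've gathered and processed from the reddit data dump on Google BigQuery.

My process is the following:

Get the latest 1000 posts in /r/politics
Gather all the comments
Process the data and compute an n x m data matrix (n: users/samples, m: posts/features)
Calculate the distance matrix for hierarchical clustering
Choose a linkage method and perform the hierarchical clustering
Plot the data as a dendrogram

My question is, how is the best linkage method determined? I'm currently using Ward, but how do I know whether I should be using single, complete, average, etc.?

I'm very new to this stuff, and I can't find a clear answer online, as I'm not sure there is one. So what might be a good idea for my application? Note that the data is relatively sparse, in the sense that the n x m matrix has a lot of zeros (most people don't comment on more than a few posts).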

— Kevin Eger

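Setting aside the specific linkage issues, what does "best" mean in your context?

— gung - Reinstate Monica

Best for me is finding the most logical way to link my kind of data, i.e., which approach accurately defines what is meant by "distance" within my features.

— Kevin Eger

Kevin, please take a look at this answer and this recent question. You will learn that the question you are raising ("which method to use") is not an easy one. You should definitely read the literature on clustering (at least the hierarchical kind) before you will be able to see the differences between the methods and choose among them. Data analysis should not be treated offhand.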

— ttnphns

@ttnphns, thanks for the link - was a good read and I'll take those points into consideration.

— Kevin Eger

Methods overview

A short reference on some linkage methods of hierarchical agglomerative cluster analysis (HAC).

The basic version of the HAC algorithm is generic: at each step it updates, by the Lance-Williams formula, the proximities between the emergent cluster (merged from two) and all the other clusters (including singleton objects) existing so far. Implementations that do not use the Lance-Williams formula exist, but using it is convenient: it lets one code various linkage methods from the same template.

The recurrence formula includes several parameters (alpha, beta, gamma). Depending on the linkage method, the parameters are set differently, so the unwrapped formula takes a method-specific form. Many texts on HAC show the formula and its method-specific forms and explain the methods. I would recommend the articles by Janos Podani as very thorough.
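For reference, the recurrence is usually written as below. The notation is my own choice, not taken from a particular text: $d_{k(ij)}$ is the proximity between an existing cluster $k$ and the cluster formed by merging $i$ and $j$, and $n_i$, $n_j$, $n_k$ are cluster sizes. The parameter rows are the commonly cited settings; those for centroid, median and Ward assume squared euclidean distances as input.

```latex
\[
d_{k(ij)} = \alpha_i\, d_{ki} + \alpha_j\, d_{kj} + \beta\, d_{ij}
          + \gamma\, \lvert d_{ki} - d_{kj} \rvert
\]

\[
\begin{array}{lcccc}
\text{Method} & \alpha_i & \alpha_j & \beta & \gamma \\
\hline
\text{Single linkage}   & 1/2 & 1/2 & 0 & -1/2 \\
\text{Complete linkage} & 1/2 & 1/2 & 0 & +1/2 \\
\text{UPGMA} & \dfrac{n_i}{n_i+n_j} & \dfrac{n_j}{n_i+n_j} & 0 & 0 \\
\text{WPGMA} & 1/2 & 1/2 & 0 & 0 \\
\text{Centroid (UPGMC)} & \dfrac{n_i}{n_i+n_j} & \dfrac{n_j}{n_i+n_j}
                        & -\dfrac{n_i n_j}{(n_i+n_j)^2} & 0 \\
\text{Median (WPGMC)}   & 1/2 & 1/2 & -1/4 & 0 \\
\text{Ward} & \dfrac{n_i+n_k}{n_i+n_j+n_k} & \dfrac{n_j+n_k}{n_i+n_j+n_k}
            & -\dfrac{n_k}{n_i+n_j+n_k} & 0
\end{array}
\]
```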

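The room and the need for different methods arise from the fact that a proximity (distance or similarity) between two clusters, or between a cluster and a singleton object, can be formulated in many different ways. HAC merges, at each step, the two closest clusters or points, but how to compute that proximity, given that the input proximity matrix was defined between singleton objects only, is the problem to be formulated.

So the methods differ in how they define the proximity between any two clusters at each step. The "colligation coefficient" (output in the agglomeration schedule/history and forming the "Y" axis of a dendrogram) is just the proximity between the two clusters merged at a given step.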
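To make the "same template" idea concrete, here is a minimal sketch of a generic agglomeration loop driven by the Lance-Williams update. It is an illustration under my own assumptions, not the code of any package: the function names `lw_params` and `hac`, the method labels, and the toy distance matrix (four points on a line at 0, 1, 3 and 7) are all hypothetical, and only four non-geometric methods are covered.

```python
from itertools import combinations


def lw_params(method, n_i, n_j):
    """Return (alpha_i, alpha_j, beta, gamma) for merging clusters of size n_i, n_j."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":  # UPGMA: weights proportional to cluster sizes
        n = n_i + n_j
        return n_i / n, n_j / n, 0.0, 0.0
    if method == "weighted":  # WPGMA: equal weights regardless of size
        return 0.5, 0.5, 0.0, 0.0
    raise ValueError(f"unknown method: {method}")


def hac(dist, method):
    """Agglomerate from a symmetric distance matrix given as a list of lists.

    Returns the merge history as (members_a, members_b, coefficient) tuples,
    where the coefficient is the proximity of the two clusters being merged.
    """
    n = len(dist)
    clusters = {i: (i,) for i in range(n)}  # active cluster id -> member tuple
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(range(n), 2)}
    history = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of currently active clusters.
        i, j = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        a_i, a_j, beta, gamma = lw_params(method, len(clusters[i]), len(clusters[j]))
        # Lance-Williams update: proximity of the new cluster to every other one.
        for k in clusters:
            if k in (i, j):
                continue
            d_ki, d_kj = d[frozenset((k, i))], d[frozenset((k, j))]
            d[frozenset((k, next_id))] = (
                a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)
            )
        history.append((clusters[i], clusters[j], d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return history


# Toy input: pairwise distances between points at 0, 1, 3 and 7 on a line.
dist = [
    [0.0, 1.0, 3.0, 7.0],
    [1.0, 0.0, 2.0, 6.0],
    [3.0, 2.0, 0.0, 4.0],
    [7.0, 6.0, 4.0, 0.0],
]
for method in ("single", "complete", "average", "weighted"):
    coefficients = [h for _, _, h in hac(dist, method)]
    print(method, coefficients)
```

Only `lw_params` changes from method to method; the loop itself stays the same, which is the convenience the formula offers.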

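Method of single linkage or nearest neighbour. Proximity between two clusters is the proximity between their two closest objects. This value is one of the values of the input matrix. The conceptual metaphor for a cluster built this way, its archetype, is a spectrum or chain. Chains can be straight or curvilinear, or can look like a "snowflake" or an "amoeba". The two most dissimilar cluster members can happen to be very dissimilar compared with the two most similar ones. The single linkage method controls only nearest-neighbour similarity.
Method of complete linkage or farthest neighbour. Proximity between two clusters is the proximity between their two most distant objects. This value is one of the values of the input matrix. The metaphor for a cluster built this way is a circle (in the sense of a group, by hobby or plot), where the two members most distant from each other cannot be much more dissimilar than other quite dissimilar pairs (as in a circle). Such clusters have "compact" contours at their borders, but they are not necessarily compact inside.
Method of between-group average linkage (UPGMA). Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side. The metaphor for a cluster built this way is quite generic, just a united class or close-knit collective; the method is frequently set as the default in hierarchical clustering packages. Clusters of miscellaneous shapes and outlines can be produced.
Method of simple average, or equilibrious between-group average linkage (WPGMA), is a modification of the previous one. Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side, while the subclusters from which each of these two clusters was recently merged have equalized influence on that proximity, even if the subclusters differed in the number of objects.
Method of within-group average linkage (MNDIS). Proximity between two clusters is the arithmetic mean of all the proximities in their joint cluster. This method is an alternative to UPGMA. It will usually lose to UPGMA in terms of cluster density, but will sometimes uncover cluster shapes that UPGMA will not.
Centroid method (UPGMC). Proximity between two clusters is the proximity between their geometric centroids: the [squared] euclidean distance between them. The metaphor for a cluster built this way is proximity of platforms (politics). As in political parties, such clusters can have fractions or "factions", but unless their central figures are far apart from each other the union is consistent. Clusters can be of various outlines.
Median, or equilibrious centroid method (WPGMC), is a modification of the previous one. Proximity between two clusters is the proximity between their geometric centroids (the [squared] euclidean distance between them), while the centroids are defined so that the subclusters from which each of these two clusters was recently merged have equalized influence on its centroid, even if the subclusters differed in the number of objects.
Ward's method, or minimal increase of sum-of-squares (MISSQ), sometimes called the "minimum variance" method. Proximity between two clusters is the magnitude by which the summed square in their joint cluster will be greater than the combined summed square in these two clusters: $SS_{12}-(SS_1+SS_2)$ . (Between two singleton objects this quantity = squared euclidean distance / $2$ .) The metaphor for a cluster built this way is a type. Intuitively, a type is a cloud that is denser and more concentric towards its middle, whereas marginal points are few and can be scattered relatively freely.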

Some of the less well-known methods (see Podani J. New combinatorial clustering methods // Vegetatio, 1989, 81: 61-77.) [also implemented by me as an SPSS macro found on my web-page]:

Method of minimal sum-of-squares (MNSSQ). Proximity between two clusters is the summed square in their joint cluster: $SS_{12}$ . (Between two singleton objects this quantity = squared euclidean distance / $2$ .)
Method of minimal increase of variance (MIVAR). Proximity between two clusters is the magnitude by which the mean square in their joint cluster will be greater than the mean square in these two clusters averaged with weights equal to their numbers of objects: $MS_{12}-(n_1MS_1+n_2MS_2)/(n_1+n_2) = [SS_{12}-(SS_1+SS_2)]/(n_1+n_2)$ . (Between two singleton objects this quantity = squared euclidean distance / $4$ .)
Method of minimal variance (MNVAR). Proximity between two clusters is the mean square in their joint cluster: $MS_{12} = SS_{12}/(n_1+n_2)$ . (Between two singleton objects this quantity = squared euclidean distance / $4$ .)
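The four sum-of-squares proximities defined above (MISSQ/Ward, MNSSQ, MIVAR, MNVAR) can be checked numerically from raw coordinates. The sketch below is only an illustration: the helper names `centroid`, `ss` and `proximities` and the example points are my own assumptions, and the proximities are computed directly from the definitions rather than by any package's update formula.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]


def ss(points):
    """Summed squared deviations of the points from their centroid (SS)."""
    c = centroid(points)
    return sum((p[k] - c[k]) ** 2 for p in points for k in range(len(c)))


def proximities(a, b):
    """The four sum-of-squares based proximities between clusters a and b."""
    n1, n2 = len(a), len(b)
    ss1, ss2, ss12 = ss(a), ss(b), ss(a + b)
    return {
        "MISSQ": ss12 - (ss1 + ss2),                # Ward: increase of SS
        "MNSSQ": ss12,                              # SS of the joint cluster
        "MIVAR": (ss12 - (ss1 + ss2)) / (n1 + n2),  # increase of mean square
        "MNVAR": ss12 / (n1 + n2),                  # mean square of joint cluster
    }


# Two singletons whose squared euclidean distance is 3**2 + 4**2 = 25.
singles = proximities([(0.0, 0.0)], [(3.0, 4.0)])
print(singles)

# Two larger clusters (hypothetical coordinates).
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (5.0, 2.0), (5.0, 4.0)]
larger = proximities(a, b)
print(larger)
```

For the two singletons, MISSQ and MNSSQ come out as 25 / 2 and MIVAR and MNVAR as 25 / 4, matching the parenthetical remarks in the definitions.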

The first 5 methods permit any proximity measure (any similarities or distances), and the results will, naturally, depend on the measure chosen.

The last 6 methods require distances, and the fully correct choice is to use only squared euclidean distances with them, because these methods compute centroids in euclidean space. Distances should therefore be euclidean for the sake of geometric correctness (these 6 methods are together called geometric linkage methods). At worst, you might input other metric distances, accepting a more heuristic, less rigorous analysis. Now about that "squared". Computing centroids and deviations from them is mathematically and programmatically most convenient on squared distances, which is why HAC packages usually require squared distances as input and are tuned to process them. However, there exist implementations, fully equivalent yet a bit slower, that are based on nonsquared distance input and require it; see for example the "Ward-2" implementation of Ward's method. You should consult the documentation of your clustering program to learn which distances, squared or not, it expects as input to a "geometric method" in order to do it right.
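A small numeric check shows why the squared/nonsquared distinction matters. The sketch below applies the commonly cited Lance-Williams update for the centroid method to three hypothetical points, once with squared and once with plain euclidean distances; the function names and the coordinates are my own assumptions, chosen only for illustration.

```python
import math


def sq_dist(p, q):
    """Squared euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Lance-Williams update for the centroid method (UPGMC)."""
    n = n_i + n_j
    return (n_i / n) * d_ki + (n_j / n) * d_kj - (n_i * n_j / n ** 2) * d_ij


# Singletons i and j are merged; k is a third object.
i, j, k = (0.0, 0.0), (4.0, 0.0), (2.0, 3.0)
centre_ij = (2.0, 0.0)           # geometric centroid of the cluster {i, j}
true_sq = sq_dist(k, centre_ij)  # what the centroid method is meant to track

# Fed with squared distances, the update reproduces the true value.
from_squared = centroid_update(sq_dist(k, i), sq_dist(k, j), sq_dist(i, j), 1, 1)

# Fed with plain distances, the same update matches neither the true
# distance nor the true squared distance.
from_plain = centroid_update(
    math.sqrt(sq_dist(k, i)), math.sqrt(sq_dist(k, j)), math.sqrt(sq_dist(i, j)), 1, 1
)

print(true_sq, from_squared, from_plain)
```

With squared input the update matches the squared distance from k to the centroid of {i, j}; with plain input it does not correspond to that geometry, which is the "more heuristic, less rigorous" situation described above.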

Methods MNDIS, MNSSQ, and MNVAR require, at each step, storing a within-cluster statistic (which depends on the method) in addition to updating by the Lance-Williams formula.

The methods most frequently used in studies where clusters are expected to be solid, more or less round clouds are the average linkage methods, the complete linkage method, and Ward's method.

Ward's method is the closest, by its properties and efficiency, to K-means clustering; they share the same objective function: minimization of the pooled within-cluster SS "in the end". Of course, K-means (being iterative, and if provided with decent initial centroids) is usually a better minimizer of it than Ward. However, Ward seems to me a bit more accurate than K-means in uncovering clusters of uneven physical sizes (variances) or clusters thrown about space very irregularly. The MIVAR method is weird to me; I can't imagine when it could be recommended, as it doesn't produce dense enough clusters.

The centroid, median, and minimal increase of variance methods may sometimes give so-called reversals: a phenomenon in which the two clusters being merged at some step appear closer to each other than pairs of clusters merged earlier. This is because these methods are not among the so-called ultrametric ones. The situation is inconvenient but is theoretically OK.
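A reversal is easy to reproduce with the centroid method on three points forming a nearly equilateral triangle. The sketch below is a naive centroid linkage working directly on coordinates (not via the Lance-Williams formula); the function names and the three points are hypothetical, picked only to trigger the effect.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def sq_dist(p, q):
    """Squared euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid_linkage_heights(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the squared centroid distance at each merge, in merge order.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = sq_dist(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        heights.append(h)
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return heights


# A and B are 2 apart; C is slightly farther than 2 from each of them.
heights = centroid_linkage_heights([(0.0, 0.0), (2.0, 0.0), (1.0, 1.8)])
print(heights)
print("reversal:", heights[1] < heights[0])
```

A and B merge first at squared distance 4, but their centroid (1, 0) is only 1.8 away from C, so the second merge happens at a lower height than the first, and the dendrogram branch would fold back.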

The single linkage and centroid methods belong to the so-called space contracting, or "chaining", methods. Roughly speaking, this means that they tend to attach objects one by one to clusters, and so they show relatively smooth growth of the curve "% of clustered objects". In contrast, the complete linkage, Ward's, sum-of-squares, increase of variance, and variance methods commonly get a considerable share of the objects clustered even at early steps, and then proceed to merge those clusters; their curve "% of clustered objects" is therefore steep from the first steps. These methods are called space dilating. Other methods fall in between.

Flexible versions. By adding an additional parameter to the Lance-Williams formula it is possible to make a method specifically self-tuning across its steps. The parameter introduces a correction to the between-cluster proximity being computed, which depends on the size (amount of de-compactness) of the clusters. The meaning of the parameter is that it makes the method of agglomeration more space dilating or more space contracting than the standard method is doomed to be. The best-known implementation of flexibility so far is for the average linkage methods UPGMA and WPGMA (Belbin, L. et al. A Comparison of Two Approaches to Beta-Flexible Clustering // Multivariate Behavioral Research, 1992, 27, 417–433.).
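As a rough illustration, the beta-flexible form as it is commonly presented (my assumption about what is meant here; consult the cited paper for the exact definitions) drops the gamma term and ties the alpha weights to a user-chosen $\beta < 1$. The notation $d_{k(ij)}$, $n_i$, $n_j$ is my own.

```latex
\[
d_{k(ij)} = \alpha_i\, d_{ki} + \alpha_j\, d_{kj} + \beta\, d_{ij},
\qquad \alpha_i + \alpha_j + \beta = 1
\]

\[
\text{flexible WPGMA: } \alpha_i = \alpha_j = \frac{1-\beta}{2},
\qquad
\text{flexible UPGMA: } \alpha_i = (1-\beta)\,\frac{n_i}{n_i+n_j},\;
\alpha_j = (1-\beta)\,\frac{n_j}{n_i+n_j}
\]
```

Setting $\beta = 0$ recovers the standard WPGMA and UPGMA; negative values of $\beta$ push the method towards space dilating behaviour, and positive values towards space contracting behaviour.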

Dendrogram. The "Y" axis of a dendrogram typically displays the proximity between the merging clusters, as defined by the methods above. Therefore, for example, in the centroid method the squared distance is typically gauged (ultimately, it depends on the package and its options); some researchers are not aware of that. Also, by tradition, with methods based on an increment of nondensity, such as Ward's, the dendrogram usually shows the cumulative value, more for convenience than for theoretical reasons. Thus, in many packages, the plotted coefficient in Ward's method represents the overall within-cluster sum-of-squares, across all clusters, observed at the moment of a given step.

One should refrain from judging which linkage method is "better" for one's data by comparing the looks of the dendrograms: not only because the looks change when you change which modification of the coefficient you plot there, as just described, but also because the looks will differ even on data with no clusters.

To choose the "right" method

There is no single criterion. Some guidelines how to go about selecting a method of cluster analysis (including a linkage method in HAC as a particular case) are outlined in this answer and the whole thread therein.

— ttnphns

The correlation between the distance matrix and the cophenetic distance is one metric to help assess which clustering linkage to select. From ?cophenetic:

It can be argued that a dendrogram is an appropriate summary of some data if the correlation between the original distances and the cophenetic distances is high.

This use of cor(dist,cophenetic(hclust(dist))) as a linkage selection metric is referenced in pg 38 of this vegan vignette.

See example code below:

# Data
d0=dist(USArrests)

# Hierarchical Agglomerative Clustering
h1=hclust(d0,method='average')
h2=hclust(d0,method='complete')
h3=hclust(d0,method='ward.D')
h4=hclust(d0,method='single')

# Cophenetic Distances, for each linkage
c1=cophenetic(h1)
c2=cophenetic(h2)
c3=cophenetic(h3)
c4=cophenetic(h4)

# Correlations
cor(d0,c1) # 0.7658983
cor(d0,c2) # 0.7636926
cor(d0,c3) # 0.7553367
cor(d0,c4) # 0.5702505

# Dendrograms
par(mfrow=c(2,2))
plot(h1,main='Average Linkage')
plot(h2,main='Complete Linkage')
plot(h3,main='Ward Linkage')
plot(h4,main='Single Linkage')
par(mfrow=c(1,1))

We see that the correlations for average and complete are extremely similar, and their dendrograms appear very similar. The correlation for ward is similar to average and complete, but the dendrogram looks fairly different. single linkage is doing its own thing. Best professional judgement from a subject matter expert, or precedent favouring a certain linkage in the field of interest, should probably override numeric output from cor().
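For readers not working in R, the same idea can be sketched without any library. This is not a port of hclust or cophenetic: the function names, the naive clustering loop (which recomputes cluster proximities from the member pairs at every step) and the toy distance matrix (points at 0, 1, 3 and 7 on a line) are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean

# Between-cluster proximity as a summary of all cross-cluster distances.
LINKAGES = {"single": min, "complete": max, "average": mean}


def between(a, b, dist, agg):
    """Proximity between clusters a and b (tuples of object indices)."""
    return agg([dist[x][y] for x in a for y in b])


def cophenetic(dist, linkage):
    """Map each object pair to the height at which the two first share a cluster."""
    agg = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(dist))]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: between(p[0], p[1], dist, agg))
        height = between(a, b, dist, agg)
        for x in a:
            for y in b:
                coph[frozenset((x, y))] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return coph


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def cophenetic_correlation(dist, linkage):
    """Correlation between the original and the cophenetic distances."""
    coph = cophenetic(dist, linkage)
    pairs = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in pairs], [coph[frozenset(p)] for p in pairs])


dist = [
    [0.0, 1.0, 3.0, 7.0],
    [1.0, 0.0, 2.0, 6.0],
    [3.0, 2.0, 0.0, 4.0],
    [7.0, 6.0, 4.0, 0.0],
]
r = {name: cophenetic_correlation(dist, name) for name in LINKAGES}
print(r)
```

On real data the naive loop is slow, so a library routine would be the practical choice; the sketch is only meant to show what the correlation compares.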

— kakarot