Methods overview
A short reference on some linkage methods of hierarchical agglomerative cluster analysis (HAC).
The basic version of the HAC algorithm is generic: at each step it updates, by the formula known as the Lance-Williams formula, the proximities between the newly merged cluster and all the other clusters (including singleton objects) existing so far. There exist implementations that do not use the Lance-Williams formula, but using it is convenient: it lets one code various linkage methods from the same template.
The recurrence formula includes several parameters (alpha, beta, gamma). Depending on the linkage method, the parameters are set differently, so the unwrapped formula takes a method-specific form. Many texts on HAC show the formula and its method-specific forms and explain the methods. I would recommend the articles by Janos Podani as very thorough.
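As a rough illustration of the "same template" idea, the recurrence is commonly written as $d(k, i \cup j) = \alpha_i d(k,i) + \alpha_j d(k,j) + \beta d(i,j) + \gamma |d(k,i) - d(k,j)|$, where $i$ and $j$ are the clusters just merged and $k$ is any other cluster. The sketch below is not taken from any particular package; the function names `lw_coefficients` and `lw_update` are made up for this example, and the coefficient values are the usual textbook ones. For the centroid, median, and Ward entries the inputs are assumed to be squared euclidean distances.

```python
def lw_coefficients(method, n_i, n_j, n_k):
    """Return (alpha_i, alpha_j, beta, gamma) for merging clusters i and j,
    as seen from another cluster k. n_i, n_j, n_k are cluster sizes."""
    n_ij = n_i + n_j
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "upgma":
        return n_i / n_ij, n_j / n_ij, 0.0, 0.0
    if method == "wpgma":
        return 0.5, 0.5, 0.0, 0.0
    if method == "centroid":  # assumes squared euclidean distances
        return n_i / n_ij, n_j / n_ij, -n_i * n_j / n_ij ** 2, 0.0
    if method == "median":    # assumes squared euclidean distances
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":      # assumes squared euclidean distances
        n = n_ij + n_k
        return (n_i + n_k) / n, (n_j + n_k) / n, -n_k / n, 0.0
    raise ValueError(f"unknown method: {method}")


def lw_update(method, d_ki, d_kj, d_ij, n_i=1, n_j=1, n_k=1):
    """Proximity between cluster k and the cluster obtained by merging i and j."""
    a_i, a_j, beta, gamma = lw_coefficients(method, n_i, n_j, n_k)
    return a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)


# With d(k,i)=3, d(k,j)=5, d(i,j)=2 the template reproduces min and max:
single = lw_update("single", 3, 5, 2)      # nearest neighbour: min(3, 5)
complete = lw_update("complete", 3, 5, 2)  # farthest neighbour: max(3, 5)
print(single, complete)
```

Only the coefficient table changes from method to method; the update line itself stays the same, which is the convenience referred to above.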
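The room and the need for different methods arise from the fact that a proximity (distance or similarity) between two clusters, or between a cluster and a singleton object, can be formulated in many different ways. HAC merges the two closest clusters or points at each step, but how to compute that proximity, given that the input proximity matrix was defined between singleton objects only, is the problem to be formulated.
So the methods differ in how they define the proximity between any two clusters at each step. The "colligation coefficient" (output in the agglomeration schedule/history and forming the "Y" axis of a dendrogram) is simply the proximity between the two clusters merged at a given step.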
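Method of single linkage or nearest neighbour. Proximity between two clusters is the proximity between their two closest objects. This value is one of the values of the input matrix. The conceptual metaphor of this build of cluster, its archetype, is a spectrum or chain. Chains can be straight or curvilinear, or can look like a "snowflake" or an "amoeba". The two most dissimilar cluster members can turn out to be very dissimilar compared with the two most similar ones. The single linkage method controls only nearest-neighbour similarity.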
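Method of complete linkage or farthest neighbour. Proximity between two clusters is the proximity between their two most distant objects. This value is one of the values of the input matrix. The metaphor of this build of cluster is a circle (in the sense of a circle of people united by a hobby or a plot), where the two members most distant from each other cannot be much more dissimilar than other quite dissimilar pairs (as in a circle). Such clusters are "compact" in outline along their borders, but they are not necessarily compact inside.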
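A minimal sketch of these two definitions, assuming distances (not similarities) and a toy one-dimensional data set invented for the example; `single_link` and `complete_link` are illustrative names, not functions of any package. It shows that both proximities are simply picked from the input matrix.

```python
from itertools import product

# Toy data: four objects on a line; the input matrix holds their pairwise distances.
coords = [0, 1, 2, 10]
dist = [[abs(x - y) for y in coords] for x in coords]


def single_link(a, b):
    """Distance between the two closest objects, one taken from each cluster."""
    return min(dist[i][j] for i, j in product(a, b))


def complete_link(a, b):
    """Distance between the two most distant objects, one taken from each cluster."""
    return max(dist[i][j] for i, j in product(a, b))


cluster_a = [0, 1, 2]  # indices of the objects at 0, 1, 2
cluster_b = [3]        # index of the object at 10

print(single_link(cluster_a, cluster_b))    # distance between the objects at 2 and 10
print(complete_link(cluster_a, cluster_b))  # distance between the objects at 0 and 10
```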
Method of between-group average linkage (UPGMA). Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side. The metaphor of this build of cluster is quite generic, just a united class or close-knit collective; and the method is frequently set as the default in hierarchical clustering packages. Clusters of miscellaneous shapes and outlines can be produced.
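Method of simple average, or equilibrious between-group average linkage (WPGMA), is a modification of the previous one. Proximity between two clusters is the arithmetic mean of all the proximities between the objects of one, on one side, and the objects of the other, on the other side; however, the subclusters from which each of these two clusters was most recently merged have equalized influence on that proximity, even if the subclusters differed in the number of objects.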
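The difference between the two averaging rules can be seen on a small made-up example: a cluster has just been formed from subcluster I (three objects) and subcluster J (one object), and we want its distance to a third cluster K. The data and the helper name `average_proximity` are assumptions of this sketch.

```python
from itertools import product
from statistics import mean


def average_proximity(a, b):
    """Arithmetic mean of all distances between objects of a and objects of b."""
    return mean(abs(x - y) for x, y in product(a, b))


# One-dimensional toy objects (values invented for illustration).
sub_i = [0, 1, 2]  # subcluster with 3 objects
sub_j = [4]        # subcluster with 1 object
other = [10]       # cluster K

# UPGMA: every object of the merged cluster counts equally.
upgma = average_proximity(other, sub_i + sub_j)

# WPGMA: the two subclusters count equally, whatever their sizes.
wpgma = (average_proximity(other, sub_i) + average_proximity(other, sub_j)) / 2

print(upgma, wpgma)
```

Here the single object of J pulls the WPGMA value towards itself more strongly than it does under UPGMA, because J carries half of the weight despite being one object out of four.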
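Method of within-group average linkage (MNDIS). Proximity between two clusters is the arithmetic mean of all the proximities within their joint cluster. This method is an alternative to UPGMA. It usually loses to UPGMA in terms of cluster density, but it sometimes uncovers cluster shapes that UPGMA cannot.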
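Centroid method (UPGMC). Proximity between two clusters is the proximity between their geometric centroids: the [squared] euclidean distance between them. The metaphor of this build of cluster is proximity of platforms (as in politics). As in political parties, such clusters can have fractions or "factions", but as long as their central figures are not far from each other, the union is consistent. Clusters can vary in outline.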
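Median, or equilibrious centroid method (WPGMC), is a modification of the previous one. Proximity between two clusters is the proximity between their geometric centroids (the [squared] euclidean distance between them); however, the centroids are defined so that the subclusters from which each of these two clusters was most recently merged have equalized influence on its centroid, even if the subclusters differed in the number of objects.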
Ward's method, or minimal increase of sum-of-squares (MISSQ), sometimes called the "minimum variance" method. Proximity between two clusters is the magnitude by which the summed square in their joint cluster will be greater than the combined summed square in these two clusters: $SS_{12} - (SS_1 + SS_2)$. (Between two singleton objects this quantity = squared euclidean distance / 2.) The metaphor of this build of cluster is type. Intuitively, a type is a cloud that is denser and more concentric towards its middle, whereas marginal points are few and can be scattered relatively freely.
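The quantity $SS_{12} - (SS_1 + SS_2)$ can be checked numerically. The sketch below uses invented two-dimensional points and hypothetical helper names (`sum_of_squares`, `ward_proximity`); it also compares the result with the well-known equivalent form $\frac{n_1 n_2}{n_1 + n_2}\lVert c_1 - c_2 \rVert^2$, where $c_1$ and $c_2$ are the cluster centroids.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points (tuples)."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def sum_of_squares(points):
    """Summed squared deviations of the points from their centroid."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))


def ward_proximity(a, b):
    """Increase of the summed square caused by merging a and b: SS12 - (SS1 + SS2)."""
    return sum_of_squares(a + b) - (sum_of_squares(a) + sum_of_squares(b))


# Two singletons at squared euclidean distance 3**2 + 4**2 = 25.
singletons = ward_proximity([(0, 0)], [(3, 4)])

# A two-object cluster and a singleton.
a, b = [(0, 0), (2, 0)], [(1, 3)]
increase = ward_proximity(a, b)

# Equivalent centroid form: n1*n2/(n1+n2) * squared distance between centroids.
c_a, c_b = centroid(a), centroid(b)
shortcut = len(a) * len(b) / (len(a) + len(b)) * sum((x - y) ** 2 for x, y in zip(c_a, c_b))

print(singletons, increase, shortcut)
```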
Some of the less well-known methods (see Podani J. New combinatorial clustering methods // Vegetatio, 1989, 81: 61-77.) [also implemented by me as an SPSS macro found on my web page]:
Method of minimal sum-of-squares (MNSSQ). Proximity between two clusters is the summed square in their joint cluster: $SS_{12}$. (Between two singleton objects this quantity = squared euclidean distance / 2.)
Method of minimal increase of variance (MIVAR). Proximity between two clusters is the magnitude by which the mean square in their joint cluster will be greater than the average, weighted by the number of objects, of the mean squares in these two clusters: $MS_{12} - (n_1 MS_1 + n_2 MS_2)/(n_1 + n_2) = [SS_{12} - (SS_1 + SS_2)]/(n_1 + n_2)$. (Between two singleton objects this quantity = squared euclidean distance / 4.)
Method of minimal variance (MNVAR). Proximity between two clusters is the mean square in their joint cluster: $MS_{12} = SS_{12}/(n_1 + n_2)$. (Between two singleton objects this quantity = squared euclidean distance / 4.)
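The three definitions above can be sketched side by side. This is an illustration with invented points and hypothetical function names (`mnssq`, `mivar`, `mnvar`), taking the mean square as the summed square divided by the number of objects, as in the formulas above.

```python
def sum_of_squares(points):
    """Summed squared deviations of the points from their centroid (SS)."""
    n = len(points)
    center = [sum(coord) / n for coord in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, center))


def mean_square(points):
    """Mean square (MS): SS divided by the number of objects."""
    return sum_of_squares(points) / len(points)


def mnssq(a, b):
    """Summed square in the joint cluster: SS12."""
    return sum_of_squares(a + b)


def mivar(a, b):
    """MS12 minus the size-weighted average of MS1 and MS2."""
    n_a, n_b = len(a), len(b)
    pooled = (n_a * mean_square(a) + n_b * mean_square(b)) / (n_a + n_b)
    return mean_square(a + b) - pooled


def mnvar(a, b):
    """Mean square in the joint cluster: MS12."""
    return mean_square(a + b)


# Two singletons at squared euclidean distance 25.
p, q = [(0, 0)], [(3, 4)]
print(mnssq(p, q), mivar(p, q), mnvar(p, q))  # d^2/2, d^2/4, d^2/4

# MIVAR equals the increase of the summed square divided by n1 + n2.
a, b = [(0, 0), (2, 0)], [(1, 3)]
ss_increase = sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)
print(mivar(a, b), ss_increase / (len(a) + len(b)))
```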
The first 5 methods permit any proximity measure (any similarity or distance), and the results will, naturally, depend on the measure chosen.
The last 6 methods require distances, and it is fully correct to use only squared euclidean distances with them, because these methods compute centroids in euclidean space. Therefore the distances should be euclidean for the sake of geometric correctness (these 6 methods are together called geometric linkage methods). At worst, you might input other metric distances, accepting a more heuristic, less rigorous analysis. Now about that "squared". Computing centroids and deviations from them is mathematically and programmatically most convenient on squared distances, which is why HAC packages usually require squared distances as input and are tuned to process them. However, there exist implementations, fully equivalent yet a bit slower, that are based on nonsquared distances and require those as input; see for example the "Ward-2" implementation of Ward's method. You should consult the documentation of your clustering program to learn which distances, squared or not, it expects as input to a "geometric method" in order to do it right.
Methods MNDIS, MNSSQ, and MNVAR require, at each step, storing a within-cluster statistic (which depends on the method) in addition to updating the proximities by the Lance-Williams formula.
The methods most frequently used in studies where clusters are expected to be solid, more or less round clouds are the average linkage methods, the complete linkage method, and Ward's method.
Ward's method is the closest, by its properties and efficiency, to K-means clustering; they share the same objective function: minimization of the pooled within-cluster SS "in the end". Of course, K-means (being iterative, and if provided with decent initial centroids) is usually a better minimizer of it than Ward. However, Ward seems to me a bit more accurate than K-means in uncovering clusters of uneven physical sizes (variances) or clusters thrown about space very irregularly. The MIVAR method is weird to me: I can't imagine when it could be recommended, as it doesn't produce dense enough clusters.
The centroid, median, and minimal increase of variance methods may sometimes give so-called reversals: a phenomenon where the two clusters being merged at some step appear closer to each other than pairs of clusters merged earlier. That is because these methods do not belong to the so-called ultrametric ones. This situation is inconvenient but theoretically OK.
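A reversal is easy to reproduce with the centroid method on three points forming a nearly equilateral triangle. The following naive sketch (invented data, hypothetical function name `centroid_linkage_heights`, brute-force search rather than the Lance-Williams update) records the squared euclidean distance between centroids at each merge.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def centroid_linkage_heights(points):
    """Merge the two clusters with the closest centroids until one cluster
    remains; return the merge heights (squared centroid distances) in order."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        heights.append(sq_dist(centroid(clusters[i]), centroid(clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return heights


# The base of the triangle (squared length 100) is merged first; its centroid
# (5, 0) then lies at squared distance 81 from the apex, below the first height.
heights = centroid_linkage_heights([(0, 0), (10, 0), (5, 9)])
print(heights)
```

The second merge happens at a lower height than the first, so the dendrogram branch would have to go "downwards", which is what a reversal looks like.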
The single linkage and centroid methods belong to the so-called space contracting, or “chaining”, methods. Roughly speaking, this means that they tend to attach objects one by one to clusters, and so they show relatively smooth growth of the curve “% of clustered objects”. In contrast, the complete linkage, Ward’s, sum-of-squares, increase of variance, and variance methods commonly get a considerable share of objects clustered even at early steps, and then proceed by merging those clusters; therefore their curve “% of clustered objects” is steep from the first steps. These methods are called space dilating. Other methods fall in between.
Flexible versions. By adding an additional parameter to the Lance-Williams formula it is possible to make a method self-tuning over its steps. The parameter introduces a correction to the between-cluster proximity being computed, and the correction depends on the size (amount of de-compactness) of the clusters. The meaning of the parameter is that it makes the agglomeration method more space dilating or more space contracting than the standard method is bound to be. The best-known implementation of flexibility so far is for the average linkage methods UPGMA and WPGMA (Belbin, L. et al. A Comparison of Two Approaches to Beta-Flexible Clustering // Multivariate Behavioral Research, 1992, 27, 417–433.).
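As a sketch only: in the beta-flexible scheme as I understand it from the literature, the Lance-Williams coefficients are constrained so that $\alpha_i + \alpha_j + \beta = 1$ and $\gamma = 0$, with $\alpha_i = \alpha_j = (1 - \beta)/2$ for flexible WPGMA and $\alpha_i = (1 - \beta)\, n_i/(n_i + n_j)$ for flexible UPGMA. Treat these coefficient forms, and the function names below, as assumptions of the example rather than as a description of any particular package.

```python
def flexible_wpgma(d_ki, d_kj, d_ij, beta):
    """Beta-flexible WPGMA update (assumed form): equal weights (1 - beta) / 2."""
    alpha = (1.0 - beta) / 2.0
    return alpha * d_ki + alpha * d_kj + beta * d_ij


def flexible_upgma(d_ki, d_kj, d_ij, n_i, n_j, beta):
    """Beta-flexible UPGMA update (assumed form): size-proportional weights."""
    alpha_i = (1.0 - beta) * n_i / (n_i + n_j)
    alpha_j = (1.0 - beta) * n_j / (n_i + n_j)
    return alpha_i * d_ki + alpha_j * d_kj + beta * d_ij


# beta = 0 gives back the standard methods.
plain_wpgma = flexible_wpgma(3, 5, 2, beta=0.0)
plain_upgma = flexible_upgma(3, 5, 2, n_i=1, n_j=3, beta=0.0)

# A negative beta inflates the updated distance (more space dilating);
# a positive beta would shrink it (more space contracting).
dilated_wpgma = flexible_wpgma(3, 5, 2, beta=-0.25)
dilated_upgma = flexible_upgma(3, 5, 2, n_i=1, n_j=3, beta=-0.25)

print(plain_wpgma, dilated_wpgma, plain_upgma, dilated_upgma)
```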
Dendrogram. The "Y" axis of a dendrogram typically displays the proximity between the merging clusters, as defined by the methods above. Therefore, for example, in the centroid method the squared distance is typically shown (ultimately, it depends on the package and its options); some researchers are not aware of that. Also, by tradition, with methods based on an increment of nondensity, such as Ward’s, the dendrogram usually shows the cumulative value; this is for convenience rather than for theoretical reasons. Thus (in many packages) the plotted coefficient in Ward’s method represents the overall, across all clusters, within-cluster sum-of-squares observed at the moment of a given step.
One should refrain from judging which linkage method is "better" for one's data by comparing the looks of the dendrograms: not only because the looks change when you change which modification of the coefficient you plot, as just described, but also because the looks will differ even on data with no clusters.
To choose the "right" method
There is no single criterion. Some guidelines on how to go about selecting a method of cluster analysis (including a linkage method in HAC as a particular case) are outlined in this answer and the whole thread therein.