How to calculate the overlap between empirical probability densities?

14

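I am looking for a way to calculate the area of overlap between two kernel density estimates in R, as a measure of similarity between two samples. To clarify, in the following example I would need to quantify the area of the purplish overlapping region: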

library(ggplot2)
set.seed(1234)
d <- data.frame(variable=c(rep("a", 50), rep("b", 30)), value=c(rnorm(50), runif(30, 0, 3)))
ggplot(d, aes(value, fill=variable)) + geom_density(alpha=.4, color=NA)

[Figure: kernel density plots of the two samples, with the overlapping region appearing purple]

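A similar question was discussed here; the difference is that I need to do this for arbitrary empirical data rather than for predefined normal distributions. The overlap package addresses this question, but apparently only for timestamp data, which does not work for me. The Bray-Curtis index (as implemented in the vegan package's vegdist(method="bray") function) also seems relevant, but again for somewhat different data.

I am interested in both the theoretical approach and the R functions I might use to implement it.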

r probability pdf kernel-smoothing

— mmk

2

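"Quantifying the purple area" is an estimation problem, not a hypothesis test, so you cannot hope to "accomplish this using a standard, citable statistical test". You contradict yourself. Please clarify what you actually want. If all you need is an estimate of the area of overlap of two KDEs, that is a simple calculation.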

— Glen_b -Reinstate Monica 2014

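@Glen_b Thanks for your comment; it helped clarify my non-statistician's thinking. I believe the area of overlap between KDEs really is what I am looking for, and I have edited the question to reflect that.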

— mmk

2

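I am very concerned about the risk of arbitrariness in this method. Depending on the kernel bandwidth, the computed overlap between any two datasets can be made to equal any chosen value in the interval $(0,1)$. Because the default bandwidths are not optimized for this purpose, they can give surprising, arbitrary, or inconsistent results. Datasets with natural bounds (such as non-negative data or proportions) introduce further unwanted edge effects. What to do instead? Start with the reason for this calculation: what is this "similarity" supposed to mean?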
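As an illustration of this bandwidth sensitivity, here is a minimal sketch. It assumes Gaussian kernels, base R's density() with the same explicit bw for both samples, and a plain trapezoidal rule. The helper name overlap_at_bw and the chosen bandwidths are hypothetical and not taken from the thread.

```r
set.seed(1234)
a <- rnorm(50)
b <- runif(30, 0, 3)

# Hypothetical helper: overlap area of two Gaussian KDEs sharing one bandwidth
overlap_at_bw <- function(a, b, bw, n = 2048) {
  # common grid, padded by 4 bandwidths so the kernel tails are not cut off
  lower <- min(a, b) - 4 * bw
  upper <- max(a, b) + 4 * bw
  da <- density(a, bw = bw, from = lower, to = upper, n = n)
  db <- density(b, bw = bw, from = lower, to = upper, n = n)
  w <- pmin(da$y, db$y)
  # trapezoidal rule on the shared grid
  sum(diff(da$x) * (head(w, -1) + tail(w, -1)) / 2)
}

# illustrative bandwidths, from very narrow to very wide
bws <- c(0.05, 0.2, 0.5, 1, 3)
overlaps <- sapply(bws, function(h) overlap_at_bw(a, b, h))
print(data.frame(bw = bws, overlap = round(overlaps, 3)))
```

The same two samples yield noticeably different overlap values as the bandwidth changes, which is the arbitrariness the comment warns about.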

— whuber

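The same question came up a few months later; it mentioned intersection points, and there were some valid notes there that should be taken into account. The referenced question involves two empirical distributions. Since this post only answers it through kernel density estimation and normal distributions, I am adding the link, which I think extends the discussion to pairs of empirical distributions: stats.stackexchange.com/questions/122857/…

— Barnaby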

9

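The area of overlap of two kernel density estimates can be approximated to any desired degree of accuracy.

1) If both KDEs have been evaluated on the same grid (or can easily be re-evaluated on one), take $\min(K_1(x),K_2(x))$ at each grid point and integrate the result numerically, for example with the trapezoidal rule.

If the two are on different grids and cannot easily be recomputed on a common grid, interpolation can be used.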
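A minimal sketch of this grid-based approach, including the interpolation step, in base R. It assumes the two KDEs come from density() with its default (different) grids. Treating each density as 0 outside its own grid is an assumption of this sketch, and the helper trapz is a hypothetical name.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)

# KDEs evaluated on their own default grids (different ranges)
da <- density(a)
db <- density(b)

# common grid spanning both, then linear interpolation onto it;
# outside a KDE's own grid the density is treated as 0 (an assumption)
x  <- seq(min(da$x, db$x), max(da$x, db$x), length.out = 1024)
ya <- approx(da$x, da$y, xout = x, yleft = 0, yright = 0)$y
yb <- approx(db$x, db$y, xout = x, yleft = 0, yright = 0)$y

# hypothetical helper: trapezoidal rule
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# pointwise minimum of the two curves, integrated numerically
w <- pmin(ya, yb)
overlap_area <- trapz(x, w)
overlap_area
```

Because pmin() is taken pointwise, this works unchanged when the curves cross more than once.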

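2) Alternatively, find the point (or points) of intersection and, on each interval between them, integrate whichever of the two KDEs is lower there. In the figure above, you would integrate the blue curve to the left of the intersection and the pink curve to the right, by whatever means you like or have available. This can be done essentially exactly by considering the area under each kernel component $\frac{1}{h}K\left(\frac{x-x_i}{h}\right)$ to the left or right of that cut-off point.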
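A sketch of how this intersection-based calculation might look for Gaussian kernels, where the area under each kernel component up to a cut-off is given by pnorm(). The bandwidths come from bw.nrd0() (the default rule in density()). The function names kde_pdf, kde_cdf and diff_fn are hypothetical. Crossings outside the search grid, where both densities carry negligible mass, are ignored.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)
ha <- bw.nrd0(a)  # default bandwidth rule of density()
hb <- bw.nrd0(b)

# Gaussian KDE and its exact CDF (averages over the kernel components)
kde_pdf <- function(x, data, h) sapply(x, function(t) mean(dnorm(t, data, h)))
kde_cdf <- function(x, data, h) sapply(x, function(t) mean(pnorm(t, data, h)))

# locate crossings: bracket sign changes on a grid, refine with uniroot()
diff_fn <- function(x) kde_pdf(x, a, ha) - kde_pdf(x, b, hb)
pad  <- 4 * max(ha, hb)
grid <- seq(min(a, b) - pad, max(a, b) + pad, length.out = 2001)
s    <- sign(diff_fn(grid))
idx  <- which(s[-1] * s[-length(s)] < 0)
roots <- vapply(idx, function(i) {
  uniroot(diff_fn, grid[c(i, i + 1)], tol = 1e-10)$root
}, numeric(1))

# integration limits, plus finite stand-ins used only to probe which curve is lower
cuts  <- c(-Inf, roots, Inf)
probe <- c(min(grid), roots, max(grid))
mids  <- (head(probe, -1) + tail(probe, -1)) / 2
a_lower <- kde_pdf(mids, a, ha) < kde_pdf(mids, b, hb)

# exact mass of each KDE on every interval; keep the lower curve's mass
mass_a <- diff(kde_cdf(cuts, a, ha))
mass_b <- diff(kde_cdf(cuts, b, hb))
overlap_exact <- sum(ifelse(a_lower, mass_a, mass_b))
overlap_exact
```

Only the root finding is iterative here; the integration itself uses closed-form kernel CDFs.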

However, whuber's comments above should be clearly borne in mind: this is not necessarily a very meaningful thing to do.

— Glen_b -Reinstate Monica

How would you calculate the errors associated with method 1 and method 2?

— olliepower

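In ordinary circumstances both would be so small compared with the error in the kernel density estimates themselves that I would not worry much about them. You can of course compute error bounds for the trapezoidal rule or any other numerical integration (such calculations are fairly standard), but it is pointless given the large uncertainty in the KDEs. Method 2 is exact up to accumulated rounding error in the calculations.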
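One rough way to see how small the numerical integration error is: recompute the method 1 overlap on successively finer grids and compare. This is only a sketch. It assumes density() with its default bandwidths and a trapezoidal rule, and the helper overlap_on_grid and the grid sizes are hypothetical.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)
lower <- min(a, b) - 1
upper <- max(a, b) + 1

# hypothetical helper: method 1 overlap on an n-point common grid
overlap_on_grid <- function(n) {
  da <- density(a, from = lower, to = upper, n = n)
  db <- density(b, from = lower, to = upper, n = n)
  w <- pmin(da$y, db$y)
  sum(diff(da$x) * (head(w, -1) + tail(w, -1)) / 2)
}

# illustrative grid sizes, coarse to fine
ns  <- c(64, 256, 1024, 4096)
res <- sapply(ns, overlap_on_grid)
print(data.frame(n = ns, overlap = res, diff_from_finest = res - res[length(res)]))
```

The differences between grid sizes can then be compared with how much the overlap moves when the bandwidth or the sample changes.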

— Glen_b -Reinstate Monica

1

These methodological suggestions make sense; thank you for your answer. I will work on implementing this in R, but as a novice I would be interested in any suggestions on how to code it cleanly.

— mmk

10

For completeness, here is how to do this in R:

# simulate two samples
a <- rnorm(100)
b <- rnorm(100, 2)

# define limits of a common grid, adding a buffer so that tails aren't cut off
lower <- min(c(a, b)) - 1 
upper <- max(c(a, b)) + 1

# generate kernel densities
da <- density(a, from=lower, to=upper)
db <- density(b, from=lower, to=upper)
d <- data.frame(x=da$x, a=da$y, b=db$y)

# calculate intersection densities
d$w <- pmin(d$a, d$b)

# integrate areas under curves
library(sfsmisc)
total <- integrate.xy(d$x, d$a) + integrate.xy(d$x, d$b)
intersection <- integrate.xy(d$x, d$w)

# compute overlap coefficient
overlap <- 2 * intersection / total

As mentioned above, generating and integrating KDEs involves inherent uncertainty and subjectivity.

— mmk

2

There is now an overlapping package on CRAN that estimates the area of overlap of two (or more) empirical distributions. Check out the documentation here: rdocumentation.org/packages/overlapping/versions/1.5.0/topics/...

— Stefan Avey

The total should be: total = integrate.xy(d$x, d$a) + integrate.xy(d$x, d$b) - integrate.xy(d$x, d$w). You can check this using the overlapping package.
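The two choices of total lead to two different indices, so a side-by-side sketch may help. It uses a base R trapezoidal rule (the hypothetical helper trapz) instead of sfsmisc::integrate.xy so that it runs without extra packages. The names overlap_coef and jaccard_like are labels chosen here, not names from the thread.

```r
set.seed(1)
a <- rnorm(100)
b <- rnorm(100, 2)

# common grid with a buffer, as in the answer
lower <- min(c(a, b)) - 1
upper <- max(c(a, b)) + 1
da <- density(a, from = lower, to = upper)
db <- density(b, from = lower, to = upper)

# hypothetical helper: trapezoidal rule
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

area_a <- trapz(da$x, da$y)
area_b <- trapz(db$x, db$y)
intersection <- trapz(da$x, pmin(da$y, db$y))

# total = A + B: since both areas are close to 1, this is close to the
# intersection area itself
overlap_coef <- 2 * intersection / (area_a + area_b)

# total = A + B - intersection: intersection over union (Jaccard-like)
jaccard_like <- intersection / (area_a + area_b - intersection)

c(overlap_coef = overlap_coef, jaccard_like = jaccard_like)
```

The two are related by jaccard_like = overlap_coef / (2 - overlap_coef), so they rank pairs of samples identically and differ only in scale.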

— Rafael

@mmk Can this be done for 2D densities?

— No Lie

4

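First, I may be wrong, but I think your solution would not work in cases where the kernel density estimates (KDEs) intersect at more than one point. Second, although the overlap package was created for use with timestamp data, you can still use it to estimate the area of overlap between any two KDEs. You simply have to rescale your data so that it ranges from 0 to 2π.
For example: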

# simulate two samples
 a <- rnorm(100)
 b <- rnorm(100, 2)

# To use overlapTrue() {overlap} the scale must be in radians (i.e. 0 to 2*pi).
# To keep the *relative* values of a and b the same, combine a and b in the
# same data frame before rescaling. You'll need to load the 'scales' library.
# But first add a "Source" column to be able to distinguish between a and b
# after they are combined.
 a = data.frame( value = a, Source = "a" )
 b = data.frame( value = b, Source = "b" )
 d = rbind(a, b)
 library(scales) 
 d$value <- rescale( d$value, to = c(0,2*pi) )

# Now you can create the rescaled a and b vectors
 a <- d[d$Source == "a", 1]
 b <- d[d$Source == "b", 1]

# You can then calculate the area of overlap as you did previously.
# It should give almost exactly the same answer.
# Or you can use either the overlapTrue() or the overlapEst() function
# provided with the overlap package.
# Note that with these functions the KDEs are fitted using a von Mises kernel.
 library(overlap)
  # Using overlapTrue():
   # define limits of a common grid, adding a buffer so that tails aren't cut off
     lower <- min(d$value)-1 
     upper <- max(d$value)+1
   # generate kernel densities
     da <- density(a, from=lower, to=upper, adjust = 1)
     db <- density(b, from=lower, to=upper, adjust = 1)
   # Compute overlap coefficient
     overlapTrue(da$y,db$y)


  # Using overlapEst():            
    overlapEst(a, b, kmax = 3, adjust=c(0.8, 1, 4), n.grid = 500)

# You can also plot the two KDEs and the region of overlap using overlapPlot(),
# but sadly I haven't found a way of changing the x scale so that its range
# corresponds to the initial x values rather than the rescaled values.
# You can only change the maximum value of the scale using the xscale argument
# (i.e. it always ranges from 0 to n, where n is set with xscale = n).
# So if some of your data take negative values, you're probably better off with
# a different plotting method. You can change the x label with the xlab
# argument.
  overlapPlot(a, b, xscale = 10, xlab = "x metrics", rug = TRUE)

— S. Venne