Extending the birthday paradox to more than 2 people

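In the traditional birthday paradox the question is "what are the chances that two or more people in a group of $n$ people share a birthday". I am stuck on a problem that is an extension of this.

Instead of the probability that two people share a birthday, I need the probability that $x$ or more people share a birthday. With $x=2$ you can do this by calculating the probability that no two people share a birthday and subtracting that from $1$, but I don't think this logic extends to larger values of $x$.

To complicate this further, I also need a solution that works for very large values of $n$ (millions) and $x$ (thousands).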

probability combinatorics birthday-paradox

— Simon Andrews

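I presume it is a bioinformatics problem.

— csgillespie

It is in fact a bioinformatics problem, but since it boils down to the same concept as the birthday paradox I thought I'd spare you the irrelevant specifics!

— Simon Andrews

Normally I would agree with you, but in this case the specifics may matter, since there might already be a Bioconductor package that does what you ask.

— csgillespie

If you really want to know, it is a pattern-finding problem in which I am trying to accurately estimate the probability of a given level of enrichment of a subsequence within a large set of sequences. I therefore have a set of subsequences with associated counts, and I know how many subsequences I observed and how many sequences could theoretically be observed. If I saw a particular sequence 10 times out of 10,000 observations, I need to know how likely that was to have occurred by chance.

— Simon Andrews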

Almost eight years later, I have posted an answer to this problem at stats.stackexchange.com/questions/333471. The code there won't work for large $n$, though, because it takes time quadratic in $n$.

— whuber

Answers:

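This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$.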

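These counts can be found exactly for values of $n$ less than several hundred. However, they do not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are: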

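Each person has a unique birthday; the code is {4}.
Exactly two people share a birthday; the code is {2,1}.
Two people have one birthday and the other two have another; the code is {0,2}.
Three people share a birthday; the code is {1,0,1}.
Four people share a birthday; the code is {0,0,0,1}.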

Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,

$1 a[1] + 2a[2] + \ldots + k a[k] + \ldots = n.$

Note, even in this simple case, that there are two ways in which the maximum of two people per birthday is attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.
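To make the notion of a code concrete, the following Python sketch lists every code for a given $n$. It is only an illustration: the function names `partition_codes` and `trimmed` are hypothetical and not part of the original answer.

```python
def partition_codes(n):
    """Yield every code (a[1], ..., a[n]) with 1*a[1] + 2*a[2] + ... + n*a[n] == n."""
    def fill(size, remaining):
        # Choose a[size] (number of groups of this size), then recurse on smaller sizes.
        if size == 0:
            if remaining == 0:
                yield ()
            return
        for count in range(remaining // size + 1):
            for rest in fill(size - 1, remaining - count * size):
                yield rest + (count,)
    yield from fill(n, n)


def trimmed(code):
    """Drop trailing zeros so codes look like the {2,1} notation used in the text."""
    code = list(code)
    while code and code[-1] == 0:
        code.pop()
    return tuple(code)


if __name__ == "__main__":
    for code in partition_codes(4):
        print(trimmed(code))
```

For $n = 4$ this produces exactly the five codes listed above, in some order.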

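We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the sequence of groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign each of them a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$, where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.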
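The three-term product can be sketched in Python as follows. The function names are illustrative assumptions; exact integer arithmetic is used throughout, so nothing is rounded.

```python
from math import factorial


def falling_factorial(b, m):
    """The "factorial product" b^(m) = b (b-1) ... (b-m+1)."""
    result = 1
    for i in range(m):
        result *= b - i
    return result


def assignments_for_code(code, b):
    """Number of birthday assignments matching code = (a[1], a[2], ...)."""
    n = sum(k * a for k, a in enumerate(code, start=1))
    # Term 1: multinomial coefficient n! / ((1!)^a[1] (2!)^a[2] ...).
    multinomial = factorial(n)
    for k, a in enumerate(code, start=1):
        multinomial //= factorial(k) ** a
    # Term 2: divide by a[1]! a[2]! ... because the order of equal-sized groups is irrelevant.
    orderings = 1
    for a in code:
        orderings *= factorial(a)
    # Term 3: assign distinct birthdays to the a[1] + a[2] + ... groups.
    return multinomial // orderings * falling_factorial(b, sum(code))


if __name__ == "__main__":
    print(assignments_for_code((2, 1), 5))  # code {2,1} with b = 5
    print(assignments_for_code((0, 2), 5))  # code {0,2} with b = 5
```

With $b = 5$ the codes {2,1} and {0,2} give 360 and 60 assignments respectively.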

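There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.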
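One way to organize such a recursion is sketched below in Python. It is an illustration under stated assumptions rather than the original implementation: the helper `ways_at_most` (a hypothetical name) counts assignments in which no birthday is held by more than $k$ people, peeling off the groups of size $k$ first, and $q(k;n,b)$ is then obtained as a difference of two such counts.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def ways_at_most(m, k, d):
    """Assignments of m people to d available birthdays, no birthday held by more than k."""
    if m == 0:
        return 1
    if k == 0 or d == 0:
        return 0
    total = 0
    groups = 1  # ways to draw a unordered groups of k people so far
    days = 1    # ways to give those a groups distinct birthdays
    left = m    # people not yet placed in a group of size k
    a = 0       # number of birthdays shared by exactly k people
    while True:
        # The remaining people must follow a pattern with groups smaller than k.
        total += groups * days * ways_at_most(left, k - 1, d - a)
        if left < k or a == d:
            break
        a += 1
        groups = groups * comb(left, k) // a  # one more group; order is irrelevant
        days *= d - a + 1
        left -= k
    return total


def q(k, n, b):
    """Assignments whose most-shared birthday is held by exactly k people."""
    return ways_at_most(n, k, b) - ways_at_most(n, k - 1, b)


if __name__ == "__main__":
    print([q(k, 4, 5) for k in range(1, 5)])
```

For $n = 4$ and $b = 5$ this prints `[120, 420, 80, 5]`, which sums to $5^4 = 625$.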

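I doubt there is a closed-form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples.

With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain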

$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5. }$

Hence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$.

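As another example, take $b = 365$ and $n = 23$. Here are the values of $q(k;23,365)$ for the smallest $k$, expressed as probabilities (that is, after multiplying by $365^{-23}$) and given to six significant figures only: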

$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449 \times 10^{-6} \\ k=6: &1.48722 \times 10^{-8} \\ k=7: &9.92255 \times 10^{-11} \\ k=8: &5.45195 \times 10^{-13}. }$

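Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.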

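One approximation is obtained by means of the Poisson distribution with expectation $n/b$, because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables, each with expectation $n/b$: the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$, where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$, $b = 365$ gives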

$\eqalign{ k=1: &0.498783 \\ k=2: &0.486803 \\ k=3: &0.014187 \\ k=4: &0.000225115. }$

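By comparing with the preceding values you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated, to about 0.5%. Testing with a wide range of $n$ and $b$ suggests that the approximation is usually about this good.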
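The Poisson approximation can be sketched with the Python standard library alone. The function names are illustrative. Evaluating $F(k)^b$ through the upper tail and `log1p` is an implementation choice assumed here (it is not described in the original answer), made so that precision is kept when $F(k)$ is very close to 1, as happens for large $b$.

```python
import math


def poisson_pmf(j, lam):
    """Poisson probability of j events, evaluated in log space for stability."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


def log_poisson_cdf(k, lam):
    """log F(k) for a Poisson(lam) variable, via the upper tail when F(k) is near 1."""
    upper = int(k + lam + 20 * math.sqrt(lam) + 50)  # terms beyond this are negligible
    tail = sum(poisson_pmf(j, lam) for j in range(k + 1, upper + 1))
    if tail < 0.5:
        return math.log1p(-tail)
    lower = sum(poisson_pmf(j, lam) for j in range(k + 1))
    return math.log(lower) if lower > 0 else -math.inf


def approx_max_cdf(k, n, b):
    """Approximate P(no birthday is shared by more than k people) = F(k)^b."""
    return math.exp(b * log_poisson_cdf(k, n / b))


def approx_max_pmf(k, n, b):
    """Approximate P(the most-shared birthday is held by exactly k people)."""
    return approx_max_cdf(k, n, b) - approx_max_cdf(k - 1, n, b)


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, approx_max_pmf(k, 23, 365))
```

The probability that $x$ or more people share a birthday is then approximately `1 - approx_max_cdf(x - 1, n, b)`.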

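To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures"). The approximate distribution for the maximum number of "shared birthdays" is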

$\eqalign{ k=1: &0 \\ k=2: &0.8475+ \\ k=3: &0.1520+ \\ k=4: &0.0004+ \\ k > 4: & < 10^{-6}. }$

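(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.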

Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like

simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];

which is then iterated and summarized, as in this example, which runs 10,000 iterations of the $n = 10000$, $b = 1\,000\,000$ case:

Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm

Its output is

2 8503
3 1493
4 4

These frequencies closely agree with those predicted by the Poisson approximation.
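For readers without Mathematica, a rough Python counterpart of the same simulation might look like this. The names `simulate` and `tally_simulations` and the `seed` parameter are assumptions made for this sketch, and the exact counts will differ from run to run and from the Mathematica output.

```python
import random
from collections import Counter


def simulate(n, b, rng=random):
    """Largest number of people sharing one birthday in a single random assignment."""
    draws = Counter(rng.randrange(b) for _ in range(n))
    return max(draws.values())


def tally_simulations(n, b, trials, seed=0):
    """Frequency table of the maximum over repeated simulations."""
    rng = random.Random(seed)
    return Counter(simulate(n, b, rng) for _ in range(trials))


if __name__ == "__main__":
    # Fewer iterations than the 10,000 used above, to keep the run short.
    for value, count in sorted(tally_simulations(10_000, 1_000_000, 500).items()):
        print(value, count)
```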

— whuber

What a fantastic answer, thank you very much @whuber.

— JKnight

"There is an obvious and fairly simple recursion" — Namely?

— Kodiologist

@Kodiologist I inserted a brief description of the idea.

— whuber

+1 but where in the original question did you see that n=10000 and b=1mln? The OP looks like it is asking about n=1mln and k=10000, with b unspecified (presumably b=365). Not that it matters at this point :)

— amoeba says Reinstate Monica

@amoeba After all this time (six years, 1600 answers, and closely reading tens of thousands of posts) I cannot recall, but most likely I misinterpreted the last line. In my defense, note that if we read it literally the answer is immediate (upon applying a version of the Pigeonhole Principle): it is certain that among $n$ = millions of people there will be at least one birthday that is shared among at least $x$ = thousands of them!

— whuber

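It is always possible to solve this problem with a Monte Carlo solution, although that is far from the most efficient approach. Here is a simple example of the 2-person problem in R (from a presentation I gave last year; I used this as an example of inefficient code), which could easily be adjusted to account for more than 2: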

birthday.paradox <- function(n.people, n.trials) {
    matches <- 0
    for (trial in 1:n.trials) {
        birthdays <- cbind(as.matrix(1:365), rep(0, 365))
        for (person in 1:n.people) {
            day <- sample(1:365, 1, replace = TRUE)
            if (birthdays[birthdays[, 1] == day, 2] == 1) {
                matches <- matches + 1
                break
            }
            birthdays[birthdays[, 1] == day, 2] <- 1
        }
        birthdays <- NULL
    }
    print(paste("Probability of birthday matches = ", matches/n.trials))
}
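
One possible adjustment for $x$ or more people is sketched below, in Python rather than R. It is only an illustration of the idea: the function name, the `n_days` and `seed` parameters, and the early exit at the first $x$-way collision are assumptions of this sketch, not part of the presentation code above.

```python
import random


def estimate_shared_birthday(n_people, x, n_trials, n_days=365, seed=0):
    """Monte Carlo estimate of P(some birthday is shared by at least x people)."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_trials):
        counts = [0] * n_days  # people seen so far on each day
        for _ in range(n_people):
            day = rng.randrange(n_days)
            counts[day] += 1
            if counts[day] >= x:  # stop this trial at the first x-way collision
                matches += 1
                break
    return matches / n_trials


if __name__ == "__main__":
    print("Pairs among 23 people:  ", estimate_shared_birthday(23, 2, 20_000))
    print("Triples among 23 people:", estimate_shared_birthday(23, 3, 20_000))
```

The estimate carries the usual Monte Carlo error of order $1/\sqrt{\text{n\_trials}}$, so small probabilities need many trials.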

— Shane

I am not sure whether the multiple-types solution will work here.

I think that generalisation still only works for 2 or more people sharing a birthday - just that you can have different sub-classes of people.

— Simon Andrews

This is an attempt at a general solution. There may be some mistakes so use with caution!

First some notation:

Let $P(x,n)$ be the probability that $x$ or more people share a birthday among $n$ people,

and let $P(y|n)$ be the probability that exactly $y$ people share a birthday among $n$ people.

Notes:

Abuse of notation as $P(.)$ is being used in two different ways.
By definition $y$ cannot take the value of 1 as it does not make any sense and $y$ = 0 can be interpreted to mean that no one shares a common birthday.

Then the required probability is given by:

$P(x,n) = 1 - P(0|n) - P(2|n) - P(3|n) - \ldots - P(x-1|n)$

Now,

$P(y|n) = {n \choose y} (\frac{365}{365})^y \ \prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$

Here is the logic: You need the probability that exactly $y$ people share a birthday.

Step 1: You can pick $y$ people in ${n \choose y}$ ways.

Step 2: Since they share a birthday it can be any of the 365 days in a year. So, we basically have 365 choices which gives us $(\frac{365}{365})^y$ .

Step 3: The remaining $n-y$ people should not share a birthday with the first $y$ people or with each other. This reasoning gives us $\prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$ .

You can check that for $x$ = 2 the above collapses to the standard birthday paradox solution.

Will this solution suffer from the curse of dimensionality? If instead of n=365, n=10^6 is this solution still feasible?

— csgillespie

Some approximations may have to be used to deal with high dimensions. Perhaps, use Stirling's approximation for factorials in the binomial coefficient. To deal with the product terms you could take logs and compute the sums instead of the products and then take the anti-log of the sum.

There are also several other forms of approximations possible using for example the Taylor series expansion for the exponential function. See the wiki page for these approximations: en.wikipedia.org/wiki/Birthday_problem#Approximations
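
As an illustration of the "take logs and sum" suggestion, here is a minimal Python sketch for the classical two-person case only. The function names are hypothetical, and `math.log1p` is used as one reasonable way to keep precision when $k/b$ is tiny.

```python
import math


def prob_all_distinct(n, b=365):
    """P(no two of n people share a birthday), as the exponential of a sum of logs."""
    if n > b:
        return 0.0  # pigeonhole: a repeated birthday is certain
    # log of the product (1 - 1/b)(1 - 2/b)...(1 - (n-1)/b)
    log_p = sum(math.log1p(-k / b) for k in range(1, n))
    return math.exp(log_p)


def prob_some_pair(n, b=365):
    """P(at least two of n people share a birthday)."""
    return 1.0 - prob_all_distinct(n, b)


if __name__ == "__main__":
    print(prob_some_pair(23))
    print(prob_all_distinct(10_000, 10**6))
```

Summing logs avoids forming a long product of factors close to 1 and works unchanged when $b$ is in the millions.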

Suppose y=2, n=4, and there are just two birthdays. Your formula, adapted by replacing 365 by 2, seems to say the probability that exactly 2 people share a birthday is Comb(4,2)*(2/2)^2*(1-1/2)*(1-2/2) = 0. (In fact, it's easy to see--by brute force enumeration if you like--that the probabilities that 2, 3, or 4 people share a "birthday" are 6/16, 8/16, and 2/16, respectively.) Indeed, whenever n-y >= 365, your formula yields 0, whereas as n gets large and y is fixed the probability should increase to a non-zero maximum before n reaches 365*y and then decrease, but never down to 0.

— whuber
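
The brute-force enumeration mentioned in the comment above can be sketched in a few lines of Python. The function name is illustrative, and the approach is only practical for very small $n$ and $b$ because it lists all $b^n$ assignments.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def max_share_distribution(n, b):
    """Exact distribution of the largest birthday group, by listing all b**n assignments."""
    tally = Counter(
        max(Counter(assignment).values())
        for assignment in product(range(b), repeat=n)
    )
    return {k: Fraction(count, b ** n) for k, count in sorted(tally.items())}


if __name__ == "__main__":
    # Four people and two possible "birthdays".
    for k, p in max_share_distribution(4, 2).items():
        print(k, p)
```

For four people and two birthdays the probabilities are 3/8, 1/2 and 1/8, that is, 6/16, 8/16 and 2/16.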

Why are you replacing 365 by $n$? The probability that 2 people share a birthday is computed as: 1 - Prob(they have unique birthday). Prob(that they have unique birthday) = (364/365). The logic is as follows: Pick a person. This person can have any day of the 365 days as a birthday. The second person can then only have a birthday on one of the remaining 364 days. Thus, the prob that they have a unique birthday is 364/365. I am not sure how you are calculating 6/16.