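Extending the birthday paradox to more than 2 people

In the traditional birthday paradox the question is "what are the chances that two or more people in a group of $n$ people share a birthday?" I am stuck on a problem that is an extension of this.

Instead of knowing the probability that two people share a birthday, I need to extend the question to find the probability that $x$ or more people share a birthday. With $x=2$ you can do this by calculating the probability that no two people share a birthday and subtracting that from $1$, but I don't think I can extend this logic to larger values of $x$.

To complicate this further, I also need a solution that works for very large values of $n$ (millions) and $x$ (thousands).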

probability combinatorics birthday-paradox

— Simon Andrews

I suspect this is a bioinformatics problem.

— csgillespie

It actually is a bioinformatics problem, but since it boils down to the same concept as the birthday paradox I thought I'd spare you the irrelevant specifics!

— Simon Andrews

Normally I would agree with you, but in this case the specifics might matter, because there may already be a Bioconductor package that does what you are asking.

— csgillespie

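If you really want to know, it's a pattern-finding problem in which I am trying to accurately estimate the probability of a given level of enrichment of a subsequence within a large set of sequences. I therefore have a set of subsequences with associated counts, and I know how many subsequences I observed and how many theoretically observable sequences are available. If I saw a particular sequence 10 times out of 10,000 observations, I need to know how likely that is to have occurred by chance.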

— Simon Andrews

Almost eight years later, I have posted an answer to this problem at stats.stackexchange.com/questions/333471. The code there does not work for large $n,$ though, because it takes quadratic time in $n$.

— whuber

Answers:

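This is a counting problem: there are $b^n$ possible assignments of $b$ birthdays to $n$ people. Of those, let $q(k; n, b)$ be the number of assignments for which no birthday is shared by more than $k$ people but at least one birthday actually is shared by $k$ people. The probability we seek can be found by summing the $q(k;n,b)$ for appropriate values of $k$ and multiplying the result by $b^{-n}$.

These counts can be found exactly for values of $n$ less than several hundred. However, they do not follow any straightforward formula: we have to consider the patterns of ways in which birthdays can be assigned. I will illustrate this in lieu of providing a general demonstration. Let $n = 4$ (this is the smallest interesting situation). The possibilities are: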

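Each person has a unique birthday; the code is {4}.
Exactly two people share a birthday; the code is {2,1}.
Two people have one birthday and the other two have another; the code is {0,2}.
Three people share a birthday; the code is {1,0,1}.
Four people share a birthday; the code is {0,0,0,1}.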

Generally, the code $\{a[1], a[2], \ldots\}$ is a tuple of counts whose $k^\text{th}$ element stipulates how many distinct birthdates are shared by exactly $k$ people. Thus, in particular,

$1 a[1] + 2a[2] + \cdots + k a[k] + \cdots = n.$

Note that even in this simple case there are two ways in which the maximum of two people per birthday can be attained: one with the code $\{0,2\}$ and another with the code $\{2,1\}$.

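We can directly count the number of possible birthday assignments corresponding to any given code. This number is the product of three terms. One is a multinomial coefficient; it counts the number of ways of partitioning $n$ people into $a[1]$ groups of $1$, $a[2]$ groups of $2$, and so on. Because the order of the groups does not matter, we have to divide this multinomial coefficient by $a[1]!a[2]!\cdots$; its reciprocal is the second term. Finally, line up the groups and assign each one a birthday: there are $b$ candidates for the first group, $b-1$ for the second, and so on. These values have to be multiplied together, forming the third term. It is equal to the "factorial product" $b^{(a[1]+a[2]+\cdots)}$, where $b^{(m)}$ means $b(b-1)\cdots(b-m+1)$.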
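Written as a single expression, the three terms combine as shown here. The symbol $N(a)$ is shorthand introduced only for this illustration, and the worked line assumes, purely as an example, the code $\{2,1\}$ with $n = 4$ people and $b = 5$ possible dates.

$$N(a) = \frac{n!}{\prod_k (k!)^{a[k]}} \cdot \frac{1}{\prod_k a[k]!} \cdot b^{(a[1]+a[2]+\cdots)}$$

$$N(\{2,1\}) = \frac{4!}{(1!)^2\,(2!)^1} \cdot \frac{1}{2!\,1!} \cdot 5^{(3)} = 12 \cdot \frac{1}{2} \cdot 60 = 360.$$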

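There is an obvious and fairly simple recursion relating the count for a pattern $\{a[1], \ldots, a[k]\}$ to the count for the pattern $\{a[1], \ldots, a[k-1]\}$. This enables rapid calculation of the counts for modest values of $n$. Specifically, $a[k]$ represents $a[k]$ birthdates shared by exactly $k$ people each. After these $a[k]$ groups of $k$ people have been drawn from the $n$ people, which can be done in $x$ distinct ways (say), it remains to count the number of ways of achieving the pattern $\{a[1], \ldots, a[k-1]\}$ among the remaining people. Multiplying this by $x$ gives the recursion.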
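One possible reading of this recursion, sketched in Python rather than Mathematica. The names `max_share_counts` and `ways` are illustrative and not taken from the answer, and the sketch makes one assumption explicit that the description leaves implicit: it also tracks how many dates are still unused, so that each group receives a distinct birthday.

```python
from functools import lru_cache
from math import factorial, perm


def max_share_counts(n, b):
    """Return {k: q(k; n, b)} for k = 1..n, using exact integer arithmetic."""

    @lru_cache(maxsize=None)
    def ways(k, people, days):
        # Assignments of `people` labelled persons to `days` unused dates
        # in which no date is shared by more than k persons.
        if people == 0:
            return 1
        if k == 0 or days == 0:
            return 0
        total = 0
        # a = number of dates shared by exactly k persons
        for a in range(min(people // k, days) + 1):
            rest = people - a * k
            # x = ways to draw a unordered groups of k persons each
            x = factorial(people) // (factorial(k) ** a * factorial(a) * factorial(rest))
            # perm(days, a) gives the a groups distinct dates; the remaining
            # persons must then form groups of at most k - 1.
            total += x * perm(days, a) * ways(k - 1, rest, days - a)
        return total

    cumulative = [ways(k, n, b) for k in range(n + 1)]
    return {k: cumulative[k] - cumulative[k - 1] for k in range(1, n + 1)}


print(max_share_counts(4, 5))  # {1: 120, 2: 420, 3: 80, 4: 5}
```

Dividing each count by `b ** n` turns it into a probability. Memoizing on `(k, people, days)` keeps the work small for modest $n$, at the cost of storing big integers.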

I doubt there is a closed form formula for $q(k; n, b)$, which is obtained by summing the counts for all partitions of $n$ whose maximum term equals $k$. Let me offer some examples:

With $b=5$ (five possible birthdays) and $n=4$ (four people), we obtain

$\eqalign{ q(1) &= q(1;4,5) &= 120 \\ q(2) &= 360 + 60 &= 420 \\ q(3) &&= 80 \\ q(4) &&= 5.\\ }$

Whence, for example, the chance that three or more people out of four share the same "birthday" (out of $5$ possible dates) equals $(80 + 5)/625 = 0.136$ .
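For a case this small the counts can be checked by brute force, since there are only $5^4 = 625$ assignments to enumerate. The Python sketch below is an independent check, not the method described in the answer; `brute_force_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product


def brute_force_counts(n, b):
    """Tally every one of the b**n assignments by its largest shared-date group."""
    tally = Counter()
    for assignment in product(range(b), repeat=n):
        # Largest number of people landing on any single date
        tally[max(Counter(assignment).values())] += 1
    return dict(sorted(tally.items()))


counts = brute_force_counts(4, 5)
print(counts)                            # {1: 120, 2: 420, 3: 80, 4: 5}
print((counts[3] + counts[4]) / 5 ** 4)  # 0.136
```

Enumeration grows as $b^n$, so this is only useful as a sanity check on tiny inputs.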

As another example, take $b = 365$ and $n = 23$. Here are the values of $q(k;23,365)$ for the smallest $k$, expressed as probabilities (that is, multiplied by $365^{-23}$) and shown to six significant figures only:

$\eqalign{ k=1: &0.49270 \\ k=2: &0.494592 \\ k=3: &0.0125308 \\ k=4: &0.000172844 \\ k=5: &1.80449E-6 \\ k=6: &1.48722E-8 \\ k=7: &9.92255E-11 \\ k=8: &5.45195E-13. }$

Using this technique, we can readily compute that there is about a 50% chance of (at least) a three-way birthday collision among 87 people, a 50% chance of a four-way collision among 187, and a 50% chance of a five-way collision among 310 people. That last calculation starts taking a few seconds (in Mathematica, anyway) because the number of partitions to consider starts getting large. For substantially larger $n$ we need an approximation.

One approximation is obtained by means of the Poisson distribution with expectation $n/b$ , because we can view a birthday assignment as arising from $b$ almost (but not quite) independent Poisson variables each with expectation $n/b$ : the variable for any given possible birthday describes how many of the $n$ people have that birthday. The distribution of the maximum is therefore approximately $F(k)^b$ where $F$ is the Poisson CDF. This is not a rigorous argument, so let's do a little testing. The approximation for $n = 23$ , $b = 365$ gives

$\eqalign{ k=1: &0.498783 \\ k=2: &0.486803\\ k=3: &0.014187\\ k=4: &0.000225115. }$

By comparing with the preceding you can see that the relative probabilities can be poor when they are small, but the absolute probabilities are reasonably well approximated to about 0.5%. Testing with a wide range of $n$ and $b$ suggests the approximation is usually about this good.
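A minimal Python sketch of this approximation, assuming the probability that the maximum equals $k$ is read off as $F(k)^b - F(k-1)^b$. The function names are illustrative. The Poisson upper tail is summed directly, and $F(k)^b$ is evaluated through `log1p`, so that values of $F$ extremely close to 1 are not lost to rounding when $b$ is large.

```python
from math import exp, lgamma, log, log1p


def poisson_sf(k, lam):
    """P(X > k) for X ~ Poisson(lam), summed term by term."""
    total, j = 0.0, k + 1
    while True:
        term = exp(-lam + j * log(lam) - lgamma(j + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if j > lam and term <= 1e-17 * total:
            return min(total, 1.0)
        j += 1


def max_share_cdf(k, n, b):
    """Approximate P(no date is shared by more than k of the n people)."""
    tail = poisson_sf(k, n / b)
    if tail >= 1.0:
        return 0.0
    return exp(b * log1p(-tail))  # F(k)**b computed stably


def max_share_pmf(k, n, b):
    """Approximate P(the most-shared date is shared by exactly k people)."""
    return max_share_cdf(k, n, b) - max_share_cdf(k - 1, n, b)


for k in range(1, 5):
    print(k, max_share_pmf(k, 23, 365))
```

The same functions accept much larger `n` and `b`, because the cost depends on the number of tail terms summed rather than on `n`.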

To wrap up, let's consider the original question: take $n = 10,000$ (the number of observations) and $b = 1\,000\,000$ (the number of possible "structures," approximately). The approximate distribution for the maximum number of "shared birthdays" is

$\eqalign{ k=1: &0 \\ k=2: &0.8475+\\ k=3: &0.1520+\\ k=4: &0.0004+\\ k\gt 4: &\lt 1E-6. }$

(This is a fast calculation.) Clearly, observing one structure 10 times out of 10,000 would be highly significant. Because $n$ and $b$ are both large, I expect the approximation to work quite well here.

Incidentally, as Shane intimated, simulations can provide useful checks. A Mathematica simulation is created with a function like

simulate[n_, b_] := Max[Last[Transpose[Tally[RandomInteger[{0, b - 1}, n]]]]];

which is then iterated and summarized, as in this example which runs 10,000 iterations of the $n = 10000$ , $b = 1\,000\,000$ case:

Tally[Table[simulate[10000, 1000000], {n, 1, 10000}]] // TableForm

Its output is

2 8503

3 1493

4 4

These frequencies closely agree with those predicted by the Poisson approximation.

— whuber

What a fantastic answer, thank you very much @whuber.

— JKnight

"There is an obvious and fairly simple recursion" — Namely?

— Kodiologist

@Kodiologist I inserted a brief description of the idea.

— whuber

+1 but where in the original question did you see that n=10000 and b=1mln? The OP looks like it is asking about n=1mln and k=10000, with b unspecified (presumably b=365). Not that it matters at this point :)

— amoeba says Reinstate Monica

@amoeba After all this time (six years, 1600 answers, and closely reading tens of thousands of posts) I cannot recall, but most likely I misinterpreted the last line. In my defense, note that if we read it literally the answer is immediate (upon applying a version of the Pigeonhole Principle): it is certain that among $n$ = millions of people there will be at least one birthday that is shared among at least $x$ = thousands of them!

— whuber

It is always possible to solve this problem with a monte-carlo solution, although that's far from the most efficient. Here's a simple example of the 2 person problem in R (from a presentation I gave last year; I used this as an example of inefficient code), which could be easily adjusted to account for more than 2:

birthday.paradox <- function(n.people, n.trials) {
    matches <- 0
    for (trial in 1:n.trials) {
        # Column 1 holds the day number; column 2 flags days already taken
        birthdays <- cbind(as.matrix(1:365), rep(0, 365))
        for (person in 1:n.people) {
            day <- sample(1:365, 1, replace = TRUE)
            # A day that is already flagged means two people share it
            if (birthdays[birthdays[, 1] == day, 2] == 1) {
                matches <- matches + 1
                break
            }
            birthdays[birthdays[, 1] == day, 2] <- 1
        }
        birthdays <- NULL
    }
    print(paste("Probability of birthday matches = ", matches/n.trials))
}
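One way the adjustment to more than 2 people might look, sketched in Python instead of R. This is not a port of the function above: it counts how many people land on each date and compares the largest count with a threshold `x`. The function names, the `seed` parameter, and the default-free signature are all assumptions made for this illustration.

```python
import random
from collections import Counter


def largest_share(n, b, rng):
    """Largest number of people (out of n) who land on the same one of b dates."""
    return max(Counter(rng.randrange(b) for _ in range(n)).values())


def estimate_share_probability(n, b, x, trials, seed=0):
    """Monte Carlo estimate of P(some date is shared by x or more of n people)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    hits = sum(largest_share(n, b, rng) >= x for _ in range(trials))
    return hits / trials


print(estimate_share_probability(23, 365, 2, trials=10_000))  # roughly 0.507
```

Each trial costs time proportional to `n`, so rare events need many trials before the estimate settles.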

— Shane

I am not sure if the multiple types solution will work here.

I think that generalisation still only works for 2 or more people sharing a birthday - just that you can have different sub-classes of people.

— Simon Andrews

This is an attempt at a general solution. There may be some mistakes so use with caution!

First some notation:

Let $P(x,n)$ be the probability that $x$ or more people share a birthday among $n$ people,

and let $P(y|n)$ be the probability that exactly $y$ people share a birthday among $n$ people.

Notes:

Abuse of notation as $P(.)$ is being used in two different ways.
By definition $y$ cannot take the value of 1 as it does not make any sense and $y$ = 0 can be interpreted to mean that no one shares a common birthday.

Then the required probability is given by:

$P(x,n) = 1 - P(0|n) - P(2|n) - P(3|n) - \cdots - P(x-1|n)$

Now,

$P(y|n) = {n \choose y} (\frac{365}{365})^y \ \prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$

Here is the logic: You need the probability that exactly $y$ people share a birthday.

Step 1: You can pick $y$ people in ${n \choose y}$ ways.

Step 2: Since they share a birthday it can be any of the 365 days in a year. So, we basically have 365 choices which gives us $(\frac{365}{365})^y$ .

Step 3: The remaining $n-y$ people should not share a birthday with the first $y$ people or with each other. This reasoning gives us $\prod_{k=1}^{k=n-y}(1 -\frac{k}{365})$ .

You can check that for $x$ = 2 the above collapses to the standard birthday paradox solution.

Will this solution suffer from the curse of dimensionality? If instead of n=365, n=10^6 is this solution still feasible?

— csgillespie

Some approximations may have to be used to deal with high dimensions. Perhaps, use Stirling's approximation for factorials in the binomial coefficient. To deal with the product terms you could take logs and compute the sums instead of the products and then take the anti-log of the sum.
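The log-and-sum idea can be sketched in Python for the standard two-person case only, which is the one formula in this thread that nobody disputes. The function name and the default of 365 dates are assumptions made for this example.

```python
from math import exp, log1p


def prob_no_shared_birthday(n, b=365):
    """P(all n birthdays are distinct), via a sum of logs rather than a raw product."""
    if n > b:
        return 0.0  # pigeonhole: some date must repeat
    # log of prod_{k=1}^{n-1} (1 - k/b), accumulated as a sum
    log_p = sum(log1p(-k / b) for k in range(1, n))
    return exp(log_p)


print(1 - prob_no_shared_birthday(23))  # about 0.507
```

Using `log1p` keeps each term accurate when `k / b` is tiny, which is the regime that matters once `b` is in the millions.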

There are also several other forms of approximations possible using for example the Taylor series expansion for the exponential function. See the wiki page for these approximations: en.wikipedia.org/wiki/Birthday_problem#Approximations

Suppose y=2, n=4, and there are just two birthdays. Your formula, adapted by replacing 365 by 2, seems to say the probability that exactly 2 people share a birthday is Comb(4,2)*(2/2)^2*(1-1/2)*(1-2/2) = 0. (In fact, it's easy to see--by brute force enumeration if you like--that the probabilities that 2, 3, or 4 people share a "birthday" are 6/16, 8/16, and 2/16, respectively.) Indeed, whenever n-y >= 365, your formula yields 0, whereas as n gets large and y is fixed the probability should increase to a non-zero maximum before n reaches 365*y and then decrease, but never down to 0.

— whuber

Why are you replacing 365 by $n$? The probability that 2 people share a birthday is computed as: 1 - Prob(they have unique birthdays). Prob(that they have unique birthdays) = (364/365). The logic is as follows: Pick a person. This person can have any of the 365 days as a birthday. The second person can then only have a birthday on one of the remaining 364 days. Thus, the probability that they have unique birthdays is 364/365. I am not sure how you are calculating 6/16.