What does Bayesian Hypothesis Testing mean in the framework of inference and decision theory?



My background is mainly in machine learning, and I have been trying to learn what Bayesian hypothesis testing means. I have no problem with the Bayesian interpretation of probability, and I am comfortable with it in the context of probabilistic graphical models. What confuses me, however, is what the word "hypothesis" means in the context of statistical inference.

I think I am mostly confused between the vocabulary commonly used in machine learning and the vocabulary usually used in statistics and inference.

In the context of supervised learning, I usually think of a hypothesis as a prediction function that maps examples to their labels, i.e. h: X → Y. But in the reading I am doing, the term hypothesis does not seem to carry the same meaning. Let me paste an excerpt from the reading:

[image: excerpt from the lecture notes]

Reading carefully, one also sees the following:

"There are different models for the observed data..."

They used the word model. To me, the word model evokes a set of functions from which a specific prediction function is chosen, i.e. a hypothesis class of functions — for example, the hypothesis class H_d2 of quadratic functions (polynomials of degree 2). However, this excerpt seems to use the words model and hypothesis as synonyms (to me, entirely different words).

They then mention that we can put priors on hypotheses (a perfectly reasonable thing to do in a Bayesian setting):

p_H(H_m),     m ∈ {0, 1, ..., M−1}

We can also characterize the data under a given hypothesis:

p_{y|H}(·|H_m),     m ∈ {0, 1, ..., M−1}

and update our current beliefs given some data (and Bayes' rule):

p_{H|y}(H_m|y),     m ∈ {0, 1, ..., M−1}
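To make this notation concrete, here is a minimal Python sketch (my own illustration, not from the notes — the two Gaussian hypotheses and the data values are hypothetical) of the update p_{H|y}(H_m|y) ∝ p_{y|H}(y|H_m) · p_H(H_m) for a finite set of fully specified hypotheses:

```python
import numpy as np

def normal_pdf(y, mu, sd):
    """Density of N(mu, sd^2) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * sd ** 2)) / (sd * np.sqrt(2 * np.pi))

# Hypothetical example: M = 2 hypotheses, each a fully specified Gaussian.
# H_0: y ~ N(120, 10^2),  H_1: y ~ N(180, 10^2)
priors = np.array([0.99, 0.01])                 # p_H(H_m)
means, sd = np.array([120.0, 180.0]), 10.0

y = np.array([139.9, 125.2, 190.4])             # observed data vector

# Likelihood of the whole sample under each hypothesis: p_{y|H}(y|H_m)
likelihoods = np.array([normal_pdf(y, m, sd).prod() for m in means])

# Bayes' rule: posterior ∝ likelihood × prior, then normalize
posterior = priors * likelihoods
posterior = posterior / posterior.sum()
print(posterior)
```

Here each H_m fixes the entire distribution of the data — a candidate model — rather than a prediction function, which is the gap between the two usages of "hypothesis".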

However, I think I am used to doing Bayesian inference on a specific parameter θ of a hypothesis class, rather than on the hypothesis class as a whole. Essentially, these "hypotheses" seem different from the hypotheses in the machine learning context I am used to; they seem more like a specific parameter θ than like a hypothesis class.

At this point I was convinced that "hypothesis" meant the same thing as a prediction function (e.g. one parametrized by a parameter θ), but it seems I was wrong...

To make my confusion worse, the same reading later goes on to assign a specific "hypothesis" to each training example they observe. Let me paste the excerpt I mean:

[image: excerpt from the lecture notes]

The reason this confuses me is that, if we interpret a hypothesis as a parameter, it makes no sense to assign a specific parameter to each sample value I see. At this point I concluded that I really did not know what they meant by hypothesis, so I posted this question.

However, I did not give up completely: I looked into what a hypothesis means in frequentist statistics and found the following Khan Academy video. That video actually makes a lot of sense to me (maybe I'm a frequentist! :)). There, however, it seems that one takes a lot of data (a "sample set") and, based on properties of the sample set, decides whether to accept or reject a null hypothesis about the data. But in the Bayesian context I am reading about, it seems that each observed data vector is "labeled" with a hypothesis via a "likelihood ratio test":

[image: excerpt from the lecture notes]

The way a hypothesis is assigned to each data sample also looks like a supervised learning setting, where each training example gets a label. But I don't think that is what they are doing in this context. What are they doing? What does it mean to assign a hypothesis to each data sample? What is the meaning of a hypothesis? What do they mean by the word model?

Basically, after this long explanation of my confusion: does anyone know what Bayesian hypothesis testing means in this context?


I am happy to improve my question, or to add clarifications, if anything is needed for the question to make sense :)


While looking for an answer, I found some useful things related to statistical hypothesis testing.

This is an excellent introduction to the topic if you come from a CS background (like me):

What is a good introduction to statistical hypothesis testing for computer scientists?

At some point I asked about "default parameters" (I should have defined what I meant; I thought it was standard terminology, but it is not, so let me explain here: for every hypothesis you have, you specify a parameter). For example, how do you decide on the null hypothesis and its parameter? There is a question related to that:

How to specify a null hypothesis in hypothesis testing


@Xi'an I read the following Wikipedia article: en.wikipedia.org/wiki/Statistical_model — is that what they mean by model and hypothesis? thnx for the patience btw :)
Pinocchio

I am not getting into a full answer because I believe your problem is not specifically what hypothesis testing is in the Bayesian framework, but understanding what hypothesis testing means in general. To help with this, I recommend you take a look at the book "Modes of Parametric Statistical Inference" by Geisser: books.google.ca/...
Rocinante

@rocinante I think I agree with you. I am clearly confused about hypothesis testing in general (and the Bayesian framework does not help at all). I will definitely look into it. Thanks for your understanding and help.
Pinocchio

1/2: It is not easy to understand because it is not easy to express concisely. It might help to think about this with a simpler example rather than in abstract terms (such as maps).

2/2: Say you have a coin and you want to check whether it is fair, so you flip it 50 times. Now you have a dataset from which you want to infer something (namely, whether the coin is biased). Logically, if the coin is fair, about half of the tosses should come up heads. (Note that this is your own logical reasoning, not a statistical derivation.) That is your hypothesis. This hypothesis can be tested in two ways: the Bayesian way and the frequentist way.
Rocinante
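That coin example can be sketched numerically (my own illustration, with hypothetical numbers — 35 heads in 50 flips). The frequentist route asks how surprising the data are under H0: p = 0.5; the Bayesian route puts priors on candidate hypotheses and computes their posterior probabilities:

```python
from math import comb

n, heads = 50, 35                       # hypothetical outcome: 35 heads in 50 flips

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Frequentist: two-sided p-value under H0: p = 0.5 — total probability of
# any outcome at least as extreme (as improbable) as the one observed
p_value = sum(binom_pmf(k, n, 0.5) for k in range(n + 1)
              if binom_pmf(k, n, 0.5) <= binom_pmf(heads, n, 0.5))

# Bayesian: two simple hypotheses, H0: p = 0.5 vs H1: p = 0.7, equal priors
lik0 = binom_pmf(heads, n, 0.5)
lik1 = binom_pmf(heads, n, 0.7)
post0 = 0.5 * lik0 / (0.5 * lik0 + 0.5 * lik1)

print(p_value, post0)
```

With 35 heads both routes point the same way: the p-value is small enough to reject fairness, and the posterior probability of H0 is low.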

Answers:



A statistical model is given by a family of probability distributions. If the model is parametric, this family is indexed by an unknown parameter θ:

F = {f(·|θ); θ ∈ Θ}

If you want to test a hypothesis on θ such as H0: θ ∈ Θ0, you can consider two models in opposition: F versus

F0 = {f(·|θ); θ ∈ Θ0}

From my Bayesian perspective, I draw inference on the index m of the model behind the data. So I put a prior on this index, ρ0 for model F0 and ρa = 1 − ρ0 for model F, as well as priors on the parameters of both models: π0(θ) over Θ0 and πa(θ) over Θ. I then derive the posterior distribution of this index:

π(m = 0|x) = ρ0 ∫_{Θ0} f(x|θ) π0(θ) dθ / [ ρ0 ∫_{Θ0} f(x|θ) π0(θ) dθ + (1 − ρ0) ∫_Θ f(x|θ) πa(θ) dθ ]

The document you linked goes into much greater detail and should be your entry of choice into statistical testing of hypotheses, unless you can afford to read through an entire Bayesian book — or even a machine learning book like Kevin Murphy's.

For example, if X ~ N(θ, 1) and the null hypothesis is H0: θ = 0, the model under H0 is N(0, 1), while under the alternative θ is unknown, say θ ~ N(0, 10); take ρ0 = 1/2. Then

π(m = 0|x) = (1/√(2π)) exp{−x²/2} / [ (1/√(2π)) exp{−x²/2} + ∫_R (1/√(2π)) exp{−(x − θ)²/2} (1/√(20π)) exp{−θ²/20} dθ ]
           = exp{−x²/2} / [ exp{−x²/2} + (1/√11) exp{−x²/22} ]
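As a numerical sanity check of this π(m = 0|x) example (my own sketch, not part of the answer; a grid sum stands in for the integral over θ, and x = 1.5 is a hypothetical observation):

```python
import numpy as np

x = 1.5                                  # a hypothetical single observation
rho0 = 0.5                               # prior probability of H0

# Marginal density of x under H0: N(0, 1)
m0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

# Marginal under the alternative: integrate N(x | theta, 1) * N(theta | 0, 10)
theta = np.linspace(-40, 40, 200001)
dtheta = theta[1] - theta[0]
integrand = (np.exp(-(x - theta) ** 2 / 2) / np.sqrt(2 * np.pi)
             * np.exp(-theta ** 2 / 20) / np.sqrt(20 * np.pi))
ma = integrand.sum() * dtheta            # simple Riemann sum

post0_numeric = rho0 * m0 / (rho0 * m0 + (1 - rho0) * ma)

# Closed form: under the alternative, theta integrates out to x ~ N(0, 11)
post0_closed = np.exp(-x ** 2 / 2) / (np.exp(-x ** 2 / 2)
                                      + np.exp(-x ** 2 / 22) / np.sqrt(11))
print(post0_numeric, post0_closed)
```

The two computations agree, confirming that under the alternative the parameter θ is integrated out, leaving the marginal N(0, 11) for x.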

sorry if my question is a little repetitive, but I am still unsure what hypothesis means, at least in the context of the document I provided. When a probability pH(H0) is assigned, does it mean that we have an uncertainty (in the notation from your answer) on F0 or on a specific θ ∈ F0? For example, when it says py|H(y|H0), does it mean the likelihood of some specific data y given that the data come from the family of distributions specified by H0, or from some specific parameter specified by H0?
Pinocchio

Or maybe a hypothesis Hm indicates both (as a pair): a specific parametrization θ from a specific family Fm, i.e. Hm = (θ, Fm) where θ ∈ Fm. Btw, I appreciate your time and help immensely. Thnx :)
Pinocchio

The pair you mention is the (model index, parameter value), both of which are endowed with prior probabilities. So ϱ0 is the prior probability or belief that the model H0 (or F0) is the right one (with the default choice ϱ0 = 1/2), and π0(θ) is the prior distribution on the parameter θ of the model under H0.
Xi'an

so if a hypothesis is a tuple of a proposed statistical model and a default parameter, how is the default parameter chosen?
Pinocchio

I do not understand what you mean by "default parameter": an hypothesis is either a model with all parameters fixed to known values (like θ = 0 in the above example) or one with some parameters unknown. In the latter case, a Bayesian approach implies putting prior distributions on those unknowns.
Xi'an


Excellent question. I think your confusion may result from some of the basic differences between the "frequentist" and "Bayesian" perspectives. I have a lot of experience with the former and am new to the latter, so attempting a few simple observations might help me too. I edited your question to make a few distinctions clear - at least, as I understand them. I hope you don't mind! If I got something wrong, you could re-edit your question or add a comment on this response.

1) At the risk of sounding somewhat too elementary: A model is any statement that attempts an explanation of reality, like "If I had pancakes for breakfast, it must be Tuesday." As such, a model is an hypothesis. A famous quote by George Box: "All models are wrong, some models are useful." For a model to be useful there must be some way to test it. Enter the concept of competing hypotheses and the answer to one of your questions. I would suggest that "...in the context of statistical inference," an hypothesis is any model that may be useful and can be tested mathematically. So hypothesis testing is a means of making a decision about whether a model is useful or not. In summary, an hypothesis is a model under consideration. It could be different parameter values of the same function or different functions. I think your lecture notes are showing that different outcomes (measurements) in the sample space would make different hypotheses (Is the intercept parameter zero? Do I need a cube in that polynomial? Maybe it's really exponential?) more or less likely.

2) Your Khan video is an example of what Bayesians call the "Frequentist" approach to hypothesis testing, so it may have confused you when trying to apply it to your lecture notes, which are Bayesian. I have been trying to come up with a simple distinction between application of the two approaches (which may be dangerous). I think I understand the philosophical distinction reasonably well. From what I have seen, the "Frequentist" assumes a random component to the data and tests how likely the observed data are given non-random parameters. The "Bayesian" assumes the data are fixed and determines the most likely value of random parameters. This difference leads to different testing methods.

In "Frequentist" hypothesis testing, a model that may be useful is one which explains some effect, so it is compared with the "null hypothesis" - the model of no effect. The attempt is made to set up a useful model that is mutually exclusive to the model of no effect. The test is then on the probability of observing the data under the assumption of no effect. If that probability is found to be low, the null hypothesis is rejected and the alternative is all that's left. (Note that a purist would never "accept" the null hypothesis, only "fail to reject" one. It may sound like angels dancing on the head of a pin, but the distinction is a fundamental philosophical one.) Intro statistics usually starts with what may be the simplest example: "Two groups are different." The null hypothesis that they are not different is tested by calculating how likely it would be to observe differences as great as or greater than those measured by a random experiment, given that they are not different. This is usually a t-test where the null hypothesis is that the difference of the means is zero. So the parameter is the mean at a fixed value of zero.
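As a concrete sketch of that frequentist recipe (my own illustration with hypothetical measurements; Welch's t statistic for two groups):

```python
import math
import statistics

# Hypothetical measurements from two groups
a = [118, 121, 119, 125, 116, 120, 122, 117]
b = [131, 128, 135, 126, 133, 129, 132, 130]

# Welch's t statistic: difference of means scaled by its standard error
mean_a, mean_b = statistics.mean(a), statistics.mean(b)
var_a, var_b = statistics.variance(a), statistics.variance(b)
t = (mean_b - mean_a) / math.sqrt(var_a / len(a) + var_b / len(b))

# Under H0 (difference of means is zero), a |t| this large would be very
# surprising, so the null hypothesis of no difference is rejected
print(t)
```

The single number t summarizes how surprising the observed difference would be if the "no effect" model were true; a p-value is then read off the t distribution.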

The Bayesian says, "Hold on a minute, we made those measurements and they are different, so how likely is that?" They calculate the probability for every value of the (now) random parameter and pick the one that is highest as the most likely. So in a sense, every possible value of the parameter is a separate model. But now they need a way to make a decision about whether the model with the highest probability is different enough to matter. That's why your lecture notes introduced the cost function. To make a good decision, some assumption of the consequences of making the wrong decision is needed.
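And a sketch of the Bayesian counterpart just described (my own illustration, hypothetical data, known noise level): treat every candidate value of the parameter as a separate model, compute its posterior probability, and pick the mode:

```python
import math

# Hypothetical measurements; standard deviation assumed known (= 3)
y = [131, 128, 135, 126, 133, 129, 132, 130]
sd = 3.0

# Grid of candidate means: each value is, in effect, its own hypothesis
grid = [120 + 0.1 * i for i in range(201)]     # 120.0 .. 140.0
prior = [1.0 / len(grid)] * len(grid)          # flat prior over the grid

def log_likelihood(mu):
    return sum(-(v - mu) ** 2 / (2 * sd ** 2) for v in y)

# Posterior ∝ likelihood × prior, normalized over the grid
post = [p * math.exp(log_likelihood(mu)) for p, mu in zip(prior, grid)]
total = sum(post)
post = [p / total for p in post]

map_mu = grid[post.index(max(post))]           # posterior mode (MAP)
print(map_mu)
```

With a flat prior the posterior mode lands on the sample mean; an informative prior, or a cost function over decisions, would shift or threshold this choice.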

3) "What does it mean to assign a hypothesis to each data sample?" I don't think they are. Be careful with what is meant by "sample point." I believe they are referring to a particular sample vector and want to know how likely each hypothesis is for all sample vectors in the sample space. Equations (14) and (15) show how to compare two hypotheses for a particular sample vector. So they are simplifying a general argument of comparing multiple hypotheses by showing how to compare only two.



Say you have data from a set of boxes. The data consists of Length (L), Width (W), Height (H), and Volume (V).

If we don't know much about boxes/geometry we might try the model:

V = a*L + b*W + c*H + e

This model has three parameters (a, b, c) that could be varied, plus an error/cost term (e) describing how well the hypothesis fits the data. Each combination of parameter values would be considered a different hypothesis. The "default" parameter value chosen is usually zero, which in the above example would correspond to "no relationship" between V and L, W, H.

What people do is test this "default" hypothesis by checking if e is beyond some cutoff value, usually by calculating a p-value assuming a normal distribution of error around the model fit. If that hypothesis is rejected, then they find the combination of a, b, c parameters that maximizes the likelihood and present this as the most likely hypothesis. If they are Bayesian, they multiply the likelihood by the prior for each set of parameter values and choose the solution that maximizes the posterior probability.

Obviously this strategy is non-optimal in that the model assumes additivity, and will miss that the correct hypothesis is:

V = L*W*H + e
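To make that concrete, here is a sketch in Python (my own illustration with simulated box data): fit the additive model by least squares (the maximum-likelihood fit under Gaussian error) and compare its residuals with those of the correct multiplicative model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical box data: the true relationship is multiplicative
L, W, H = rng.uniform(1, 5, size=(3, n))
V = L * W * H + rng.normal(0, 0.5, size=n)

# Additive model V ≈ a*L + b*W + c*H: least squares = maximum likelihood
X = np.column_stack([L, W, H])
coef, *_ = np.linalg.lstsq(X, V, rcond=None)

resid_additive = V - X @ coef
resid_multiplicative = V - L * W * H

# The additive family simply cannot express the true relationship,
# so even its best member leaves large residuals
print(np.var(resid_additive), np.var(resid_multiplicative))
```

Whatever a, b, c are chosen, the additive family never contains the true model; no amount of testing within that family fixes a mis-specified model class.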

Edit: @Pinocchio

Perhaps someone disagreed with the claim that hypothesis testing is non-optimal when there is no rational reason to choose one/few functions (or as you put it: "hypothesis classes") out of the infinitely many possible. Of course this is trivially true, and "optimal" can be used in the limited sense of "best fit given the cost function and choices supplied". That comment made it into my answer because I disliked how the issue of model specification was glossed over in your class notes. It is the main problem facing most scientific workers, for which afaik there is no algorithm.

Further, I could not understand p-values, hypothesis testing, etc until I understood the history, so perhaps it will help you as well. There are multiple sources of confusion surrounding frequentist hypothesis testing (I am not so familiar with the history of the bayesian variant).

There is what was originally called "hypothesis testing" in the Neyman-Pearson sense, "significance testing" as developed by Ronald Fisher, and also an ill defined, never properly justified "hybrid" of these two strategies widely used throughout the sciences (which may be casually referred to using either above term, or "null hypothesis significance testing"). While I wouldn't recommend taking a wikipedia page as authoritative, many sources discussing these issues can be found here. Some main points:

  1. The use of a "default" hypothesis is not part of the original hypothesis testing procedure, rather the user is supposed to use prior knowledge to determine the models under consideration. I have never seen explicit recommendation by proponents of this model regarding what to do if we have no particular reason to choose a given set of hypotheses to compare. It is often said that this approach is suitable for quality control, when there are known tolerances to compare some measurement to.

  2. There is no alternative hypothesis under Fisher's "significance testing" paradigm, only a null hypothesis, which can be rejected if deemed unlikely given the data. From my reading, Fisher himself was equivocal on the use of default null hypotheses. I could never find him commenting explicitly on the matter, however he surely did not recommend that this should be the only null hypothesis.

  3. The use of the default null hypothesis is sometimes construed as an "abuse" of hypothesis testing, but it is central to the popular hybrid method mentioned. The argument goes that this practice is often "a useless preliminary":

    "The researcher formulates a theoretical prediction, generally the direction of an effect... When the data in fact show the predicted directional result, this seems to confirm the hypothesis. The researcher tests a 'straw person' null hypothesis that the effect is actually zero. If the latter cannot be rejected at the .05 level (or some variant), then the apparent confirmation of the theory cannot be claimed...A common error in this type of test is to confuse the significance level actually attained (for rejecting the straw-person null) with the confirmation level attained for the original theory... the strength of confirmation actually depends on [the sharpness of a researcher's numerical predictions], not on the significance level attained for a straw-person null."

    The null hypothesis testing controversy in psychology. David H Krantz. Journal of the American Statistical Association; Dec 1999; 94, 448; 1372-1381

The Khan academy video is an example of this hybrid method, and is guilty of committing the error noted in that quote. From the information available in that video we can only conclude that the injected rats differ from the non-injected, while the video claims we can conclude "the drug definitely has some effect". A bit of reflection would lead us to consider that perhaps the tested rats were older than the non-injected, etc. We need to rule out plausible alternative explanations before claiming evidence for our theory. The less specific the prediction of the theory, the more difficult it is to accomplish this.

Edit 2:

Perhaps taking the example from your notes of a medical diagnosis will help. Say a patient can be either "normal" or in "hypertensive crisis".

We have prior information that only 1% of people are in hypertensive crisis. People in hypertensive crisis have systolic blood pressure that follows a normal distribution with mean=180 and sd=10. Meanwhile, normal people have blood pressure from a normal distribution with mean=120, sd=10. The cost of judging a person normal when they are is zero, the cost of missing a diagnosis is 1, and the cost due to side effects due to the treatment is 0.2 regardless of whether they are in crisis or not. Then the following R code calculates the threshold (eta) and likelihood ratio. If the likelihood ratio is greater than the threshold we decide to treat, if less than we do not:

#Prior probabilities
P0=.99 #Prior probability patient is normal
P1=1-P0 #Prior probability patient is in crisis

#Hypotheses
H0<-dnorm(x=50:250, mean=120, sd=10) #H0: Patient is normal
H1<-dnorm(x=50:250, mean=180, sd=10) #H1: Patient in hypertensive crisis

#Costs
C00=0 #Decide normal when normal
C01=1 #Decide normal when in crisis
C10=.2 #Decide crisis when normal
C11=.2 #Decide crisis when in crisis

#Threshold
eta=P0*(C10-C00) / (P1*(C01-C11))

#Blood Pressure Measurements
y<-rnorm(3, 150, 20)

#Calculate Likelihood of Each Datapoint Given Each Hypothesis
L0vec=dnorm(x=y, mean=120, sd=10) #Vector of Likelihoods under H0
L1vec=dnorm(x=y, mean=180, sd=10) #Vector of Likelihoods under H1

#P(y|H) is the product of the likelihoods under each hypothesis
L0<-prod(L0vec)
L1<-prod(L1vec)

#L(y) is the ratio of the two likelihoods
LikRatio<-L1/L0


#Plot
plot(50:250, H0, type="l", col="Green", lwd=4, 
     xlab=" Systolic Blood Pressure", ylab="Probability Density Given Model",
     main=paste0("L=",signif(LikRatio,3)," eta=", signif(eta,3)))
lines(50:250, H1, col="Red", lwd=4)
abline(v=y)

#Decision
if(LikRatio>eta){
  print("L > eta  ---> Decision: Treat Patient")
}else{
  print("L < eta  ---> Do Not Treat Patient")
}

In the above scenario the threshold eta = 24.75 (= 0.99*0.2 / (0.01*0.8)). If we take three blood pressure measurements and get 139.9237, 125.2278, 190.3765, then the likelihood ratio is 27.6 in favor of H1: Patient in hypertensive crisis. Since 27.6 is greater than the threshold, we would choose to treat. The graph shows the normal hypothesis in green and the hypertensive one in red. Vertical black lines indicate the values of the observations.

[plot: the two hypothesis densities (normal in green, hypertensive in red) with the observed measurements marked]


can the person that downvoted this explain? What's wrong with this answer? :S
Pinocchio

@Pinocchio I have attempted to clarify things with some history in the answer, "hypothesis testing" is a difficult subject to discuss clearly due to that. I think I have answered the questions regarding how the terms model/ hypothesis are used but do not understand this one: 'What does it mean to assign a hypothesis to each data sample?'
Livid

I can't understand why this answer was downvoted, and why it's not more upvoted. It is truly excellent. It could use a bit more theoretical definition, but it is clearly oriented towards a broader audience than statisticians. The first example using a GLM was particularly enlightening and totally in line with my (numerous) academic readings. The bottom line is that the main difference between frequentist and Bayesian hypothesis testing is the accounting of the prior in order to compute the MAP (instead of only the MLE).
gaborous

I might add that a graphical representation of the first example with the GLM would be awesome and very enlightening, maybe using a kind of leverage plot?
gaborous
Licensed under cc by-sa 3.0 with attribution required.