Motivation of the Expectation-Maximization algorithm

20

In the EM algorithm approach we use Jensen's inequality to arrive at

$\log p(x|\theta) \geq \int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz - \int \log p(z|x,\theta) p(z|x,\theta^{(k)})dz$

and define $\theta^{(k+1)}$ by

$\theta^{(k+1)}=\arg \max_{\theta}\int \log p(z,x|\theta) p(z|x,\theta^{(k)}) dz$

Everything I read about EM just plops this down, and I have always felt uneasy about not having an explanation of why the EM algorithm arises naturally. I understand that the $\log$ likelihood is typically used so that we deal with addition instead of multiplication, but the appearance of $\log$ in the definition of $\theta^{(k+1)}$ feels unmotivated to me. Why should one consider $\log$ and not other monotonic functions? For various reasons I suspect that the "meaning" or "motivation" behind expectation maximization has some kind of explanation in terms of information theory and sufficient statistics. If there were such an explanation, it would be much more satisfying than just an abstract algorithm.

mixture expectation-maximization

— user782220

3

"What is the expectation maximization algorithm?", Nature Biotechnology 26: 897–899 (2008), has a nice picture that illustrates how the algorithm works.

— chl

@chl: I have seen that article. My point is that nowhere does it explain why a non-log approach cannot work.

— user782220

10

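The EM algorithm has different interpretations and can arise in different forms in different applications.

It all starts with the likelihood function $p(x \vert \theta)$ or, equivalently, the log-likelihood function $\log p(x \vert \theta)$ that we would like to maximize. (We generally use the logarithm because it simplifies the calculation: it is strictly monotone and concave, and $\log(ab) = \log a + \log b$.) In an ideal world, the value of $p$ depends only on the model parameters $\theta$, so we can search through the space of $\theta$ and find the value that maximizes $p$.

However, in many interesting real-world applications things are more complicated, because not all the variables are observed. Yes, we might directly observe $x$, but some other variables $z$ are unobserved. Because of the missing variables $z$, we are in a kind of chicken-and-egg situation: without $z$ we cannot estimate the parameters $\theta$, and without $\theta$ we cannot infer what the value of $z$ may be.

This is where the EM algorithm comes into play. We start with an initial guess of the model parameters $\theta$ and derive the expected values of the missing variables $z$ (the E step). Once we have the values of $z$, we can maximize the likelihood with respect to the parameters $\theta$ (the M step, corresponding to the $\arg \max$ equation in the problem statement). With this $\theta$ we can derive new expected values of $z$ (another E step), and so on. In other words, in each step we assume that one of the two, $z$ or $\theta$, is known. We repeat this iterative process until the likelihood cannot be increased any further.

This is the EM algorithm in a nutshell. It is well known that the likelihood never decreases during this iterative process. But keep in mind that the EM algorithm does not guarantee a global optimum: it might end up in a local optimum of the likelihood function.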
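To make the E step / M step alternation concrete, here is a minimal sketch for a mixture of two one-dimensional Gaussians. The model, the synthetic data, the starting values, and all function names (`normal_pdf`, `log_likelihood`, `em_step`) are assumptions chosen for illustration; they are not taken from the question. It uses only the Python standard library.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """log p(x | theta), summed over all observations."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_step(data, weights, means, variances):
    k = len(weights)
    # E step: posterior probability of each hidden label z given x and the current theta.
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M step: closed-form maximiser of the expected complete-data log-likelihood.
    nk = [sum(r[j] for r in resp) for j in range(k)]
    new_weights = [n / len(data) for n in nk]
    new_means = [sum(r[j] * x for r, x in zip(resp, data)) / nk[j] for j in range(k)]
    new_vars = [sum(r[j] * (x - new_means[j]) ** 2 for r, x in zip(resp, data)) / nk[j]
                for j in range(k)]
    return new_weights, new_means, new_vars

# Synthetic data from two Gaussians (hypothetical example data).
rng = random.Random(0)
data = [rng.gauss(-2.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 0.5) for _ in range(200)]

# Initial guess for theta (an arbitrary choice).
centre = sum(data) / len(data)
spread = sum((x - centre) ** 2 for x in data) / len(data)
weights, means, variances = [0.5, 0.5], [min(data), max(data)], [spread, spread]

# Track the log-likelihood after every EM iteration.
history = [log_likelihood(data, weights, means, variances)]
for _ in range(50):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

The list `history` should never decrease from one entry to the next, which is the monotonicity property mentioned above. A different starting guess may lead to a different local optimum.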

The appearance of $\log$ in the equation for $\theta^{(k+1)}$ is inevitable, because here the function you would like to maximize is written as a log-likelihood.

— Weiwei

I don't see how this answers the question.

— broncoAbierto

9

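Likelihood vs. log-likelihood

As has already been said, the $\log$ is introduced in maximum likelihood simply because it is generally easier to optimize sums than products. The reason we do not consider other monotonic functions is that the logarithm is the unique function with the property of turning products into sums.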
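A short sketch of why this holds, under the added assumption (not stated above) that $f$ is continuous on $(0, \infty)$: substituting $g(t) = f(e^t)$ turns the product rule into Cauchy's additive equation, whose continuous solutions are linear.

$f(ab) = f(a) + f(b) \;\Longrightarrow\; g(s + t) = g(s) + g(t) \;\Longrightarrow\; g(t) = c\,t \;\Longrightarrow\; f(x) = c \log x$

So the logarithm is unique only up to the constant $c$, that is, up to the choice of base.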

Another way to motivate the logarithm is the following: instead of maximizing the probability of the data under our model, we could equivalently try to minimize the Kullback-Leibler divergence between the data distribution, $p_\text{data}(x)$, and the model distribution, $p(x \mid \theta)$,

$D_\text{KL}[p_\text{data}(x) \mid\mid p(x \mid \theta)] = \int p_\text{data}(x) \log \frac{p_\text{data}(x)}{p(x \mid \theta)} \, dx = const - \int p_\text{data}(x)\log p(x \mid \theta) \, dx.$

The first term on the right-hand side is constant in the parameters. If we have $N$ samples from the data distribution (our data points), we can approximate the second term with the average log-likelihood of the data,

$\int p_\text{data}(x)\log p(x \mid \theta) \, dx \approx \frac{1}{N} \sum_n \log p(x_n \mid \theta).$
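A small numerical check of this equivalence, as a sketch: the two-outcome data distribution and the Bernoulli model below are hypothetical choices made only to keep the example short. Minimizing the KL divergence over a grid of $\theta$ values picks the same $\theta$ as maximizing the expected log-likelihood, and the two quantities sum to a constant that does not depend on $\theta$.

```python
import math

# Hypothetical data distribution over the outcomes 0 and 1.
p_data = {0: 0.3, 1: 0.7}

def model(x, theta):
    """Bernoulli model p(x | theta) with theta = P(x = 1)."""
    return theta if x == 1 else 1.0 - theta

def kl_divergence(theta):
    """D_KL[p_data || p(. | theta)] for the discrete case."""
    return sum(p * math.log(p / model(x, theta)) for x, p in p_data.items())

def expected_log_likelihood(theta):
    """Sum over x of p_data(x) * log p(x | theta)."""
    return sum(p * math.log(model(x, theta)) for x, p in p_data.items())

# Grid search over theta in (0, 1).
thetas = [i / 100 for i in range(1, 100)]
best_kl = min(thetas, key=kl_divergence)
best_ll = max(thetas, key=expected_log_likelihood)

# The theta-independent term: sum over x of p_data(x) * log p_data(x).
const = sum(p * math.log(p) for p in p_data.values())
```

Here `best_kl` and `best_ll` coincide, and `kl_divergence(t) + expected_log_likelihood(t)` equals `const` for every `t` on the grid.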

An alternative view of EM

I am not sure this is the kind of explanation you are looking for, but I found the following view of expectation maximization much more enlightening than its motivation via Jensen's inequality (you can find a detailed description in Neal & Hinton (1998) or in Chris Bishop's PRML book, Chapter 9.3).

It is not difficult to show that

$\log p(x \mid \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz + D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)]$

for any $q(z \mid x)$. If we call the first term on the right-hand side $F(q, \theta)$, this implies that

$F(q, \theta) = \int q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \, dz = \log p(x \mid \theta) - D_\text{KL}[q(z \mid x) \mid\mid p(z \mid x, \theta)].$

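Because the KL divergence is always non-negative, $F(q, \theta)$ is a lower bound on the log-likelihood for every fixed $q$. EM can now be viewed as alternately maximizing $F$ with respect to $q$ and $\theta$. In particular, by setting $q(z \mid x) = p(z \mid x, \theta)$ in the E step, we minimize the KL divergence on the right-hand side and thus maximize $F$.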
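The decomposition can be checked numerically for a discrete hidden variable. In this sketch the three joint values $p(x, z \mid \theta)$ for a single fixed $x$ and the distribution `q` are made-up numbers, and the function names are illustrative.

```python
import math

# Hypothetical values of p(x, z | theta) for one observed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                        # p(x | theta)
posterior = [j / evidence for j in joint]    # p(z | x, theta)

def free_energy(q):
    """F(q, theta) = sum_z q(z) * log(p(x, z | theta) / q(z))."""
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)

def kl(q, p):
    """D_KL[q || p] for discrete distributions."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

q = [0.5, 0.3, 0.2]                          # an arbitrary distribution over z
gap = math.log(evidence) - free_energy(q)    # equals kl(q, posterior)
```

For any valid `q`, `gap` matches `kl(q, posterior)` and is non-negative. Choosing `q = posterior`, as the E step does, closes the gap so that `free_energy(posterior)` equals `math.log(evidence)`.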

— Lucas

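Thank you for the post! However, the linked document does not say that the logarithm is the unique function that turns products into sums. It says that the logarithm is the only function that fulfills all three listed properties at the same time.

— Weiwei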

@Weiwei: Right, but the first condition mainly requires that the function is invertible. Of course, $f(x) = 0$ also satisfies $f(xy) = f(x) + f(y)$, but this is an uninteresting case. The third condition asks that the derivative at 1 is 1, which is only true for the logarithm to base $e$. Drop this constraint and you get logarithms to different bases, but still logarithms.

— Lucas

4

The paper that I found clarifying with respect to expectation-maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.

Suppose we have a probabilistic model $p(x,z,\theta)$ with $x$ observations, $z$ hidden random variables, and a total of $\theta$ parameters. We are given a dataset $D$ and are forced (by higher powers) to establish $p(z,\theta|D)$ .

1. Gibbs sampling

We can approximate $p(z,\theta|D)$ by sampling. Gibbs sampling gives $p(z,\theta|D)$ by alternating:

$\theta \sim p(\theta|z,D) \\ z \sim p(z|\theta,D)$

2. Variational Bayes

Instead of sampling, we can introduce distributions $q(\theta)$ and $q(z)$ and make their product as close as possible to the target $p(\theta,z|D)$. To minimize $KL[q(\theta)q(z)||p(\theta,z|D)]$ we update:

$q(\theta) \propto \exp (E [\log p(\theta,z,D) ]_{q(z)} ) \\ q(z) \propto \exp (E [\log p(\theta,z,D) ]_{q(\theta)} )$

3. Expectation-Maximization

To come up with full-fledged probability distributions for both $z$ and $\theta$ might be considered extreme. Why don't we instead consider a point estimate for one of these and keep the other nice and nuanced? In EM the parameter $\theta$ is established as the one being unworthy of a full distribution, and set to its MAP (Maximum A Posteriori) value, $\theta^*$.

$\theta^* = \underset{\theta}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(z)} \\ q(z) = p(z|\theta^*,D)$

Here $\theta^* \in \operatorname{argmax}$ would actually be a better notation: the argmax operator can return multiple values. But let's not nitpick. Compared to variational Bayes you see that correcting for the $\log$ by $\exp$ doesn't change the result, so that is not necessary anymore.

4. Maximization-Expectation

There is no reason to treat $z$ as a spoiled child. We can just as well use point estimates $z^*$ for our hidden variables and give the parameters $\theta$ the luxury of a full distribution.

$z^* = \underset{z}{\operatorname{argmax}} E [\log p(\theta,z,D) ]_{q(\theta)} \\ q(\theta) = p(\theta|z^*,D)$

If our hidden variables $z$ are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is in other words: model selection (or automatic relevance detection or imagine another fancy name).

5. Iterated conditional modes

Of course, the poster child of approximate inference is to use point estimates for both the parameters $\theta$ and the hidden variables $z$.

$\theta^* = \underset{\theta}{\operatorname{argmax}} p(\theta,z^*,D) \\ z^* = \underset{z}{\operatorname{argmax}} p(\theta^*,z,D)$
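As a rough illustration of alternating point estimates, consider a special case that is my own assumption and not taken from the article: one-dimensional clusters with equal weights, a shared fixed variance, and a flat prior on the means. Under those assumptions the two argmax updates reduce to nearest-mean assignment and cluster averaging, which is the familiar k-means iteration. The names `assign`, `update`, and `icm` are hypothetical.

```python
import random

def assign(data, means):
    """z* update: label each point with the index of its nearest mean."""
    return [min(range(len(means)), key=lambda k: (x - means[k]) ** 2) for x in data]

def update(data, labels, means):
    """theta* update: each mean becomes the average of its assigned points."""
    new_means = []
    for k in range(len(means)):
        members = [x for x, z in zip(data, labels) if z == k]
        # Keep the old mean if a cluster happens to be empty.
        new_means.append(sum(members) / len(members) if members else means[k])
    return new_means

def icm(data, means, n_iter=20):
    """Alternate the two point-estimate updates for a fixed number of rounds."""
    labels = assign(data, means)
    for _ in range(n_iter):
        means = update(data, labels, means)
        labels = assign(data, means)
    return means, labels

# Hypothetical, well-separated synthetic data.
rng = random.Random(1)
data = [rng.gauss(0.0, 0.5) for _ in range(100)] + [rng.gauss(10.0, 0.5) for _ in range(100)]
final_means, final_labels = icm(data, [min(data), max(data)])
```

Neither side carries any uncertainty here, which is what distinguishes this scheme from EM (soft $q(z)$) and from Maximization-Expectation (full $q(\theta)$).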

To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, the strength of this article is however not the application to a $k$ -means alternative, but this lucid and concise exposition of approximation.

— Anne van Rossum

(+1) this is a beautiful summary of all methods.

— kedarps

4

There is a useful optimisation technique underlying the EM algorithm. However, it's usually expressed in the language of probability theory so it's hard to see that at the core is a method that has nothing to do with probability and expectation.

Consider the problem of maximising $g(x)=\sum_i\exp(f_i(x))$ (or equivalently $\log g(x)$) with respect to $x$. If you write down an expression for $g'(x)$ and set it equal to zero you will often end up with a transcendental equation to solve. These can be nasty.

Now suppose that the $f_i$ play well together in the sense that linear combinations of them give you something easy to optimise. For example, if all of the $f_i(x)$ are quadratic in $x$ then a linear combination of the $f_i(x)$ will also be quadratic, and hence easy to optimise.

Given this supposition, it'd be cool if, in order to optimise $\log g(x)=\log \sum_i\exp(f_i(x))$ we could somehow shuffle the $\log$ past the $\sum$ so it could meet the $\exp$ s and eliminate them. Then the $f_i$ could play together. But we can't do that.

Let's do the next best thing. We'll make another function $h$ that is similar to $g$ . And we'll make it out of linear combinations of the $f_i$ .

Let's say $x_0$ is a guess for an optimal value. We'd like to improve this. Let's find another function $h$ that matches $g$ and its derivative at $x_0$ , i.e. $g(x_0)=h(x_0)$ and $g'(x_0)=h'(x_0)$ . If you plot a graph of $h$ in a small neighbourhood of $x_0$ it's going to look similar to $g$ .

You can show that

$g'(x)=\sum_i f_i'(x)\exp(f_i(x)).$

We want something that matches this at $x_0$. There's a natural choice:

$h(x)=\mbox{constant}+\sum_i f_i(x)\exp(f_i(x_0)).$

You can see they match at $x=x_0$. We get

$h'(x)=\sum_i f_i'(x)\exp(f_i(x_0)).$

As $x_0$ is a constant we have a simple linear combination of the $f_i$ whose derivative matches that of $g$ at $x_0$. We just have to choose the constant in $h$ to make $g(x_0)=h(x_0)$.

So starting with $x_0$ , we form $h(x)$ and optimise that. Because it's similar to $g(x)$ in the neighbourhood of $x_0$ we hope the optimum of $h$ is similar to the optimum of $g$ . Once you have a new estimate, construct the next $h$ and repeat.

I hope this has motivated the choice of $h$ . This is exactly the procedure that takes place in EM.

But there's one more important point. Using Jensen's inequality you can show that $h(x)\le g(x)$ . This means that when you optimise $h(x)$ you always get an $x$ that makes $g$ bigger compared to $g(x_0)$ . So even though $h$ was motivated by its local similarity to $g$ , it's safe to globally maximise $h$ at each iteration. The hope I mentioned above isn't required.
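Here is a minimal sketch of that iteration for the quadratic case, assuming $f_i(x) = b_i - a_i (x - c_i)^2$ with $a_i > 0$. The coefficients and the starting point are made-up values for illustration. With weights $w_i = \exp(f_i(x_0))$, setting $h'(x) = 0$ gives the closed-form maximiser $x = \sum_i w_i a_i c_i / \sum_i w_i a_i$.

```python
import math

# Hypothetical coefficients of f_i(x) = B[i] - A[i] * (x - C[i]) ** 2, with A[i] > 0.
A = [1.0, 0.5, 2.0]
B = [0.0, 1.0, -0.5]
C = [-1.0, 2.0, 4.0]

def f(i, x):
    return B[i] - A[i] * (x - C[i]) ** 2

def g(x):
    """The function we actually want to maximise."""
    return sum(math.exp(f(i, x)) for i in range(len(A)))

def maximise_h(x0):
    """Maximiser of h(x) = constant + sum_i f_i(x) * exp(f_i(x0)), a concave quadratic."""
    w = [math.exp(f(i, x0)) for i in range(len(A))]
    numerator = sum(wi * a * c for wi, a, c in zip(w, A, C))
    denominator = sum(wi * a for wi, a in zip(w, A))
    return numerator / denominator

x = 1.0                      # initial guess x_0 (arbitrary)
trace = [g(x)]
for _ in range(30):
    x = maximise_h(x)        # build h around the current x, then jump to its maximum
    trace.append(g(x))
```

The values in `trace` should never decrease, in line with the bound $h(x) \le g(x)$. As with EM, the point reached depends on the starting guess and may be only a local maximum of $g$.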

This also gives a clue to when to use EM: when linear combinations of the arguments to the $\exp$ function are easier to optimise. For example when they're quadratic - as happens when working with mixtures of Gaussians. This is particularly relevant to statistics where many of the standard distributions are from exponential families.

— Dan Piponi

3

As you said, I will not go into technical details. There are quite a few very nice tutorials. One of my favourites is Andrew Ng's lecture notes. Take a look also at the references here.

EM is naturally motivated in mixture models and, more generally, in models with hidden factors. Take for example the case of Gaussian mixture models (GMM). Here we model the density of the observations as a weighted sum of $K$ Gaussians:

$p(x) = \sum_{i=1}^{K}\pi_{i} \mathcal{N}(x|\mu_{i}, \Sigma_{i})$

where $\pi_{i}$ is the probability that the sample $x$ was generated by the $i$th component, $\mu_{i}$ is the mean of that component, and $\Sigma_{i}$ is its covariance matrix. The way to understand this expression is the following: each data sample has been generated by one component, but we do not know which one. The approach is then to express the uncertainty in terms of probability ($\pi_{i}$ represents the chance that the $i$th component accounts for that sample) and take the weighted sum.

As a concrete example, imagine you want to cluster text documents. The idea is to assume that each document belongs to a topic (science, sports, ...) which you do not know beforehand. The possible topics are hidden variables. You are then given a bunch of documents and, by counting n-grams or whatever features you extract, you want to find those clusters and see which cluster each document belongs to. EM is a procedure that attacks this problem step-wise: the expectation step attempts to improve the assignments of the samples obtained so far, and the maximization step improves the parameters of the mixture, in other words, the shape of the clusters.

The point is not monotonicity but convexity (the logarithm is concave): this is what Jensen's inequality requires, and it ensures that the estimates of the EM algorithm improve, or at least do not get worse, at every step.
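The generative story behind the mixture density can be sketched as follows. This is a one-dimensional simplification, so standard deviations stand in for the covariance matrices $\Sigma_i$, and the function name `sample_gmm` and all parameter values are hypothetical.

```python
import random

def sample_gmm(n, weights, means, stds, seed=0):
    """Draw n (component, value) pairs from a one-dimensional Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Hidden step: pick the component i with probability pi_i.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Observed step: draw x from the chosen Gaussian.
        samples.append((k, rng.gauss(means[k], stds[k])))
    return samples

samples = sample_gmm(5000, weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[1.0, 0.5])
observed = [x for _, x in samples]        # what a clustering method gets to see
share_first = sum(1 for k, _ in samples if k == 0) / len(samples)
```

In a real problem only `observed` is available. The component labels are the hidden variables that the expectation step reasons about, and `share_first` should come out close to the first weight.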

— jpmuc