Is PCA optimization convex?

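The objective function of principal component analysis (PCA) is to minimize the reconstruction error in the L2 norm (see section 2.12 here). Another view is that it tries to maximize the variance of the projection. There is also an excellent post here: What is the objective function of PCA?

My question is: is the PCA optimization convex? (I found some discussion here, but I hope someone on CV can provide a nice proof.)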

— Haitao Du

No, you are maximizing a convex function (under constraints).

— user603

I think you need to be specific about what you mean by "PCA optimization." One standard formulation is to maximize $x^\prime\mathbb{A}x$ subject to $x^\prime x=1$. The problem is that convexity does not even make sense there: the domain $x^\prime x=1$ is a sphere, not a Euclidean space.

— whuber

@whuber Thanks for your comment. My knowledge is limited, so I may not be able to clarify the question myself. I may wait for some answers that help me clarify it.

— Haitao Du

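I would refer you to any definition of "convex" you are familiar with. Don't they all involve some concept of points in the domain of a function lying "between" other points? That is worth remembering, because it reminds you to consider the geometry of a function's domain as well as any algebraic or analytical properties of the function's values. In that light, it occurs to me that the variance-maximizing formulation can be modified slightly to make the domain convex: simply require $x^\prime x\le1$ rather than $x^\prime x=1$. The solution is the same, and the answer becomes quite clear.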

— whuber

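No, the usual formulations of PCA are not convex problems. But they can be transformed into a convex optimization problem.

The insight and the fun of this lie in following and visualizing the sequence of transformations rather than just getting the answer: it is about the journey, not the destination. The chief steps in this journey are

Obtain a simple expression for the objective function.
Enlarge its domain, which is not convex, into one which is.
Modify the objective, which is not convex, into one which is, in a way that obviously does not change the points at which it attains its optimal values.

If you keep close watch, you can see the SVD and Lagrange multipliers lurking.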

The standard variance-maximizing formulation of PCA (or at least its key step) is

$\text{Maximize }f(x)=\ x^\prime \mathbb{A} x\ \text{ subject to }\ x^\prime x=1\tag{*}$

where the $n\times n$ matrix $\mathbb A$ is a symmetric, positive-semidefinite matrix constructed from the data (usually its sums of squares and products matrix, its covariance matrix, or its correlation matrix).
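As a rough numerical illustration of $(*)$, the sketch below builds $\mathbb A$ as the sample covariance matrix of some made-up data and compares $f$ at the top eigenvector with $f$ at many random unit vectors. The data, the seed, and the names `f`, `samples`, and `top_vector` are assumptions chosen for illustration only; they do not come from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 correlated variables.
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.3, 0.2]])
data = rng.normal(size=(200, 3)) @ mixing

# A is the sample covariance matrix: symmetric and positive-semidefinite.
A = np.cov(data, rowvar=False)

def f(x):
    """Objective of (*): the quadratic form x' A x."""
    return x @ A @ x

# eigh returns eigenvalues in ascending order; eigenvectors are columns.
vals, vecs = np.linalg.eigh(A)
top_value, top_vector = vals[-1], vecs[:, -1]

# Evaluate f at many random points of the unit sphere x'x = 1.
samples = rng.normal(size=(10_000, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
sampled_values = np.einsum("ij,jk,ik->i", samples, A, samples)

print("largest eigenvalue:  ", top_value)
print("f at top eigenvector:", f(top_vector))
print("best sampled value:  ", sampled_values.max())
```

No sampled unit vector should beat the top eigenvector, whose objective value equals the largest eigenvalue of $\mathbb A$.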

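(Equivalently, we may try to maximize the unconstrained objective $x^\prime \mathbb{A} x / x^\prime x$. Not only is this a nastier expression, since it is no longer a quadratic function, but graphing special cases will quickly show that it is not a convex function, either. Usually one observes that this function is invariant under rescalings $x\to \lambda x$ and then reduces it to the constrained formulation $(*)$.)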
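The rescaling invariance is easy to check numerically. This is a minimal sketch with a made-up $2\times 2$ matrix; `rayleigh` is a hypothetical helper name for the quotient $x^\prime \mathbb{A} x / x^\prime x$.

```python
import numpy as np

# Hypothetical symmetric positive-definite matrix.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def rayleigh(x):
    """Unconstrained objective x' A x / x' x (undefined at x = 0)."""
    return (x @ A @ x) / (x @ x)

x = np.array([0.6, -1.7])
scales = [0.01, 0.5, 1.0, 7.0, -3.0]

# The quotient takes the same value at every nonzero multiple of x.
quotients = [rayleigh(lam * x) for lam in scales]

# Rescaling x to unit length turns the quotient into the constrained objective.
unit_x = x / np.linalg.norm(x)
constrained_value = unit_x @ A @ unit_x

print(quotients)
print(constrained_value)
```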

Any optimization problem can be formulated abstractly as

Find at least one $x\in\mathcal{X}$ that makes the function $f:\mathcal{X}\to\mathbb{R}$ as large as possible.

Recall that an optimization problem is convex when it enjoys two separate properties:

The domain $\mathcal{X}\subset\mathbb{R}^n$ is convex. This can be formulated in many ways. One is that whenever $x\in\mathcal{X}$ and $y\in\mathcal{X}$ and $0 \le \lambda \le 1$, $\lambda x + (1-\lambda)y\in\mathcal{X}$ also. Geometrically: whenever the two endpoints of a line segment lie in $\mathcal X$, the entire segment lies in $\mathcal X$.
The function $f$ is convex. This also can be formulated in many ways. One is that whenever $x\in\mathcal{X}$ and $y\in\mathcal{X}$ and $0 \le \lambda \le 1$,
$f(\lambda x + (1-\lambda)y) \ge \lambda f(x) + (1-\lambda) f(y).$
(We needed $\mathcal X$ to be convex for this condition to make any sense.) Geometrically: whenever $\bar{xy}$ is any line segment in $\mathcal X$, the graph of $f$ (as restricted to this segment) lies above or on the segment connecting $(x,f(x))$ and $(y,f(y))$ in $\mathbb{R}^{n+1}$.

The archetype of a convex function is locally everywhere parabolic with non-positive leading coefficient: on any line segment it can be expressed in the form $y\to a y^2 + b y + c$ with $a \le 0.$
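The function inequality above can be spot-checked on a grid of $\lambda$ values. The sketch below uses the direction of the inequality exactly as this answer states it (a comment further down points out that many texts would call such a function concave). `satisfies_inequality` and `parabola` are hypothetical helper names, and a grid check is only a sanity test, not a proof.

```python
import numpy as np

def satisfies_inequality(f, x, y, steps=101, tol=1e-12):
    """Grid check of f(lam*x + (1-lam)*y) >= lam*f(x) + (1-lam)*f(y)."""
    lams = np.linspace(0.0, 1.0, steps)
    return all(
        f(lam * x + (1 - lam) * y) >= lam * f(x) + (1 - lam) * f(y) - tol
        for lam in lams
    )

def parabola(a, b, c):
    """Return the function t -> a t^2 + b t + c."""
    return lambda t: a * t**2 + b * t + c

opens_downward = parabola(-2.0, 1.0, 3.0)   # leading coefficient a <= 0
opens_upward = parabola(2.0, 1.0, 3.0)      # leading coefficient a > 0

print(satisfies_inequality(opens_downward, -1.0, 4.0))  # True
print(satisfies_inequality(opens_upward, -1.0, 4.0))    # False
```

For a parabola the gap between the two sides is $-a\,\lambda(1-\lambda)(x-y)^2$, which is why the sign of $a$ decides the outcome.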

A difficulty with $(*)$ is that $\mathcal X$ is the unit sphere $S^{n-1}\subset\mathbb{R}^n$, which is decidedly not convex. However, we can modify this problem by including smaller vectors. That is because when we scale $x$ by a factor $\lambda$, $f$ is multiplied by $\lambda^2$. When $0 \lt x^\prime x \lt 1$, we can scale $x$ up to unit length by multiplying it by $\lambda=1/\sqrt{x^\prime x} \gt 1$, thereby increasing $f$ (or leaving it unchanged when $f(x)=0$) while staying within the unit ball $D^n = \{x\in\mathbb{R}^n\mid x^\prime x \le 1\}$. Let us therefore reformulate $(*)$ as

$\text{Maximize }f(x)=\ x^\prime \mathbb{A} x\ \text{ subject to }\ x^\prime x\le1\tag{**}$

Its domain is $\mathcal{X}=D^n$, which clearly is convex, so we are halfway there. It remains to consider the convexity of the graph of $f$.
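Both halves of this step can be illustrated with a few lines of code: the midpoint of two antipodal unit vectors leaves the sphere but stays in the ball, and pushing an interior point out to unit length does not decrease $f$. The matrix, the seed, and the variable names below are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical symmetric positive-definite matrix.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

def f(x):
    return x @ A @ x

# The sphere is not convex: the midpoint of two antipodal unit vectors is 0.
e1 = np.array([1.0, 0.0, 0.0])
midpoint = 0.5 * e1 + 0.5 * (-e1)
midpoint_on_sphere = bool(np.isclose(midpoint @ midpoint, 1.0))
midpoint_in_ball = bool(midpoint @ midpoint <= 1.0)

# Random interior points with lengths between 0.05 and 0.95.
interior = rng.normal(size=(1000, 3))
interior *= (rng.uniform(0.05, 0.95, size=(1000, 1))
             / np.linalg.norm(interior, axis=1, keepdims=True))

# Scaling by lambda = 1/sqrt(x'x) > 1 multiplies f by lambda^2 >= 1.
never_decreases = all(f(x / np.linalg.norm(x)) >= f(x) for x in interior)

print(midpoint_on_sphere, midpoint_in_ball, never_decreases)
```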

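A good way to think about the problem $(**)$, even if you do not intend to carry out the corresponding calculations, is in terms of the Spectral Theorem. It says that by means of an orthogonal transformation $\mathbb P$, you can find at least one basis of $\mathbb{R}^n$ in which $\mathbb A$ is diagonal: that is,

$\mathbb {A = P \Sigma P^\prime}$

where all the off-diagonal entries of $\Sigma$ are zero. Such a choice of $\mathbb{P}$ can be thought of as changing nothing at all about $\mathbb A$, but merely changing how you describe it: when you rotate your point of view, the axes of the level hypersurfaces of the function $x\to x^\prime \mathbb{A} x$ (which were always ellipsoids) align with the coordinate axes.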

Since $\mathbb A$ is positive-semidefinite, all the diagonal entries of $\Sigma$ must be non-negative. We may further permute the axes (which is just another orthogonal transformation, and therefore can be absorbed into $\mathbb P$) to assure that

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0.$

If we let $x=\mathbb{P}^\prime y$ be the new coordinates (entailing $y=\mathbb{P}x$), the function $f$ is

$f(y) = y^\prime \mathbb{A} y = x^\prime \mathbb{P^\prime A P} x = x^\prime \Sigma x = \sigma_1 x_1^2 + \sigma_2 x_2^2 + \cdots + \sigma_n x_n^2.$
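The change of coordinates can be reproduced with `numpy.linalg.eigh`. This sketch takes the columns of `P` to be orthonormal eigenvectors, so that $\mathbb{P^\prime A P} = \Sigma$, which is the identity the displayed expression for $f$ relies on. The matrix `A` is made up, and the names `sigma`, `lhs`, and `rhs` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(5, 4))
A = B.T @ B                      # hypothetical symmetric PSD matrix (4 x 4)

# eigh returns ascending eigenvalues; reorder so sigma_1 >= ... >= sigma_n.
vals, vecs = np.linalg.eigh(A)
order = np.argsort(vals)[::-1]
sigma = vals[order]
P = vecs[:, order]               # orthogonal; columns are eigenvectors

# With y = P x, the quadratic form becomes a weighted sum of squares.
x = rng.normal(size=4)
y = P @ x
lhs = y @ A @ y
rhs = np.sum(sigma * x**2)

print(sigma)
print(lhs, rhs)
```

Sorting the eigenvalues is the permutation of axes mentioned above; it only reorders the columns of `P`.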

This function is decidedly not convex! Its graph looks like part of a hyperparaboloid: at every point in the interior of $\mathcal X$, the fact that all the $\sigma_i$ are nonnegative makes it curl upward rather than downward.
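A single pair of points is enough to see $f$ violate the inequality $f(\lambda x + (1-\lambda)y) \ge \lambda f(x) + (1-\lambda) f(y)$ used in this answer. The values in `sigma` and the two points are hypothetical.

```python
import numpy as np

# Hypothetical diagonal entries with sigma_1 >= sigma_2 >= sigma_3 >= 0.
sigma = np.array([3.0, 2.0, 0.5])

def f(x):
    """f in the rotated coordinates: sum of sigma_i * x_i^2."""
    return np.sum(sigma * x**2)

# Two points of the unit ball and their midpoint (lambda = 1/2).
x = np.array([0.9, 0.0, 0.0])
y = np.array([-0.9, 0.0, 0.0])
mid = 0.5 * (x + y)

value_at_midpoint = f(mid)                 # f at the origin, which is 0
average_of_values = 0.5 * (f(x) + f(y))    # 3 * 0.9**2, about 2.43

print(value_at_midpoint, average_of_values)
```

The value at the midpoint lies below the average of the endpoint values, so the graph curls upward along this segment.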

However, we can turn $(**)$ into a convex problem with one very useful technique. Knowing that the maximum will occur where $x^\prime x = 1$, let's subtract the constant $\sigma_1$ from $f$, at least for points on the boundary of $\mathcal{X}$. That will not change the locations of any points on the boundary at which $f$ is optimized, because it lowers all the values of $f$ on the boundary by the same value $\sigma_1$. This suggests examining the function

$g(y) = f(y) - \sigma_1 y^\prime y.$

This indeed subtracts the constant $\sigma_1$ from $f$ at boundary points, and subtracts smaller values at interior points. This will assure that $g$, compared to $f$, has no new global maxima on the interior of $\mathcal X$.

Let's examine what has happened with this sleight-of-hand of replacing $-\sigma_1$ by $-\sigma_1 y^\prime y$. Because $\mathbb P$ is orthogonal, $y^\prime y = x^\prime x$. (That's practically the definition of an orthogonal transformation.) Therefore, in terms of the $x$ coordinates, $g$ can be written

$g(y) = \sigma_1 x_1 ^2 + \cdots + \sigma_n x_n^2 - \sigma_1(x_1^2 + \cdots + x_n^2) = (\sigma_2-\sigma_1)x_2^2 + \cdots + (\sigma_n - \sigma_1)x_n^2.$

Because $\sigma_1 \ge \sigma_i$ for all $i$, each of the coefficients is zero or negative. Consequently, (a) $g$ is convex and (b) $g$ is optimized when $x_2=x_3=\cdots=x_n=0$. ($x^\prime x=1$ then implies $x_1=\pm 1$ and the optimum is attained when $y = \mathbb{P} (\pm 1,0,\ldots, 0)^\prime$, which is--up to sign--the first column of $\mathbb P$.)

Let's recapitulate the logic. Because $g$ is optimized on the boundary $\partial D^n=S^{n-1}$ where $y^\prime y = 1$, because $f$ differs from $g$ merely by the constant $\sigma_1$ on that boundary, and because the values of $g$ are even closer to the values of $f$ on the interior of $D^n$, the maxima of $f$ must coincide with the maxima of $g$.
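The properties of $g$ can be checked numerically. The sketch below, using a made-up matrix, confirms that $g$ is the quadratic form of $\mathbb{A} - \sigma_1 I$ (whose eigenvalues are all zero or negative), that $g \le 0$ throughout the unit ball, that the top eigenvector attains $0$, and that $f - g = \sigma_1$ on the unit sphere. One caveat worth keeping in mind: $g$ also equals $0$ at the origin and along the segment joining the two unit vectors $\pm$ the top eigenvector, so the comparison with $f$ is made on the boundary. All names and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(6, 4))
A = B.T @ B                                  # hypothetical symmetric PSD matrix

vals, vecs = np.linalg.eigh(A)               # ascending eigenvalues
sigma_1, top_vector = vals[-1], vecs[:, -1]

def f(y):
    return y @ A @ y

def g(y):
    return f(y) - sigma_1 * (y @ y)

# g is the quadratic form of A - sigma_1 * I; its eigenvalues are sigma_i - sigma_1.
shifted_eigenvalues = np.linalg.eigvalsh(A - sigma_1 * np.eye(4))

# Random points of the unit ball: unit directions times radii in [0, 1).
directions = rng.normal(size=(5000, 4))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
points = rng.uniform(0.0, 1.0, size=(5000, 1)) * directions
g_values = np.array([g(p) for p in points])

# On the unit sphere, f and g differ by exactly sigma_1.
boundary_gap = np.array([f(d) - g(d) for d in directions])

print("max shifted eigenvalue:", shifted_eigenvalues.max())
print("max of g over samples: ", g_values.max())
print("g at top eigenvector:  ", g(top_vector))
```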

— whuber

+1 Very nice. I edited to fix one formula to what I think you intended (but please check). Apart from that, I found the sentence "That won't change any boundary values at which f is optimized" to be confusing at first, because the boundary values do change: you are subtracting $\sigma_1$. Maybe it makes sense to reformulate a bit?

— amoeba says Reinstate Monica

@amoeba Right on all counts; thank you. I have amplified the discussion of that point.

— whuber

(+1) In your answer, you seem to define a convex function to be what most people would consider to be a concave function (perhaps since a convex optimization problem has a convex domain and a concave function over which a maximum is computed (or a convex function over which a minimum is computed))

— user795305

@amoeba It's a subtle argument. Note, however, that the new maxima--those of $g$--are found to occur only on the boundary. That rules out your counterexamples. Another point worth noting is that in the end we don't really care whether new local (or even global) maxima happen to show up in the interior of $\mathcal X$, because we are originally concerned only about local maxima on its boundary. We are therefore free to alter $f$ in any way that will not make any of those local boundary maxima move or disappear.

— whuber

Yes, I agree. It does not matter how $f$ is modified on the inside, if the resulting $g$ is "convex" and happens to have maxima on the boundary. Your $g$ does happen to have maxima on the boundary, and this makes the whole argument work. Makes sense.

— amoeba says Reinstate Monica

No.

Rank $k$ PCA of matrix $M$ can be formulated as

$\hat{X} = \underset{rank(X) \leq k}{argmin} \| M - X\|_F^2$

( $\|\cdot\|_F$ is Frobenius norm). For derivation see Eckart-Young theorem.

Though the norm is convex, the set over which it is optimized is nonconvex.
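For a concrete look at the rank-constrained formulation, the sketch below computes the rank-$k$ minimizer via a truncated SVD, as the Eckart-Young theorem prescribes, and compares its squared Frobenius error with the sum of the discarded squared singular values. The matrix `M`, the choice `k = 2`, and the random competitor `other` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(8, 5))      # hypothetical data matrix
k = 2

# Truncated SVD: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Squared Frobenius error equals the sum of the discarded squared singular values.
error = np.linalg.norm(M - X_hat, "fro") ** 2
discarded = np.sum(s[k:] ** 2)

# Another matrix of rank at most k (here a random one) should do no better.
other = rng.normal(size=(8, k)) @ rng.normal(size=(k, 5))
other_error = np.linalg.norm(M - other, "fro") ** 2

print(error, discarded, other_error)
```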

A convex relaxation of PCA's problem is called Convex Low Rank Approximation

$\hat{X} = \underset{\|X\|_* \leq c}{argmin} \| M - X\|_F^2$

($\|\cdot\|_*$ is the nuclear norm. It is a convex relaxation of rank, just as $\|\cdot\|_1$ is a convex relaxation of the number of nonzero elements for vectors.)
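To see why the nuclear-norm constraint gives a convex feasible set, one can spot-check the midpoint inequality that every norm satisfies. This is only a sanity check on two random matrices, not a solver for the relaxed problem; `nuclear_norm` is a hypothetical helper that sums singular values.

```python
import numpy as np

rng = np.random.default_rng(5)

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))

# Midpoint inequality for the nuclear norm (it is a norm, hence convex).
lhs = nuclear_norm(0.5 * (X + Y))
rhs = 0.5 * (nuclear_norm(X) + nuclear_norm(Y))

# If both matrices satisfy the constraint with bound c, so does their midpoint.
c = max(nuclear_norm(X), nuclear_norm(Y))
midpoint_feasible = bool(lhs <= c + 1e-12)

print(lhs, rhs, midpoint_feasible)
```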

You can see Statistical Learning with Sparsity, ch 6 (matrix decompositions) for details.

If you're interested in more general problems and how they relate to convexity, see Generalized Low Rank Models.

— Jakub Bartczuk

Disclaimer: The previous answers do a pretty good job of explaining how PCA in its original formulation is non-convex but can be converted to a convex optimization problem. My answer is only meant for those poor souls (such as me) who are not so familiar with the jargon of Unit Spheres and SVDs - which is, btw, good to know.

My source is these lecture notes by Prof. Tibshirani.

For an optimization problem to be solved with convex optimization techniques, there are two prerequisites.

The objective function has to be convex.
The constraint functions should also be convex.

Most formulations of PCA involve a constraint on the rank of a matrix.

In these types of PCA formulations, condition 2 is violated, because the constraint $\operatorname{rank}(X) = k$ is not convex. For example, let $J_{11}$ and $J_{22}$ be the $2 \times 2$ zero matrices with a single 1 in the upper-left corner and lower-right corner, respectively. Each of these has rank 1, but their average has rank 2.
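The counterexample can be verified directly. The variable names mirror $J_{11}$ and $J_{22}$ from the text; `average` is their midpoint.

```python
import numpy as np

J11 = np.array([[1.0, 0.0],
                [0.0, 0.0]])
J22 = np.array([[0.0, 0.0],
                [0.0, 1.0]])
average = 0.5 * (J11 + J22)

print(np.linalg.matrix_rank(J11))      # 1
print(np.linalg.matrix_rank(J22))      # 1
print(np.linalg.matrix_rank(average))  # 2
```

Two matrices in the set of rank-1 matrices have a midpoint outside that set, so the set is not convex.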

— kasa

Could you please explain what "$X$" refers to and why there is any constraint on its rank? This doesn't correspond with my understanding of PCA, but perhaps you are thinking of a more specialized version in which only $k$ principal components are sought.

— whuber

Yeah, $X$ is the transformed (rotated) data matrix. In this formulation, we seek matrices that are at least of rank $k$. You can refer to the link in my answer for a more accurate description.

— kasa