Why are symmetric positive definite (SPD) matrices so important?

20

I know the definition of a symmetric positive definite (SPD) matrix, but I want to understand more.

Why are they so important, intuitively?

Here is what I know. What else?

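For given data, the covariance matrix is SPD. The covariance matrix is an important metric; see this excellent post for an intuitive explanation.
The quadratic form $\frac 1 2 x^\top Ax-b^\top x +c$ is convex if $A$ is SPD. Convexity is a nice property of a function that ensures a local solution is also a global solution. For convex problems there are many good algorithms, which is not the case for non-convex problems.
When $A$ is SPD, the solution of the optimization problem $\text{minimize}~~~ \frac 1 2 x^\top Ax-b^\top x +c$ and the solution of the linear system $Ax=b$ are the same. So we can convert between two classical problems. This is important because it lets us use tricks discovered in one domain in the other. For example, we can use the conjugate gradient method to solve a linear system.
There are many good algorithms (fast, numerically stable) that work well for SPD matrices, such as the Cholesky decomposition.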
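To make the third point concrete, here is a minimal sketch of the conjugate gradient method applied to a randomly generated SPD system. The helper name `conjugate_gradient`, the test matrix, and the tolerances are assumptions made for illustration; the point is only that the minimizer of the quadratic and the solution of $Ax=b$ coincide.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for SPD A, i.e. minimize 0.5 x^T A x - b^T x."""
    n = b.shape[0]
    max_iter = max_iter or 10 * n
    x = np.zeros(n)
    r = b - A @ x              # residual = negative gradient of the quadratic
    p = r.copy()               # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # exact line search; p^T A p > 0 for SPD A
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p    # next A-conjugate direction
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)    # SPD by construction
b = rng.standard_normal(5)

x_cg = conjugate_gradient(A, b)
x_direct = np.linalg.solve(A, b)
grad_at_solution = A @ x_cg - b   # gradient of the quadratic at the CG solution
```

The gradient of the quadratic, $Ax-b$, vanishes exactly where the linear system is solved, which is why the two answers agree.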

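Edit: I am not asking for the identities of SPD matrices, but for the intuition behind their properties that shows why they matter. For example, as @Matthew Drury mentioned, if a matrix is SPD its eigenvalues are all positive real numbers, but why does it matter that they are all positive? @Matthew Drury gave a great answer to this.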

— Haitao Du
Source

7

The eigenvalues are all positive real numbers. This fact underlies many of the others.

— Matthew Drury

4

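To go a little further than @Matthew: if you choose a suitable basis, all such matrices are the same and equal to the identity matrix. In other words, there is exactly one positive-definite quadratic form in each dimension (for real vector spaces), and it is the same as Euclidean distance.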

— whuber

2

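You can find some intuition in the many elementary ways of showing that the eigenvalues of a real symmetric matrix are all real: mathoverflow.net/questions/118626/... In particular, the quadratic form $x^TAx$ occurs naturally in the Rayleigh quotient, and symmetric matrices provide a natural way of exhibiting a large family of matrices whose eigenvalues are real. See the Courant minimax theorem, for example: en.wikipedia.org/wiki/Courant_minimax_principle

— Alex R.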

4

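This seems very broad; if it didn't already have three answers, I would likely have closed it on that basis. Please offer more guidance on what you specifically want to know (asking for intuition is far too personal/individual for people to guess at in a case like this).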

— Glen_b -Reinstate Monica

1

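I am having a hard time coming up with a situation in statistics that would give rise to a matrix that is not PSD (unless you made a mistake in computing a correlation matrix, e.g. by filling it with pairwise correlations computed on data with missing values). Every square symmetric matrix I can think of is either a covariance, an information, or a projection matrix. (Elsewhere in applied math, non-PSD matrices may be the cultural norm, e.g. finite element matrices in PDEs.)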

— StasK

15

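A (real) symmetric matrix has a complete set of orthogonal eigenvectors for which the corresponding eigenvalues are all real numbers. For non-symmetric matrices this can fail. For example, a rotation in two-dimensional space has no eigenvectors or eigenvalues in the real numbers; you must pass to a vector space over the complex numbers to find them.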

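If the matrix is additionally positive definite, then these eigenvalues are all positive real numbers. This fact is much easier than the first: if $v$ is an eigenvector of unit length and $\lambda$ is the corresponding eigenvalue, then

$\lambda = \lambda v^t v = v^t A v > 0$

where the last inequality uses the definition of positive definiteness.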
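Both facts are easy to check numerically. The sketch below assumes NumPy and uses an arbitrary 90-degree rotation and a randomly generated SPD matrix; the variable names are illustrative only.

```python
import numpy as np

# A rotation by 90 degrees: not symmetric, no real eigenvalues.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
rot_eigs = np.linalg.eigvals(R)      # a complex-conjugate pair, roughly +i and -i

# An SPD matrix: real positive eigenvalues, orthonormal eigenvectors.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)              # SPD by construction
lam, V = np.linalg.eigh(A)           # eigh assumes symmetric (Hermitian) input

# lambda = v^T A v for each unit eigenvector v (columns of V)
rayleigh = np.array([V[:, i] @ A @ V[:, i] for i in range(4)])
```

Here `rayleigh` reproduces `lam`, which is the identity $\lambda = v^t A v$ used in the argument above.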

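The importance here for intuition is that the eigenvectors and eigenvalues of a linear transformation describe the coordinate system in which the transformation is most easily understood. A linear transformation can be very difficult to understand in a "natural" basis like the standard coordinate system, but each comes with a "preferred" basis of eigenvectors in which the transformation acts as a scaling in every direction. This makes the geometry of the transformation much easier to understand.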

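For example, the second derivative test for the local extrema of a function $R^2 \rightarrow R$ is often given as a series of mysterious conditions involving an entry of the second derivative matrix and some determinants. In fact, these conditions simply encode the following geometric observation:

If the matrix of second derivatives is positive definite, you are at a local minimum.
If the matrix of second derivatives is negative definite, you are at a local maximum.
Otherwise (for a nonsingular matrix), you are at neither: a saddle point.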

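You can understand this with the geometric reasoning above in an eigenbasis. The first derivative at a critical point vanishes, so the rates of change of the function here are controlled by the second derivative. Now we can reason geometrically:

In the first case there are two eigen-directions, and if you move along either one the function increases.
In the second, there are two eigen-directions, and if you move along either one the function decreases.
In the last, there are two eigen-directions, but the function increases along one of them and decreases along the other.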

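Since the eigenvectors span the whole space, any other direction is a linear combination of eigen-directions, so the rates of change in those directions are linear combinations of the rates of change in the eigen-directions. So in fact this holds in all directions (this is more or less what it means for a function defined on a higher-dimensional space to be differentiable). Now if you draw a little picture in your head, this makes a lot of sense out of something that is quite mysterious in beginner calculus texts.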
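The eigenvalue version of the second derivative test can be sketched in a few lines. The function name `classify_critical_point` and the tolerance are assumptions for illustration; a zero eigenvalue is reported as inconclusive, since the second derivative alone does not decide that case.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-12):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eigs = np.linalg.eigvalsh(hessian)
    if np.all(eigs > tol):
        return "local minimum"       # function increases along every eigen-direction
    if np.all(eigs < -tol):
        return "local maximum"       # function decreases along every eigen-direction
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"        # increases along some, decreases along others
    return "inconclusive"            # some eigenvalue is (numerically) zero

# Hessians at the origin of x^2 + y^2, -x^2 - y^2, x^2 - y^2, and x^2 + y^4
bowl    = classify_critical_point(np.array([[2.0, 0.0], [0.0, 2.0]]))
dome    = classify_critical_point(np.array([[-2.0, 0.0], [0.0, -2.0]]))
saddle  = classify_critical_point(np.array([[2.0, 0.0], [0.0, -2.0]]))
flatdir = classify_critical_point(np.array([[2.0, 0.0], [0.0, 0.0]]))
```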

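This applies directly to one of your bullet points:

The quadratic form $\frac 1 2 x^\top Ax-b^\top x +c$ is convex if $A$ is SPD. Convexity is a nice property that ensures a local solution is a global solution.

The matrix of second derivatives is $A$ everywhere, which is symmetric positive definite. Geometrically, this means that if we move away in any eigen-direction (and hence in any direction, because any other direction is a linear combination of eigen-directions), the function itself bends away above its tangent plane. This means the whole surface is convex.

— Matthew Drury
Source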

5

A graphical view:

If $\mathbf A$ is SPD, the contours of the associated quadratic form are ellipsoidal.

— J. M. is not a statistician
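One way to see the ellipsoids as spheres in different coordinates is through a Cholesky factor. The sketch below assumes NumPy and a randomly generated SPD matrix: with $A = LL^\top$ and $y = L^\top x$, the quadratic form $x^\top A x$ becomes the plain squared Euclidean length $y^\top y$.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)      # SPD by construction

L = np.linalg.cholesky(A)        # A = L L^T with L lower triangular
x = rng.standard_normal(3)
y = L.T @ x                      # change of coordinates

q_original = x @ A @ x           # quadratic form in the original coordinates
q_whitened = y @ y               # squared Euclidean norm in the new coordinates
```

So the contour $x^\top A x = 1$ is the unit sphere $\|y\| = 1$ in the new coordinates.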

7

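That characterization by @J.M. is very perceptive. In case anyone is wondering what is special about ellipsoidal contours, note that they are just perfect spheres in disguise: the units of measurement may differ along their principal axes, and the ellipsoids may be rotated with respect to the coordinates in which the data are described, but for a great many purposes, especially conceptual ones, those differences are inconsequential.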

— whuber

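That is related to my way of understanding Newton's method geometrically: best-approximate the current level set with an ellipsoid, take a coordinate system in which the ellipsoid is a circle, and move orthogonally to the circle in that coordinate system.

— Matthew Drury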

1

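If there are (active) constraints, you need to project into the Jacobian of the active constraints before doing the eigenvalue and eigen-direction spiel. If the Hessian is psd, the (any) projection will be psd, but the converse is not necessarily true, and often isn't. See my answer.

— Mark L. Stone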

10

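You can find some intuition in the many elementary ways of showing that the eigenvalues of a real symmetric matrix are all real: https://mathoverflow.net/questions/118626/real-symmetric-matrix-has-real-eigenvalues-elementary-proof/118640#118640

In particular, the quadratic form $x^TAx$ occurs naturally in the Rayleigh quotient, and symmetric matrices provide what is arguably the most natural way of exhibiting a large family of matrices whose eigenvalues are real. See the Courant minimax theorem, for example: https://en.wikipedia.org/wiki/Courant_minimax_principle

Also, symmetric, strictly positive definite matrices are the only set of matrices that can define a non-trivial inner product, along with an induced norm: $d(x,y)=\langle x,Ay\rangle=x^TAy$. This is because, by definition, for real vectors $x,y$ we need $d(x,y)=d(y,x)$ for all $x,y$ and $\|x\|^2=x^TAx>0$ for $x\neq 0$. In this way, symmetric positive definite matrices can be viewed as ideal candidates for coordinate transforms.

This latter property is absolutely key in the area of support vector machines, specifically kernel methods and the kernel trick, where the kernel must be symmetric positive definite to induce the right inner product. Indeed, Mercer's theorem generalizes the intuitive properties of symmetric matrices to function spaces.

— Alex R.
Source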
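As a small illustration of the kernel requirement, the sketch below builds the Gram matrix of a Gaussian (RBF) kernel on a few random points and checks that it is symmetric with non-negative eigenvalues. The helper `rbf_gram`, the value of `gamma`, and the data are assumptions chosen for illustration.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) for the rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clip rounding noise

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 2))   # six points in the plane
K = rbf_gram(X)
eigs = np.linalg.eigvalsh(K)      # all non-negative up to rounding
```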

9

First, form the second-order Taylor expansion of $f(x + \Delta x)$:

$f(x + \Delta x)\approx f(x) + \Delta x^T \nabla f(x)+ \frac{1}{2} \Delta x^T \nabla^2 f(x) \Delta x$

Next, we take the derivative with respect to $\Delta x$:

$f'(x + \Delta x)\approx \nabla f(x) + \nabla^2 f(x) \Delta x$

Finally, set the derivative equal to 0 and solve for $\Delta x$:

$\Delta x = -\nabla^2 f(x)^{-1} \nabla f(x)$

Assuming $\nabla^2 f(x)$ is SPD, it is easy to see that $\Delta x$ is a descent direction because:

$\nabla f(x)^T \Delta x = -\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) < 0$
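A minimal numerical sketch of this derivation, assuming a made-up strictly convex test function $f(x)=\sum_i e^{x_i} + \frac 1 2 x^T Q x$ with an SPD matrix `Q` (the function, `Q`, and the starting point are all hypothetical choices for illustration):

```python
import numpy as np

# Hypothetical test problem: f(x) = sum(exp(x)) + 0.5 * x^T Q x with Q SPD.
Q = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

def grad(x):
    return np.exp(x) + Q @ x

def hess(x):
    return np.diag(np.exp(x)) + Q    # positive diagonal plus SPD matrix: SPD

x = np.array([1.0, -2.0, 0.5])
g = grad(x)
H = hess(x)

# Cholesky succeeds only for a positive definite matrix, so it doubles as a check.
L = np.linalg.cholesky(H)
# Solve H dx = -g through the two systems L z = -g and L^T dx = z.
# (np.linalg.solve is a general solver; a triangular solver would be cheaper.)
z = np.linalg.solve(L, -g)
dx = np.linalg.solve(L.T, z)

slope = g @ dx    # directional derivative of f along the Newton step
```

With an SPD Hessian and a nonzero gradient, `slope` is negative, which is exactly the descent condition above.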

When using Newton's method, non-SPD Hessian matrices are typically "nudged" to be SPD. There's a neat algorithm called modified Cholesky that will detect a non-SPD Hessian, "nudge" it appropriately in the right direction and factorize the result, all for (essentially) the same cost as a Cholesky factorization. Quasi-Newton methods avoid this problem by forcing the approximate Hessian to be SPD.
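The sketch below is not the modified Cholesky algorithm itself. It is a simpler stand-in that illustrates the same "nudging" idea by adding an increasing multiple of the identity until the Cholesky factorization succeeds, at the cost of several factorization attempts. The function name `cholesky_with_shift` and the parameters `beta` and `max_tries` are assumptions for illustration.

```python
import numpy as np

def cholesky_with_shift(H, beta=1e-3, max_tries=60):
    """Return (L, tau) with L L^T = H + tau * I, growing tau until Cholesky works."""
    min_diag = np.min(np.diag(H))
    tau = 0.0 if min_diag > 0 else beta - min_diag
    identity = np.eye(H.shape[0])
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(H + tau * identity), tau
        except np.linalg.LinAlgError:      # raised when the matrix is not positive definite
            tau = max(2.0 * tau, beta)     # increase the shift and retry
    raise RuntimeError("could not make the matrix positive definite")

H = np.array([[1.0, 2.0],
              [2.0, 1.0]])                 # eigenvalues 3 and -1: indefinite
L, tau = cholesky_with_shift(H)
H_shifted = H + tau * np.eye(2)
```

Since the smallest eigenvalue of this `H` is -1, the loop stops at the first shift larger than 1.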

As an aside, symmetric indefinite systems are receiving a lot of attention these days. They come up in the context of interior point methods for constrained optimization.

— Bill Woessner
Source

Thank you very much for the great answer. I understand that a descent direction is important in line search methods. Is a descent direction also important in trust region methods?

— Haitao Du

1

It is still important for trust region methods. Trust region methods basically work by bounding the step size FIRST and then solving for the step direction. If the step does not achieve the desired decrease in objective function value, you reduce the bound on the step size and start over. Imagine that your algorithm for generating the step direction does not guarantee that the step direction is a descent direction. Even as the radius of the trust region goes to 0, you may never generate an acceptable step (even if one exists) because none of your step directions are descent directions.

— Bill Woessner

Line search methods basically exhibit the same behavior. If your search direction is not a descent direction, the line search algorithm may never find an acceptable step length - because there isn't one. :-)

— Bill Woessner

Great answer, thank you for helping me to connect the pieces.

— Haitao Du

9

Geometrically, a positive definite matrix defines a metric, for instance a Riemannian metric, so we can immediately use geometric concepts.

If $x$ and $y$ are vectors and $A$ a positive definite matrix, then

$d(x,y) = \sqrt{(x-y)^T A (x-y)}$

is a metric (also called a distance function).

In addition, positive definite matrices are related to inner products: in $\mathbb{R}^n$, we can define an inner product by

$\langle x,y \rangle = x^T A y$

where $A$ as above is positive definite. Moreover, all inner products on $\mathbb{R}^n$ arise in this way.
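A quick numerical sanity check of the metric properties, assuming NumPy and an arbitrary small SPD matrix (the function name `spd_distance` and the test vectors are illustrative):

```python
import numpy as np

def spd_distance(x, y, A):
    """Distance induced by an SPD matrix A: sqrt((x - y)^T A (x - y))."""
    d = x - y
    return np.sqrt(d @ A @ d)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])       # symmetric with positive trace and determinant: SPD
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([2.0, 3.0])

d_xy = spd_distance(x, y, A)
d_yx = spd_distance(y, x, A)
d_xz = spd_distance(x, z, A)
d_yz = spd_distance(y, z, A)
d_euclid = spd_distance(x, y, np.eye(2))   # A = I gives the usual distance
```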

— kjetil b halvorsen
Source

1

...and of course the usual distance has $\mathbf A=\mathbf I$...

— J. M. is not a statistician

6

There are already several answers explaining why symmetric positive definite matrices are so important, so I will provide an answer explaining why they are not as important as some people, including the authors of some of those answers, think. For the sake of simplicity, I will limit focus to symmetric matrices, and concentrate on Hessians and optimization.

If God had made the world convex, there wouldn't be convex optimization, there would just be optimization. Similarly, there wouldn't be (symmetric) positive definite matrices, there would just be (symmetric) matrices. But that's not the case, so deal with it.

If a Quadratic Programming problem is convex, it can be solved "easily". If it is non-convex, a global optimum can still be found using branch and bound methods (but it may take longer and more memory).

If a Newton method is used for optimization and the Hessian at some iterate is indefinite, then it is not necessary to "finagle" it to positive definiteness. If using a line search, directions of negative curvature can be found and the line search executed along them, and if using a trust region, then there is some small enough trust region such that the solution of the trust region problem achieves descent.
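As a rough sketch of what a direction of negative curvature looks like, the example below takes a hypothetical indefinite Hessian `H` and gradient `g`, picks the eigenvector of the most negative eigenvalue, and orients it so that it is not an ascent direction. This only illustrates the idea; it is not a full line search method.

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -1.0]])      # indefinite Hessian (hypothetical)
g = np.array([1.0, 1.0])         # gradient at the current iterate (hypothetical)

lam, V = np.linalg.eigh(H)       # eigenvalues in ascending order
d = V[:, 0]                      # eigenvector of the most negative eigenvalue
if g @ d > 0:
    d = -d                       # flip so that g^T d <= 0

curvature = d @ H @ d            # negative: the quadratic model bends downward along d

def model_change(t):
    """Change in the local quadratic model after a step of length t along d."""
    return t * (g @ d) + 0.5 * t**2 * curvature
```

Along such a direction the quadratic model keeps decreasing as the step grows, which is why a line search along it can make progress without forcing the Hessian to be positive definite.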

As for Quasi-Newton methods, BFGS (damped if the problem is constrained) and DFP maintain positive definiteness of the Hessian or inverse Hessian approximation. Other Quasi-Newton methods, such as SR1 (Symmetric Rank One) do not necessarily maintain positive definiteness. Before you get all bent out of shape over that, that is a good reason for choosing SR1 for many problems - if the Hessian really isn't positive definite along the path to the optimum, then forcing the Quasi-Newton approximation to be positive definite may result in a lousy quadratic approximation to the objective function. By contrast, the SR1 updating method is "loose as a goose", and can writhely morph its definiteness as it proceeds along.

For nonlinearly constrained optimization problems, what really matters is not the Hessian of the objective function, but the Hessian of the Lagrangian. The Hessian of the Lagrangian may be indefinite even at an (the) optimum, and indeed, it is only the projection of the Hessian of the Lagrangian into the nullspace of the Jacobian of the active (linear and nonlinear) constraints which need be positive semi-definite at the optimum. If you model the Hessian of the Lagrangian via BFGS and thereby constrain it to be positive definite, it might be a terrible fit everywhere, and not work well. By contrast, SR1 can adapt its eigenvalues to what it actually "sees".

There's much more that I could say about all of this, but this is enough to give you a flavor.

Edit: What I wrote 2 paragraphs up is correct. However, I forgot to point out that it also applies to linearly constrained problems. In the case of linearly constrained problems, the Hessian of the Lagrangian is just (reduces down to) the Hessian of the objective function. So the 2nd order optimality condition for a local minimum is that the projection of the Hessian of the objective function into the nullspace of the Jacobian of the active constraints is positive semi-definite. Most notably, the Hessian of the objective function need not (necessarily) be psd at the optimum, and often isn't, even on linearly constrained problems.
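A tiny illustration of this point, using a made-up problem: minimize $x_1^2 - x_2^2$ subject to the linear constraint $x_2 = 0$. The Hessian is indefinite, yet its projection into the nullspace of the constraint Jacobian is positive definite at the optimum $x = 0$. The helper `null_space_basis` is an assumption for illustration, built on the SVD.

```python
import numpy as np

def null_space_basis(J, tol=1e-12):
    """Orthonormal basis (as columns) for the nullspace of J, via the SVD."""
    _, s, Vt = np.linalg.svd(J)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])      # Hessian of x1^2 - x2^2: indefinite
J = np.array([[0.0, 1.0]])       # Jacobian of the active constraint x2 = 0

Z = null_space_basis(J)          # feasible directions: multiples of (1, 0)
reduced_hessian = Z.T @ H @ Z    # projected Hessian, here the 1x1 matrix [[2]]
```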

— Mark L. Stone
Source

"Who's Afraid of Non-Convex Loss Functions?" ... not @MarkL.Stone

— GeoMatt22

@GeoMatt22 You bet your @$$ I'm not. On the other hand, if you are going to create (choose) a loss function, there's no need to make it non-convex when it serves no good purpose other than show-boating. Discretion is the better part of valor.

— Mark L. Stone

@Mark L. Stone: This is interesting! Can you give reference to some literature where I can read about such things?

— kjetil b halvorsen

@kjetil b halvorsen: Line search with directions of negative curvature: folk.uib.no/ssu029/Pdf_file/Curvilinear/More79.pdf . Trust regions are covered in many books and papers. A well-known book with a good intro to trust regions is amazon.com/… . A monster book, somewhat out of date now, is epubs.siam.org/doi/book/10.1137/1.9780898719857 . As for my last paragraph about optimality conditions, read up on 2nd order KKT conditions.

— Mark L. Stone

@kjetil b halvorsen I didn't address finding the global optimum of a non-convex Quadratic Program. Widely available software, such as CPLEX, can do this, see ibm.com/support/knowledgecenter/SS9UKU_12.6.1/… . Of course it is not always fast, and may need some memory. I've solved to global optimality some QP minimization problems with tens of thousands of variables which had several hundred significant-magnitude negative eigenvalues.

— Mark L. Stone

5

You already cited a bunch of reasons why SPD are important yet you still posted the question. So, it seems to me that you need to answer this question first: Why do positive quantities matter?

My answer is that some quantities ought to be positive in order to reconcile with our experiences or models. For instance, the distances between items in the space have to be positive. The coordinates can be negative, but the distances are always non-negative. Hence, if you have a data set and some algorithm that processes it you may well end up with one that breaks down when you feed a negative distance into it. So, you say "my algorithm requires positive distance inputs at all times", and it wouldn't sound like an unreasonable demand.

In the context of statistics, a better analogy would be the variance. So, we calculate the variance as

$\sum_i (x_i-\mu)^2/n$

It's obvious from the definition that if you feed the real numbers $x_i$ into the equation, the output is always non-negative. Hence, you may build algorithms that work with non-negative numbers, and they may be more efficient than algorithms without this restriction. That's the reason we use them.

So, variance-covariance matrices are positive semi-definite, i.e. "non-negative" in this analogy. The example of an algorithm that requires this condition is Cholesky decomposition, it's very handy. It's often called a "square root of the matrix". So, like the square root of a real number that requires non-negativity, Cholesky wants non-negative matrices. We don't find this constraining when dealing with covariance matrices because they always are.
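To illustrate the "square root" analogy, here is a minimal sketch that uses a Cholesky factor of a made-up covariance matrix `Sigma` to turn independent standard normal draws into correlated ones. One caveat, as far as NumPy is concerned: `np.linalg.cholesky` requires a strictly positive definite matrix, so a singular (merely semi-definite) covariance matrix would need a different factorization.

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])           # hypothetical covariance matrix (SPD)

L = np.linalg.cholesky(Sigma)            # "square root": Sigma = L L^T

rng = np.random.default_rng(4)
Z = rng.standard_normal((2, 200_000))    # independent standard normal draws
X = L @ Z                                # draws with covariance approximately Sigma

sample_cov = np.cov(X)                   # rows are variables, columns are samples
```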

So, that's my utilitarian answer. Constraints such as non-negativity or SPD allow us to build more efficient calculation algorithms or more convenient modeling tools, which are available when your inputs satisfy these constraints.

— Aksakal
Source

3

Here are two more reasons which haven't been mentioned for why positive-semidefinite matrices are important:

The graph Laplacian matrix is diagonally dominant and thus PSD.
Positive semidefiniteness defines a partial order on the set of symmetric matrices (this is the foundation of semidefinite programming).
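For the first point, a small check on an arbitrary example graph (the edge list and the vector `x` are made up for illustration): the Laplacian's quadratic form equals a sum of squared differences across edges, so it can never be negative.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # a small connected graph
n = 4

Adj = np.zeros((n, n))
for i, j in edges:
    Adj[i, j] = Adj[j, i] = 1.0

Lap = np.diag(Adj.sum(axis=1)) - Adj     # graph Laplacian: degree matrix minus adjacency

x = np.array([1.0, -2.0, 0.5, 3.0])
quad = x @ Lap @ x                                   # quadratic form x^T L x
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in edges) # sum of squared edge differences

eigs = np.linalg.eigvalsh(Lap)           # smallest eigenvalue is 0 (constant vector)
```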

— Thoth
Source