Question about how to normalize regression coefficients

I am not sure whether "normalize" is the right word to use here, but I will do my best to illustrate what I am trying to ask. The estimator used here is least squares.

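Suppose you have $y=\beta_0+\beta_1x_1$. You can center it around the mean by writing $y=\beta_0'+\beta_1x_1'$, where $\beta_0'=\beta_0+\beta_1\bar x_1$ and $x_1'=x_1-\bar x_1$, so that $\beta_0'$ no longer has any influence on estimating $\beta_1$.

By this I mean that $\hat\beta_1$ in $y=\beta_1x_1'$ is equivalent to $\hat\beta_1$ in $y=\beta_0+\beta_1x_1$. We have reduced the equation to make the least squares calculation easier.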
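A quick numerical check of this claim, as a minimal sketch in R. The data are made up, and the names `x.c`, `b1.full`, `b1.centered` and `b1.direct` are illustrative only; nothing here comes from the book.

```r
# Simulated data; the intercept 2 and slope 3 are arbitrary illustration values.
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30)

x.c <- x - mean(x)                               # mean-centered predictor x'

b1.full     <- unname(coef(lm(y ~ x))["x"])      # slope in the model with an intercept
b1.centered <- unname(coef(lm(y ~ x.c - 1)))     # slope through the origin on x'
b1.direct   <- sum(x.c * y) / sum(x.c^2)         # closed form for that same slope

print(c(b1.full, b1.centered, b1.direct))        # the three estimates agree
```

The slope estimates coincide because the centered predictor sums to zero. The residuals of the no-intercept fit are a different matter unless $y$ is centered as well.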

How do you apply this method in general? Now I have the model $y=\beta_1e^{x_1t}+\beta_2e^{x_2t}$, and I am trying to reduce it to $y=\beta_1x'$.

— Saber CN

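What kind of data are you analyzing, and why do you want to remove a covariate, $e^{x_1t}$, from your model? Also, is there a reason you are removing the intercept? If you mean-center the data, the slope will be the same in the model with or without the intercept, but the model with the intercept will fit your data better.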

— caburke

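@caburke I am not worried about the fit of the model, because after I calculate $\beta_1$ and $\beta_2$ I can put them back into the model. The point of this exercise is to estimate $\beta_1$. Reducing the original equation to just $y=\beta_1x'$ makes the least squares calculation easier ($x'$ is part of what I am trying to find out; it may include $e^{x_1t}$). I am trying to learn the mechanics; this is a question from Tukey's book.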

— Saber CN

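@ca The observation at the end of your comment is puzzling. It cannot apply to the nonlinear expressions, which contain nothing that can reasonably be considered a "slope", but it is not correct in the OLS setting either: the fit to the mean-centered data is just as good as the fit with an intercept. Saber, your model is ambiguous: which of $\beta_1, \beta_2, x_1, x_2, t$ are variables and which are parameters? What is the intended error structure? (And which of Tukey's books is the question from?)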

— whuber

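@whuber This is from Chapter 14A of Tukey's book "Data Analysis and Regression: A Second Course in Statistics". $\beta_1,\beta_2$ are the parameters we are trying to estimate, $x_1,x_2$ are variables with n observations each, and $t$, I assume, is the time variable associated with the observations, although that was not specified. The errors should be normal and can be ignored for this question.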

— Saber CN

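@whuber I was mainly referring to the first part of the post, but that was not clear in my comment. What I meant was that if you mean-center only $x$ and not $y$, as seemed to be suggested in the OP, and then remove the intercept, the fit will be worse, because it is not necessarily the case that $\bar{y}=0$. "Slope" is obviously not a good term for the coefficients of the model in the last line of the OP.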

— caburke

I cannot do justice to the question here, since that would require a small monograph, but it may be helpful to recapitulate some key ideas.

The question

Let's begin by restating the question in unambiguous terms. The data consist of a list of ordered pairs $(t_i, y_i)$. Known constants $\alpha_1$ and $\alpha_2$ determine the values $x_{1,i} = \exp(\alpha_1 t_i)$ and $x_{2,i} = \exp(\alpha_2 t_i)$. We posit a model

$y_i = \beta_1 x_{1,i} + \beta_2 x_{2,i} + \varepsilon_i$

for constants $\beta_1$ and $\beta_2$ to be estimated, in which the $\varepsilon_i$ are random and, to a good approximation anyway, independent with a common variance (whose estimation is also of interest).

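Background: linear "matching"

Mosteller and Tukey refer to the variables $x_1 = (x_{1,1}, x_{1,2}, \ldots)$ and $x_2$ as "matchers." They will be used to "match" the values of $y = (y_1, y_2, \ldots)$ in a specific way. More generally, let $y$ and $x$ be any two vectors in the same Euclidean vector space, with $y$ playing the role of "target" and $x$ that of "matcher". We contemplate systematically varying a coefficient $\lambda$ in order to approximate $y$ by the multiple $\lambda x$. The best approximation is obtained when $\lambda x$ is as close to $y$ as possible. Equivalently, the squared length of $y - \lambda x$ is minimized.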
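To see where the best coefficient comes from, the squared length can be expanded as a quadratic in $\lambda$. This is the standard least-squares calculation, written in the notation of this section and assuming $x \ne 0$:

$\begin{aligned} \|y - \lambda x\|^2 &= y\cdot y - 2\lambda\,(x\cdot y) + \lambda^2\,(x\cdot x), \\ \frac{d}{d\lambda}\|y - \lambda x\|^2 &= -2\,(x\cdot y) + 2\lambda\,(x\cdot x) = 0 \quad\Longrightarrow\quad \hat\lambda = \frac{x\cdot y}{x\cdot x}. \end{aligned}$

Because the coefficient of $\lambda^2$ is positive, this stationary point is the minimum.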

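One way to visualize this matching process is to make a scatterplot of $x$ and $y$ on which the graph of $x \to \lambda x$ is drawn. The vertical distances between the scatterplot points and this graph are the components of the residual vector $y - \lambda x$; the sum of their squares is to be made as small as possible. Up to a constant of proportionality, these squares are the areas of circles centered at the points $(x_i, y_i)$ with radii equal to the residuals: we wish to minimize the sum of the areas of all these circles.

Here is an example showing the optimal value of $\lambda$ in the middle panel:

[Figure: Panel]

The points in the scatterplot are blue; the graph of $x \to \lambda x$ is a red line. This illustration emphasizes that the red line is constrained to pass through the origin $(0,0)$: it is a very special case of line fitting.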
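A single match can be checked numerically. The sketch below uses made-up data and illustrative names (`sq.length`, `lambda.closed`, `lambda.lm`, `lambda.search`). It compares three routes to the best $\lambda$: the closed form $(x\cdot y)/(x\cdot x)$, which is a standard least-squares fact, a regression through the origin, and a brute-force one-dimensional search.

```r
set.seed(2)
x <- rnorm(20)                        # the matcher
y <- 1.5 * x + rnorm(20)              # the target (1.5 is an arbitrary choice)

# Squared length of the residual vector y - lambda * x.
sq.length <- function(lambda) sum((y - lambda * x)^2)

lambda.closed <- sum(x * y) / sum(x * x)             # closed form
lambda.lm     <- unname(coef(lm(y ~ x - 1)))         # line forced through (0,0)
lambda.search <- optimize(sq.length, interval = c(-10, 10),
                          tol = 1e-8)$minimum        # numerical minimization

print(c(lambda.closed, lambda.lm, lambda.search))
```

The `- 1` in the formula is what forces the fitted line through the origin, matching the constraint on the red line.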

Multiple regression can be obtained by sequential matching

Returning to the setting of the question, we have one target $y$ and two matchers $x_1$ and $x_2$. Again we seek numbers $b_1$ and $b_2$ for which $y$ is approximated as closely as possible by $b_1 x_1 + b_2 x_2$. Arbitrarily beginning with $x_1$, Mosteller & Tukey match the remaining variables $x_2$ and $y$ to $x_1$. Write the residuals for these matches as $x_{2\cdot 1}$ and $y_{\cdot 1}$, respectively: the $_{\cdot 1}$ indicates that $x_1$ has been "taken out of" the variable.

We can write

$y = \lambda_1 x_1 + y_{\cdot 1}\text{ and }x_2 = \lambda_2 x_1 + x_{2\cdot 1}.$

Having taken $x_1$ out of $x_2$ and $y$, we proceed to match the target residuals $y_{\cdot 1}$ to the matcher residuals $x_{2\cdot 1}$. The final residuals are $y_{\cdot 12}$. Algebraically, we have written

$\begin{aligned} y_{\cdot 1} &= \lambda_3 x_{2\cdot 1} + y_{\cdot 12}; \text{ whence} \\ y &= \lambda_1 x_1 + y_{\cdot 1} = \lambda_1 x_1 + \lambda_3 x_{2\cdot 1} + y_{\cdot 12} = \lambda_1 x_1 + \lambda_3 \left(x_2 - \lambda_2 x_1\right) + y_{\cdot 12} \\ &= \left(\lambda_1 - \lambda_3 \lambda_2\right)x_1 + \lambda_3 x_2 + y_{\cdot 12}. \end{aligned}$

This shows that the $\lambda_3$ in the last step is the coefficient of $x_2$ in a matching of $x_1$ and $x_2$ to $y$.

We could just as well have proceeded by first taking $x_2$ out of $x_1$ and $y$, producing $x_{1\cdot 2}$ and $y_{\cdot 2}$, and then taking $x_{1\cdot 2}$ out of $y_{\cdot 2}$, yielding a different set of residuals $y_{\cdot 21}$. This time, the coefficient of $x_1$ found in the last step (let's call it $\mu_3$) is the coefficient of $x_1$ in a matching of $x_1$ and $x_2$ to $y$.

Finally, for comparison, we might run a multiple (ordinary least squares) regression of $y$ against $x_1$ and $x_2$. Let those residuals be $y_{\cdot lm}$. It turns out that the coefficients in this multiple regression are precisely the coefficients $\mu_3$ and $\lambda_3$ found previously and that all three sets of residuals, $y_{\cdot 12}$, $y_{\cdot 21}$, and $y_{\cdot lm}$, are identical.
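These claims can be checked with a few lines of R. The following is a stripped-down sketch, not the code used for the figures: the helper `match1` and the object names are illustrative, and the simulated data simply borrow the exponents and coefficients from the simulation in the Code section.

```r
set.seed(3)
t.var <- 1:40
x1 <- exp(-0.1 * t.var)                          # the two matchers
x2 <- exp(0.025 * t.var)
y  <- 5 * x1 - x2 + rnorm(40, sd = 0.5)          # the target

# One matching step: coefficient and residuals of target on matcher (no intercept).
match1 <- function(target, matcher) {
  lambda <- sum(matcher * target) / sum(matcher * matcher)
  list(lambda = lambda, resid = target - lambda * matcher)
}

# Order 1: take x1 out of y and x2, then match the residuals.
m.y.1  <- match1(y,  x1)                         # lambda_1 and y.1
m.x2.1 <- match1(x2, x1)                         # lambda_2 and x2.1
m.y.12 <- match1(m.y.1$resid, m.x2.1$resid)      # lambda_3 and y.12

# Order 2: take x2 out first.
m.y.2  <- match1(y,  x2)
m.x1.2 <- match1(x1, x2)
m.y.21 <- match1(m.y.2$resid, m.x1.2$resid)      # mu_3 and y.21

fit <- lm(y ~ x1 + x2 - 1)                       # multiple regression for comparison

b.seq <- c(x1 = m.y.1$lambda - m.y.12$lambda * m.x2.1$lambda,  # lambda_1 - lambda_3 * lambda_2
           x2 = m.y.12$lambda)                                 # lambda_3
print(rbind(sequential = b.seq, lm = coef(fit)))
print(c(mu_3 = m.y.21$lambda))                   # matches the x1 coefficient
print(max(abs(m.y.12$resid - resid(fit))))       # zero up to rounding
print(max(abs(m.y.21$resid - resid(fit))))       # zero up to rounding
```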

Depicting the process

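None of this is new: it is all in the text. I would like to offer a pictorial analysis, using a scatterplot matrix of everything we have obtained so far.

[Figure: Scatterplot]

Because these data are simulated, we have the luxury of showing the underlying "true" values of $y$ in the last row and column: these are the values $\beta_1 x_1 + \beta_2 x_2$ without the added error.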

The scatterplots below the diagonal have been decorated with the graphs of the matchers, exactly as in the first figure. Graphs with zero slopes are drawn in red: these indicate situations where the matcher gives us nothing new; the residuals are the same as the target. Also, for reference, the origin (wherever it appears within a plot) is shown as an open red circle: recall that all possible matching lines have to pass through this point.

Much can be learned about regression through studying this plot. Some of the highlights are:

The matching of $x_2$ to $x_1$ (row 2, column 1) is poor. This is a good thing: it indicates that $x_1$ and $x_2$ are providing very different information; using both together will likely be a much better fit to $y$ than using either one alone.
Once a variable has been taken out of a target, it does no good to try to take that variable out again: the best matching line will be zero. See the scatterplots for $x_{2\cdot 1}$ versus $x_1$ or $y_{\cdot 1}$ versus $x_1$, for instance.
The values $x_1$, $x_2$, $x_{1\cdot 2}$, and $x_{2\cdot 1}$ have all been taken out of $y_{\cdot lm}$.
Multiple regression of $y$ against $x_1$ and $x_2$ can be achieved first by computing $y_{\cdot 1}$ and $x_{2\cdot 1}$. These scatterplots appear at (row, column) = $(8,1)$ and $(2,1)$, respectively. With these residuals in hand, we look at their scatterplot at $(4,3)$. These three one-variable regressions do the trick. As Mosteller & Tukey explain, the standard errors of the coefficients can be obtained almost as easily from these regressions, too--but that's not the topic of this question, so I will stop here.
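The second and third highlights are easy to confirm without the plot. This is a small sketch with freshly simulated data; `slope` is an illustrative helper, and `take.out` is defined the same way as in the Code section.

```r
set.seed(4)
t.var <- 1:40
x1 <- exp(-0.1 * t.var)
x2 <- exp(0.025 * t.var)
y  <- 5 * x1 - x2 + rnorm(40, sd = 0.5)

take.out <- function(y, x) resid(lm(y ~ x - 1))         # residuals after matching y to x
slope    <- function(y, x) unname(coef(lm(y ~ x - 1)))  # coefficient of that match

x2.1 <- take.out(x2, x1)
y.1  <- take.out(y, x1)
y.lm <- resid(lm(y ~ x1 + x2 - 1))

# Taking x1 out a second time finds nothing: both slopes are zero up to rounding.
print(c(slope(x2.1, x1), slope(y.1, x1)))

# x1, x2 and the partial x2.1 have all been taken out of the final residuals.
print(c(slope(y.lm, x1), slope(y.lm, x2), slope(y.lm, x2.1)))
```

A zero slope here is the same statement as orthogonality: the residuals have zero inner product with the matcher that was removed.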

Code

These data were (reproducibly) created in R with a simulation. The analyses, checks, and plots were also produced with R. This is the code.

#
# Simulate the data.
#
set.seed(17)
t.var <- 1:50                                    # The "times" t[i]
x <- exp(t.var %o% c(x1=-0.1, x2=0.025) )        # The two "matchers" x[1,] and x[2,]
beta <- c(5, -1)                                 # The (unknown) coefficients
sigma <- 1/2                                     # Standard deviation of the errors
error <- sigma * rnorm(length(t.var))            # Simulated errors
y <- (y.true <- as.vector(x %*% beta)) + error   # True and simulated y values
data <- data.frame(t.var, x, y, y.true)

par(col="Black", bty="o", lty=0, pch=1)
pairs(data)                                      # Get a close look at the data
#
# Take out the various matchers.
#
take.out <- function(y, x) {fit <- lm(y ~ x - 1); resid(fit)}
data <- transform(transform(data, 
  x2.1 = take.out(x2, x1),
  y.1 = take.out(y, x1),
  x1.2 = take.out(x1, x2),
  y.2 = take.out(y, x2)
), 
  y.21 = take.out(y.2, x1.2),
  y.12 = take.out(y.1, x2.1)
)
data$y.lm <- resid(lm(y ~ x - 1))               # Multiple regression for comparison
#
# Analysis.
#
# Reorder the dataframe (for presentation):
data <- data[c(1:3, 5:12, 4)]

# Confirm that the three ways to obtain the fit are the same:
pairs(subset(data, select=c(y.12, y.21, y.lm)))

# Explore what happened:
panel.lm <- function (x, y, col=par("col"), bg=NA, pch=par("pch"),
   cex=1, col.smooth="red",  ...) {
  box(col="Gray", bty="o")
  ok <- is.finite(x) & is.finite(y)
  if (any(ok))  {
    b <- coef(lm(y[ok] ~ x[ok] - 1))
    col0 <- ifelse(abs(b) < 10^-8, "Red", "Blue")
    lwd0 <- ifelse(abs(b) < 10^-8, 3, 2)
    abline(c(0, b), col=col0, lwd=lwd0)
  }
  points(x, y, pch = pch, col="Black", bg = bg, cex = cex)    
  points(matrix(c(0,0), nrow=1), col="Red", pch=1)
}
panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y,  ...)
}
par(lty=1, pch=19, col="Gray")
pairs(subset(data, select=c(-t.var, -y.12, -y.21)), col="Gray", cex=0.8, 
   lower.panel=panel.lm, diag.panel=panel.hist)

# Additional interesting plots:
par(col="Black", pch=1)
#pairs(subset(data, select=c(-t.var, -x1.2, -y.2, -y.21)))
#pairs(subset(data, select=c(-t.var, -x1, -x2)))
#pairs(subset(data, select=c(x2.1, y.1, y.12)))

# Details of the variances, showing how to obtain multiple regression
# standard errors from the OLS matches.
norm <- function(x) sqrt(sum(x * x))
lapply(data, norm)
s <- summary(lm(y ~ x1 + x2 - 1, data=data))
c(s$sigma, s$coefficients["x1", "Std. Error"] * norm(data$x1.2)) # Equal
c(s$sigma, s$coefficients["x2", "Std. Error"] * norm(data$x2.1)) # Equal
c(s$sigma, norm(data$y.12) / sqrt(length(data$y.12) - 2))        # Equal

— whuber

Could multiple regression of $y$ against $x_1$ and $x_2$ still be achieved by first computing $y_{.1}$ and $x_{2.1}$ if $x_1$ and $x_2$ were correlated? Wouldn't it then make a big difference whether we sequentially regressed $y$ on $x_1$ and $x_{2.1}$ or on $x_2$ and $x_{1.2}$? How does this relate to one regression equation with multiple explanatory variables?

— miura

@miura, One of the leitmotifs of that chapter in Mosteller & Tukey is that when the $x_i$ are correlated, the partials $x_{i\cdot j}$ have low variances; because their variances appear in the denominator of a formula for the estimation variance of their coefficients, this implies the corresponding coefficients will have relatively uncertain estimates. That's a fact of the data, M&T say, and you need to recognize that. It makes no difference whether you start the regression with $x_1$ or $x_2$: compare y.21 to y.12 in my code.
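A rough illustration of that point, as a sketch under assumptions: the data are simulated, `se.x2` is a hypothetical helper, and the correlation levels 0, 0.9 and 0.99 are arbitrary. It relies on the identity used at the end of the answer's code, namely that the standard error of a coefficient equals the residual standard deviation divided by the length of the corresponding partial.

```r
set.seed(5)
n  <- 100
x1 <- rnorm(n)

se.x2 <- function(rho) {
  # Build x2 so that its correlation with x1 is roughly rho.
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y  <- x1 + x2 + rnorm(n)
  x2.1 <- resid(lm(x2 ~ x1 - 1))                 # x2 with x1 taken out
  s <- summary(lm(y ~ x1 + x2 - 1))
  c(rho = rho,
    norm.x2.1 = sqrt(sum(x2.1^2)),               # length of the partial
    se.x2 = s$coefficients["x2", "Std. Error"],  # reported standard error
    sigma.over.norm = s$sigma / sqrt(sum(x2.1^2)))
}

res <- t(sapply(c(0, 0.9, 0.99), se.x2))
print(res)
```

As the correlation grows, the partial gets shorter and the standard error of the `x2` coefficient grows in proportion, while the last two columns stay equal to each other.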

— whuber

I came across this today; here is what I think about the question by @miura. Think of a two-dimensional space where y is to be projected as a combination of two vectors: y = a x1 + b x2 + res (with res = 0). Now think of y as a combination of three variables, y = a x1 + b x2 + c x3, where x3 = m x1 + n x2. Then the order in which you choose your variables is certainly going to affect the coefficients, because the minimum error can be obtained by various combinations. In a few examples, however, the minimum error can be obtained by only one combination, and that is where the order will not matter.

— Gaurav Singhal

@whuber Can you elaborate on how this approach might be used for a multivariate regression that also has a constant term, i.e. y = B1 * x1 + B2 * x2 + c? It is not clear to me how the constant term can be derived. Also, I understand in general what was done for the 2 variables, enough at least to replicate it in Excel. How can that be expanded to 3 variables x1, x2, x3? It seems clear that we would need to remove x3 first from y, x1, and x2, then remove x2 from x1 and y. But it is not clear to me how to then get the B3 term.

— Fairly Nerdy

I have answered some of the questions I asked in the comment above. For a 3-variable regression, we would have 6 steps. Remove x1 from x2, from x3, and from y. Then remove x2,1 from x3,1 and from y1. Then remove x3,21 from y21. That results in 6 equations, each of which is of the form variable = lambda * different variable + residual. One of those equations has y as the first variable, and if you just keep substituting the other variables in, you get the equation you need.
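For what it is worth, here is one way those six steps might look in R. It is only a sketch: the data are simulated, the helpers `lam` and `take.out` are illustrative names, and the back-substitution is done numerically rather than by hand. Choosing the first matcher to be a column of ones is an assumption made here to show how a constant term can be handled.

```r
set.seed(6)
n  <- 50
x1 <- rep(1, n)                       # a constant matcher plays the role of the intercept
x2 <- rnorm(n)
x3 <- 0.5 * x2 + rnorm(n)
y  <- 2 * x1 + 1 * x2 - 3 * x3 + rnorm(n)

lam      <- function(y, x) sum(x * y) / sum(x * x)   # matching coefficient
take.out <- function(y, x) y - lam(y, x) * x         # matching residuals

# Steps 1-3: remove x1 from x2, x3 and y.
x2.1 <- take.out(x2, x1); x3.1 <- take.out(x3, x1); y.1 <- take.out(y, x1)
# Steps 4-5: remove x2.1 from x3.1 and y.1.
x3.12 <- take.out(x3.1, x2.1); y.12 <- take.out(y.1, x2.1)
# Step 6: match y.12 to x3.12; this coefficient is B3.
b3 <- lam(y.12, x3.12)
# Back-substitute: peel the known terms off and match what is left.
b2 <- lam(y.1 - b3 * x3.1, x2.1)
b1 <- lam(y - b2 * x2 - b3 * x3, x1)

print(rbind(sequential = c(b1, b2, b3),
            lm = unname(coef(lm(y ~ x2 + x3)))))     # (Intercept), x2, x3
```

With a constant matcher, "taking out" `x1` is just subtracting the mean, so `x2.1` equals `x2 - mean(x2)`; that is the mean-centering the original question started from.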

— Fairly Nerdy