Edit: I misunderstood your question. There are two aspects to it.
a) na.omit and na.exclude both do casewise deletion with respect to both predictors and criteria. They only differ in that extractor functions like residuals() or fitted() will pad their output with NAs for the omitted cases under na.exclude, thus giving an output of the same length as the input variables.
> N <- 20 # generate some data
> y1 <- rnorm(N, 175, 7) # criterion 1
> y2 <- rnorm(N, 30, 8) # criterion 2
> x <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3, 5)] <- NA # some NA values
> y2[c(7, 9, 11)] <- NA # some other NA values
> Y <- cbind(y1, y2) # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit) # fit with na.omit
> dim(residuals(fitO)) # use extractor function
[1] 14 2
> fitE <- lm(Y ~ x, na.action=na.exclude) # fit with na.exclude
> dim(residuals(fitE)) # use extractor function -> = N
[1] 20 2
> dim(fitE$residuals) # access residuals directly
[1] 14 2
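As a quick cross-check (a sketch; the row indices follow from the NA positions seeded above), the NA-padded rows of residuals(fitE) are exactly the incomplete cases:
> which(!complete.cases(Y)) # rows NA-padded in residuals(fitE)
[1]  1  3  5  7  9 11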
b) The real issue is not this difference between na.omit and na.exclude. You don't seem to want casewise deletion that takes the criterion variables into account, which is what both of them do.
> X <- model.matrix(fitE) # design matrix
> dim(X) # casewise deletion -> only 14 complete cases
[1] 14 2
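complete.cases() confirms this count directly (a sketch; x contains no missing values here, so only the NAs in Y matter):
> sum(complete.cases(Y, x)) # complete cases across criteria and predictor
[1] 14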
The regression results depend on the matrices $X^{+}=(X'X)^{-1}X'$ (the pseudoinverse of the design matrix $X$, which gives the coefficients $\hat{\beta}=X^{+}Y$) and on the hat matrix $H=XX^{+}$ (which gives the fitted values $\hat{Y}=HY$). If you don't want casewise deletion, you need a different design matrix $X$ for each column of $Y$, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:
> Xf <- model.matrix(~ x) # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+ idx <- !is.na(y) # throw away NAs
+ Xsvd <- svd(Xf[idx, ]) # SVD decomposition of X
+ # get X+ but note: there might be better ways
+ Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+ list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }
> res <- apply(Y, 2, getFit) # get fits for each column of Y
> res$y1$coefs
[,1]
(Intercept) 113.9398761
x 0.7601234
> res$y2$coefs
[,1]
(Intercept) 91.580505
x -0.805897
> coefficients(lm(y1 ~ x)) # compare with separate results from lm()
(Intercept) x
113.9398761 0.7601234
> coefficients(lm(y2 ~ x))
(Intercept) x
91.580505 -0.805897
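The fitted values can be checked the same way; since both routes compute $\hat{Y}=HY$ on the same complete cases, all.equal() should report TRUE up to floating point:
> all.equal(as.vector(res$y1$yhat), unname(fitted(lm(y1 ~ x))))
[1] TRUE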
Note that there might be numerically better ways to calculate $X^{+}$ and $H$; you could check a QR decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices $Y$ against actually using lm().
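For the QR route, here is a minimal untested sketch using base R's qr(), qr.coef(), and qr.fitted() (this mirrors what lm() does internally; getFitQR is a hypothetical name):
> getFitQR <- function(y) {
+ idx <- !is.na(y) # complete cases for this criterion
+ qrX <- qr(Xf[idx, ]) # QR decomposition of reduced design matrix
+ list(coefs=qr.coef(qrX, y[idx]), # solves R %*% b = t(Q) %*% y
+      yhat=qr.fitted(qrX, y[idx])) # fitted values Q %*% t(Q) %*% y
+ }
> resQR <- apply(Y, 2, getFitQR) # should agree with the SVD-based fits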