How does R's lm handle missing values?



I would like to regress a vector B against each of the columns in a matrix A. This is trivial if there are no missing data, but if matrix A contains missing values, then my regression against A is constrained to include only rows where all values are present (the default na.omit behavior). This produces incorrect results for columns with no missing data. I can regress the columnar matrix B against individual columns of the matrix A, one at a time, but I have thousands of regressions to do, and this is prohibitively slow and inelegant. The na.exclude function seems to be designed for this case, but I can't make it work. What am I doing wrong here? Using R 2.13 on OSX, if it matters.

A = matrix(1:20, nrow=10, ncol=2)
B = matrix(1:10, nrow=10, ncol=1)
dim(lm(A~B)$residuals)
# [1] 10 2 (the expected 10 residual values)

# Missing value in first column; now we have 9 residuals
A[1,1] = NA  
dim(lm(A~B)$residuals)
#[1]  9 2 (the expected 9 residuals, given na.omit() is the default)

# Call lm with na.exclude; still have 9 residuals
dim(lm(A~B, na.action=na.exclude)$residuals)
#[1]  9 2 (was hoping to get a 10x2 matrix with a missing value here)

A.ex = na.exclude(A)
dim(lm(A.ex~B)$residuals)
# Throws an error because dim(A.ex)==9,2
#Error in model.frame.default(formula = A.ex ~ B, drop.unused.levels = TRUE) : 
#  variable lengths differ (found for 'B')
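The one-column-at-a-time fallback mentioned above can be sketched as follows (a minimal illustration of the slow approach, not code from the original post): each lm() call drops only the rows missing in that particular column.

```r
# One lm() call per column of A, so each fit drops only the rows
# that are missing in *that* column (default na.omit behavior).
A <- matrix(rnorm(20), nrow = 10, ncol = 2)
B <- matrix(rnorm(10), nrow = 10, ncol = 1)
A[1, 1] <- NA                            # NA only in column 1

fits <- lapply(seq_len(ncol(A)), function(j) lm(A[, j] ~ B))
sapply(fits, function(f) length(residuals(f)))
# column 1 is fit on 9 cases, column 2 keeps all 10
```

This gives the per-column casewise deletion the question asks for, at the cost of one lm() call per column.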

What do you mean by "I can calculate each row individually"?
chl

Sorry, I should have said "I can regress the columnar matrix B against the columns of A one at a time," meaning one call to lm at a time. Edited to reflect this.
David Quigley

Calling lm one regression at a time is not a good way to perform regression (going by the definition of regression).
KarthikS

Answers:



Edit: I had misunderstood your question. There are two aspects:

a) Both na.omit and na.exclude do casewise deletion with respect to both predictors and criteria. They differ only in that extractor functions like residuals() or fitted() pad their output with NAs for the omitted cases when na.exclude is used, producing output of the same length as the input variables.

> N    <- 20                               # generate some data
> y1   <- rnorm(N, 175, 7)                 # criterion 1
> y2   <- rnorm(N,  30, 8)                 # criterion 2
> x    <- 0.5*y1 - 0.3*y2 + rnorm(N, 0, 3) # predictor
> y1[c(1, 3,  5)] <- NA                    # some NA values
> y2[c(7, 9, 11)] <- NA                    # some other NA values
> Y    <- cbind(y1, y2)                    # matrix for multivariate regression
> fitO <- lm(Y ~ x, na.action=na.omit)     # fit with na.omit
> dim(residuals(fitO))                     # use extractor function
[1] 14  2

> fitE <- lm(Y ~ x, na.action=na.exclude)  # fit with na.exclude
> dim(residuals(fitE))                     # use extractor function -> = N
[1] 20  2

> dim(fitE$residuals)                      # access residuals directly
[1] 14  2

b) The real issue is not the difference between na.omit and na.exclude. You do not seem to want casewise deletion that takes criterion variables into account, which both of them do.

> X <- model.matrix(fitE)                  # design matrix
> dim(X)                                   # casewise deletion -> only 14 complete cases
[1] 14  2

lm() with a multivariate response relies on a single design matrix X for all criteria: the pseudoinverse X⁺ = (X'X)⁻¹ X' of the design matrix X gives the coefficients β̂ = X⁺ Y, and the hat matrix H = X X⁺ gives the fitted values Ŷ = H Y. If you don't want casewise deletion, you need a different design matrix X for each column of Y, so there's no way around fitting separate regressions for each criterion. You can try to avoid the overhead of lm() by doing something along the lines of the following:

> Xf <- model.matrix(~ x)                    # full design matrix (all cases)
# function: manually calculate coefficients and fitted values for single criterion y
> getFit <- function(y) {
+     idx   <- !is.na(y)                     # throw away NAs
+     Xsvd  <- svd(Xf[idx , ])               # SVD decomposition of X
+     # get X+ but note: there might be better ways
+     Xplus <- tcrossprod(Xsvd$v %*% diag(Xsvd$d^(-2)) %*% t(Xsvd$v), Xf[idx, ])
+     list(coefs=(Xplus %*% y[idx]), yhat=(Xf[idx, ] %*% Xplus %*% y[idx]))
+ }

> res <- apply(Y, 2, getFit)    # get fits for each column of Y
> res$y1$coefs
                   [,1]
(Intercept) 113.9398761
x             0.7601234

> res$y2$coefs
                 [,1]
(Intercept) 91.580505
x           -0.805897

> coefficients(lm(y1 ~ x))      # compare with separate results from lm()
(Intercept)           x 
113.9398761   0.7601234 

> coefficients(lm(y2 ~ x))
(Intercept)           x 
  91.580505   -0.805897

Note that there might be numerically better ways to calculate X⁺ and H; you could check a QR decomposition instead. The SVD approach is explained here on SE. I have not timed the above approach with big matrices Y against actually using lm().
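The QR-based variant mentioned above can be sketched like this (my own sketch, not from the answer; base R's qr() and qr.coef() are used, which is also what lm() does internally). The function and argument names here are my own:

```r
# Per-column least-squares fit via QR decomposition instead of the SVD
# pseudoinverse: casewise deletion is applied per criterion column only.
getFitQR <- function(y, Xf) {
    idx <- !is.na(y)                       # drop NAs for this column only
    Xi  <- Xf[idx, , drop = FALSE]
    qrX <- qr(Xi)                          # QR decomposition of the design
    coefs <- qr.coef(qrX, y[idx])          # solves min ||Xi b - y||
    list(coefs = coefs, yhat = Xi %*% coefs)
}

# usage with the objects from above:
# res <- apply(Y, 2, getFitQR, Xf = model.matrix(~ x))
```

QR is generally the standard choice here; it avoids forming X'X and its squared condition number.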


That makes sense given my understanding of how na.exclude should work. However, if you call >X.both = cbind(X1, X2) and then >dim(lm(X.both~Y, na.action=na.exclude)$residuals) you still get 94 residuals, instead of 97 and 97.
David Quigley

That's an improvement, but if you look at residuals(lm(X.both ~ Y, na.action=na.exclude)), you see that each column has six missing values, even though the missing values in column 1 of X.both are from different samples than those in column 2. So na.exclude is preserving the shape of the residuals matrix, but under the hood R is apparently only regressing with values present in all rows of X.both. There may be a good statistical reason for this, but for my application it's a problem.
David Quigley

@David I had misunderstood your question. I think I now see your point, and have edited my answer to address it.
caracal

5

I can think of two ways. One is to combine the data, apply na.exclude, and then separate the data again:

A = matrix(1:20, nrow=10, ncol=2)
colnames(A) <- paste("A",1:ncol(A),sep="")

B = matrix(1:10, nrow=10, ncol=1)
colnames(B) <- paste("B",1:ncol(B),sep="")

C <- cbind(A,B)

C[1,1] <- NA
C.ex <- na.exclude(C)

A.ex <- C.ex[,colnames(A)]
B.ex <- C.ex[,colnames(B)]

lm(A.ex~B.ex)

Another way is to use the data argument and create a formula.

Cd <- data.frame(C)
fr <- formula(paste("cbind(",paste(colnames(A),collapse=","),")~",paste(colnames(B),collapse="+"),sep=""))

lm(fr,data=Cd)

Cd[1,1] <-NA

lm(fr,data=Cd,na.action=na.exclude)

If you are doing a lot of regressions, the first way should be faster, since less background magic is performed. If you need only coefficients and residuals, I suggest using lsfit, which is much faster than lm. The second way is a bit nicer, but on my laptop calling summary on the resulting regression throws an error. I will try to see whether this is a bug.
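The lsfit suggestion can be sketched as follows (my own per-column loop, not code from the answer; lsfit does not accept NAs, so each column is filtered before fitting):

```r
# lsfit() returns coefficients and residuals with much less overhead than
# lm(). Applied per column, so each column's NAs are dropped independently.
A <- matrix(rnorm(20), nrow = 10, ncol = 2)
B <- matrix(rnorm(10), nrow = 10, ncol = 1)
A[1, 1] <- NA

fits <- lapply(seq_len(ncol(A)), function(j) {
    ok <- !is.na(A[, j])                       # casewise deletion per column
    lsfit(B[ok, , drop = FALSE], A[ok, j])     # intercept added by default
})
sapply(fits, function(f) f$coef)               # coefficients per column
```

For thousands of columns this avoids lm()'s formula and model-frame machinery on every call.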


Thanks, but lm(A.ex~B.ex) in your code fits 9 points against A1 (correct) and 9 points against A2 (undesired). There are 10 measured points for both B1 and A2; I'm throwing out one point in the regression of B1 against A2 because the corresponding point is missing in A1. If that's just The Way It Works I can accept that, but that's not what I'm trying to get R to do.
David Quigley

@David, oh, it looks like I've misunderstood your problem. I'll post the fix later.
mpiktas

1

The following example shows how to make predictions and residuals that conform to the original data frame, using the na.action=na.exclude option in lm() to specify that NAs should be placed in the residual and prediction vectors where the original data frame had missing values. It also shows how to specify whether predictions should cover only observations where both the explanatory and dependent variables were complete (i.e., strictly in-sample predictions) or all observations where the explanatory variables were complete, so that the Xb prediction is possible (i.e., including out-of-sample predictions for observations that had complete explanatory variables but a missing dependent variable).

I use cbind to add the predicted and residual variables to the original dataset.

## Set up data with a linear model
N <- 10
NXmissing <- 2 
X <- runif(N, 0, 10)
Y <- 6 + 2*X + rnorm(N, 0, 1)
## Put in missing values (missing X, missing Y, missing both)
X[ sample(1:N , NXmissing) ] <- NA
Y[ sample(which(is.na(X)), 1)]  <- NA
Y[ sample(which(!is.na(X)), 1)]  <- NA
(my.df <- data.frame(X,Y))

## Run the regression with na.action specified to na.exclude
## This puts NA's in the residual and prediction vectors
my.lm  <- lm( Y ~ X, na.action=na.exclude, data=my.df)

## Predict outcome for observations with complete both explanatory and
## outcome variables, i.e. observations included in the regression
my.predict.insample  <- predict(my.lm)

## Predict outcome for observations with complete explanatory
## variables.  The newdata= option specifies the dataset on which
## to apply the coefficients
my.predict.inandout  <- predict(my.lm,newdata=my.df)

## Predict residuals 
my.residuals  <- residuals(my.lm)

## Make sure that it binds correctly
(my.new.df  <- cbind(my.df,my.predict.insample,my.predict.inandout,my.residuals))

## or in one fell swoop

(my.new.df  <- cbind(my.df,yhat=predict(my.lm),yhato=predict(my.lm,newdata=my.df),uhat=residuals(my.lm)))
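The NA pattern described above can be checked directly (a self-contained sketch of my own, with hypothetical variable names): with na.action=na.exclude, residuals() is NA wherever X or Y is missing, while predict() with newdata is NA only where X is missing.

```r
# Verify the NA alignment produced by na.exclude.
set.seed(42)
X <- runif(10, 0, 10)
Y <- 6 + 2 * X + rnorm(10)
X[2] <- NA                         # missing explanatory variable
Y[5] <- NA                         # missing outcome
d   <- data.frame(X, Y)
fit <- lm(Y ~ X, data = d, na.action = na.exclude)

is.na(residuals(fit))              # TRUE at rows 2 and 5
is.na(predict(fit, newdata = d))   # TRUE at row 2 only (Xb possible at row 5)
```

So the residual vector lines up with the original data frame, and the out-of-sample prediction fills in the row with missing Y but complete X.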
Licensed under cc by-sa 3.0 with attribution required.