I edited my answer earlier today. I have now generated example data on which to run the code. Others have rightly suggested that you use the caret package, and I agree. In some cases, though, you may need to write your own code. Below, I have tried to demonstrate how to use the sample() function in R to randomly assign observations to cross-validation folds. I also use for loops to perform variable pre-selection (using univariate linear regression with a lenient p-value cutoff of 0.1) and model building (using stepwise regression) on the ten training sets. You can then write your own code to apply the resulting models to the validation folds. I hope this helps!
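For reference, here is a minimal sketch of the caret approach mentioned above (assuming a data frame Data with a continuous outcome named Dependent, as generated in the script below; the linear-model method here is illustrative only):

library(caret)
ctrl <- trainControl(method = "cv", number = 10)
caretFit <- train(Dependent ~ ., data = Data, method = "lm", trControl = ctrl)
caretFit$results  ## cross-validated performance estimates (e.g., RMSE)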
################################################################################
## Load the MASS library, which contains the "stepAIC" function for performing
## stepwise regression, to be used later in this script
library(MASS)
################################################################################
################################################################################
## Generate example data, with 100 observations (rows), 70 variables (columns 1
## to 70), and a continuous dependent variable (column 71)
Data <- as.data.frame(matrix(rnorm(100 * 71), nrow = 100, ncol = 71))
names(Data)[71] <- "Dependent"
################################################################################
################################################################################
## Create ten folds for cross-validation. Each observation in your data will
## randomly be assigned to one of ten folds.
Data$Fold <- sample(rep(1:10, 10))
## Each fold will have the same number of observations assigned to it. You can
## double check this by typing the following:
table(Data$Fold)
## Note: If you were to have 105 observations instead of 100, you could instead
## write: Data$Fold <- sample(c(rep(1:10,10),rep(1:5,1)))
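## A more general version of this (just a sketch) lets rep() handle any number
## of rows directly, e.g.: Data$Fold <- sample(rep(1:10, length.out = nrow(Data)))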
################################################################################
################################################################################
## I like to use a "for loop" for cross-validation. Here, prior to beginning my
## "for loop", I will initialize the objects I plan to fill inside it. You have
## to do this first or R will give you an error.
fit <- list()
stepw <- list()
training <- list()
testing <- list()
Preselection <- list()
Selected <- list()
variables <- list()
################################################################################
################################################################################
## Now we can begin the ten-fold cross validation. First, we open the "for loop"
for (CV in 1:10) {
## Now we define your training and testing folds. I like to store these data in
## a list, so at the end of the script, if I want to, I can go back and look at
## the observations in each individual fold
training[[CV]] <- Data[which(Data$Fold != CV),]
testing[[CV]] <- Data[which(Data$Fold == CV),]
## We can preselect variables by analyzing each variable separately using
## univariate linear regression and then ranking them by p value. First we will
## define the container object to which we plan to output these data.
Preselection[[CV]] <- data.frame()
## Now we will run a separate linear regression for each of our 70 variables.
## We will store the variable name and the coefficient p value in our object
## called "Preselection".
for (i in 1:70) {
Preselection[[CV]][i,1] <- i
Preselection[[CV]][i,2] <- summary(lm(Dependent ~ training[[CV]][,i] , data = training[[CV]]))$coefficients[2,4]
}
## Now we will remove "i" and also we will name the columns of our new object.
rm(i)
names(Preselection[[CV]]) <- c("Variable", "pValue")
## Now we will make note of those variables whose p values were less than 0.1.
Selected[[CV]] <- Preselection[[CV]][which(Preselection[[CV]]$pValue <= 0.1),] ; row.names(Selected[[CV]]) <- NULL
## Fit a model using the pre-selected variables to the training fold
## First we must assemble the right-hand side of the model formula as a
## character string, using the plain variable names (V1, V2, ...) so that the
## fitted model can later be applied to the testing fold with predict()
variables[[CV]] <- paste(paste0("V", Selected[[CV]]$Variable), collapse = " + ")
## Now we can use this string to build the model formula
form <- as.formula(paste("Dependent ~", variables[[CV]]))
## We can build a model using all of the pre-selected variables
fit[[CV]] <- lm(form, data = training[[CV]])
## Then we can build new models using stepwise removal of these variables using
## the MASS package
stepw[[CV]] <- stepAIC(fit[[CV]], direction="both")
## End for loop
}
## Now you have your ten training and validation sets saved as training[[CV]]
## and testing[[CV]]. You also have results from your univariate pre-selection
## analyses saved as Preselection[[CV]]. Those variables that had p values less
## than 0.1 are saved in Selected[[CV]]. Models built using these variables are
## saved in fit[[CV]]. Reduced versions of these models (by stepwise selection)
## are saved in stepw[[CV]].
## Now you might consider using the predict.lm function from the stats package
## to apply your ten models to their corresponding validation folds. You then
## could look at the performance of the ten models and average their performance
## statistics together to get an overall idea of how well your data predict the
## outcome.
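## For example (just a sketch, using root-mean-squared error as an illustrative
## performance statistic):
RMSE <- numeric(10)
for (CV in 1:10) {
predicted <- predict(stepw[[CV]], newdata = testing[[CV]])
RMSE[CV] <- sqrt(mean((testing[[CV]]$Dependent - predicted)^2))
}
mean(RMSE)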
################################################################################
Before you perform cross-validation, please read up on its correct use. These two references offer excellent discussions of cross-validation:

- Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May;12(3):203-14. Epub 2011 Feb 15. http://bib.oxfordjournals.org/content/12/3/203.long
- Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. JNCI J Natl Cancer Inst (2003) 95(1):14-18. http://jnci.oxfordjournals.org/content/95/1/14.long

These papers are aimed at biostatisticians, but they would be useful to anyone.

Also, keep in mind that using stepwise regression is risky (although using cross-validation should help alleviate overfitting). A good discussion of stepwise regression is available here: http://www.stata.com/support/faqs/stat/stepwise.html.

Let me know if you have any further questions!