I edited my answer earlier today. I have now generated example data on which to run the code. Others have rightly suggested that you use the caret package, and I agree. In some cases, though, you may need to write your own code. Below, I have tried to demonstrate how to use the sample() function in R to randomly assign observations to cross-validation folds. I also use for loops to perform variable pre-selection (using univariate linear regression with a lenient p-value cutoff of 0.1) and model building (using stepwise regression) on the ten training sets. You can then write your own code to apply the resulting models to the validation folds. I hope this helps!
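For reference, here is a minimal sketch of the caret approach mentioned above (assuming a data frame Data with a continuous outcome named Dependent, as generated in the script below; the linear-model method here is illustrative only):

library(caret)
ctrl <- trainControl(method = "cv", number = 10)
caretFit <- train(Dependent ~ ., data = Data, method = "lm", trControl = ctrl)
caretFit$results  ## cross-validated performance estimates (e.g., RMSE)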
################################################################################
## Load the MASS library, which contains the "stepAIC" function for performing
## stepwise regression, to be used later in this script
library(MASS)
################################################################################
################################################################################
## Generate example data, with 100 observations (rows), 70 variables (columns 1
## to 70), and a continuous dependent variable (column 71)
Data <- as.data.frame(matrix(rnorm(100 * 71), nrow = 100, ncol = 71))
names(Data)[71] <- "Dependent"
################################################################################
################################################################################
## Create ten folds for cross-validation. Each observation in your data will
## randomly be assigned to one of ten folds.
Data$Fold <- sample(rep(1:10, 10))
## Each fold will have the same number of observations assigned to it. You can
## double check this by typing the following:
table(Data$Fold)
## Note: If you were to have 105 observations instead of 100, you could instead
## write: Data$Fold <- sample(c(rep(1:10,10),rep(1:5,1)))
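## A more general version of this (just a sketch) lets rep() handle any number
## of rows directly, e.g.: Data$Fold <- sample(rep(1:10, length.out = nrow(Data)))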
################################################################################
################################################################################
## I like to use a "for loop" for cross-validation. Here, prior to beginning my
## "for loop", I will initialize the objects I plan to fill inside it. You have
## to do this first or R will give you an error.
fit <- list()
stepw <- list()
training <- list()
testing <- list()
Preselection <- list()
Selected <- list()
variables <- list()
################################################################################
################################################################################
## Now we can begin the ten-fold cross validation. First, we open the "for loop"
for (CV in 1:10) {
## Now we define your training and testing folds. I like to store these data in
## a list, so at the end of the script, if I want to, I can go back and look at
## the observations in each individual fold
training[[CV]] <- Data[which(Data$Fold != CV),]
testing[[CV]] <- Data[which(Data$Fold == CV),]
## We can preselect variables by analyzing each variable separately using
## univariate linear regression and then ranking them by p value. First we will
## define the container object to which we plan to output these data.
Preselection[[CV]] <- data.frame()
## Now we will run a separate linear regression for each of our 70 variables.
## We will store the variable name and the coefficient p value in our object
## called "Preselection".
for (i in 1:70) {
Preselection[[CV]][i,1] <- i
Preselection[[CV]][i,2] <- summary(lm(Dependent ~ training[[CV]][,i] , data = training[[CV]]))$coefficients[2,4]
}
## Now we will remove "i" and also we will name the columns of our new object.
rm(i)
names(Preselection[[CV]]) <- c("Variable", "pValue")
## Now we will make note of those variables whose p values were less than 0.1.
Selected[[CV]] <- Preselection[[CV]][which(Preselection[[CV]]$pValue <= 0.1),] ; row.names(Selected[[CV]]) <- NULL
## Fit a model using the pre-selected variables to the training fold
## First we must assemble the right-hand side of the model formula as a
## character string, using the plain variable names (V1, V2, ...) so that the
## fitted model can later be applied to the testing fold with predict()
variables[[CV]] <- paste(paste0("V", Selected[[CV]]$Variable), collapse = " + ")
## Now we can use this string to build the model formula
form <- as.formula(paste("Dependent ~", variables[[CV]]))
## We can build a model using all of the pre-selected variables
fit[[CV]] <- lm(form, data = training[[CV]])
## Then we can build new models using stepwise removal of these variables using
## the MASS package
stepw[[CV]] <- stepAIC(fit[[CV]], direction="both")
## End for loop
}
## Now you have your ten training and validation sets saved as training[[CV]]
## and testing[[CV]]. You also have results from your univariate pre-selection
## analyses saved as Preselection[[CV]]. Those variables that had p values less
## than 0.1 are saved in Selected[[CV]]. Models built using these variables are
## saved in fit[[CV]]. Reduced versions of these models (by stepwise selection)
## are saved in stepw[[CV]].
## Now you might consider using the predict.lm function from the stats package
## to apply your ten models to their corresponding validation folds. You then
## could look at the performance of the ten models and average their performance
## statistics together to get an overall idea of how well your data predict the
## outcome.
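## For example (just a sketch, using root-mean-squared error as an illustrative
## performance statistic):
RMSE <- numeric(10)
for (CV in 1:10) {
predicted <- predict(stepw[[CV]], newdata = testing[[CV]])
RMSE[CV] <- sqrt(mean((testing[[CV]]$Dependent - predicted)^2))
}
mean(RMSE)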
################################################################################
Before you perform cross-validation, please read up on its correct use. These two references offer excellent discussions of cross-validation:

- Simon RM, Subramanian J, Li MC, Menezes S. Using cross-validation to evaluate predictive accuracy of survival risk classifiers based on high-dimensional data. Brief Bioinform. 2011 May;12(3):203-14. Epub 2011 Feb 15. http://bib.oxfordjournals.org/content/12/3/203.long
- Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. JNCI J Natl Cancer Inst (2003) 95(1):14-18. http://jnci.oxfordjournals.org/content/95/1/14.long

These papers are aimed at biostatisticians, but they would be useful to anyone.

Also, keep in mind that using stepwise regression is risky (although using cross-validation should help alleviate overfitting). A good discussion of stepwise regression is available here: http://www.stata.com/support/faqs/stat/stepwise.html.

Let me know if you have any further questions!