Using lm for a 2-sample proportion test


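I have been using linear models to perform 2-sample proportion tests for a while, but I have realized that this may not be completely correct. A generalized linear model with a binomial family and an identity link appears to reproduce the unpooled 2-sample proportion test exactly. However, a linear model (or a glm with the gaussian family) gives a slightly different result. I am rationalizing that this might be due to how R solves the glm for the binomial versus the gaussian family, but could there be another cause?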

## prop.test gives pooled 2-sample proportion result
## glm w/ binomial family gives unpooled 2-sample proportion result
## lm and glm w/ gaussian family give unknown result

library(dplyr)
library(broom)
set.seed(12345)

## set up dataframe -------------------------
n_A <- 5000
n_B <- 5000

outcome <- rbinom(
  n = n_A + n_B,
  size = 1,
  prob = 0.5
)
treatment <- c(
  rep("A", n_A),
  rep("B", n_B)
)

## tbl_df() is deprecated in dplyr; tibble() (re-exported by dplyr) builds the same object
df <- tibble(outcome = outcome, treatment = treatment)


## by hand, 2-sample prop tests ---------------------------------------------
p_A <- sum(df$outcome[df$treatment == "A"])/n_A
p_B <- sum(df$outcome[df$treatment == "B"])/n_B

p_pooled <- sum(df$outcome)/(n_A + n_B)
z_pooled <- (p_B - p_A) / sqrt( p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B) )
pvalue_pooled <- 2*(1-pnorm(abs(z_pooled)))

z_unpooled <- (p_B - p_A) / sqrt( (p_A * (1 - p_A))/n_A + (p_B * (1 - p_B))/n_B )
pvalue_unpooled <- 2*(1-pnorm(abs(z_unpooled)))


## using prop.test --------------------------------------
res_prop_test <- tidy(prop.test(
  x = c(sum(df$outcome[df$treatment == "A"]), 
        sum(df$outcome[df$treatment == "B"])),
  n = c(n_A, n_B),
  correct = FALSE
))
res_prop_test # same as pvalue_pooled
all.equal(res_prop_test$p.value, pvalue_pooled)
# [1] TRUE


## using glm with binomial family and identity link -----------------------------------
res_glm_binomial <- df %>%
  do(tidy(glm(outcome ~ treatment, data = ., family = binomial(link = "identity")))) %>%
  filter(term == "treatmentB")
res_glm_binomial # same as pvalue_unpooled
all.equal(res_glm_binomial$p.value, pvalue_unpooled)
# [1] TRUE


## glm and lm gaussian --------------------------------

res_glm <- df %>%
  do(tidy(glm(outcome ~ treatment, data = .))) %>%  # default family is gaussian
  filter(term == "treatmentB")
res_glm 
all.equal(res_glm$p.value, pvalue_unpooled)
all.equal(res_glm$p.value, pvalue_pooled)

res_lm <- df %>%
  do(tidy(lm(outcome ~ treatment, data = .))) %>%
  filter(term == "treatmentB")
res_lm
all.equal(res_lm$p.value, pvalue_unpooled)
all.equal(res_lm$p.value, pvalue_pooled)

all.equal(res_lm$p.value, res_glm$p.value)
# [1] TRUE

r hypothesis-testing generalized-linear-model proportion

— Hilary Parker


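It has nothing to do with how R solves the optimization problems that correspond to fitting the models; it has to do with the actual optimization problems the models pose.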

Specifically, in large samples, you can effectively think of it as comparing two weighted least squares problems.

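The linear model (lm) assumes (when unweighted) that the variance of the proportions is constant. The glm assumes that the variance of the proportions comes from the binomial assumption, $\text{Var}(\hat{p})=\text{Var}(X/n) = p(1-p)/n$. This weights the data points differently, and so leads to somewhat different estimates* and a different variance for the difference.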

* At least in some situations, though not necessarily in a straight comparison of proportions.
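A minimal sketch of the two variance assumptions, using hypothetical simulated data (the unequal group sizes and proportions below are chosen here purely for illustration and are not taken from the question). It rebuilds each model's standard error for the group coefficient by hand: one shared residual variance for lm, and a separate $p(1-p)/n$ per group for the binomial glm.

```r
# Hypothetical data: unequal group sizes and unequal true proportions,
# so that the two variance assumptions give visibly different answers.
set.seed(2024)
n_A <- 400
n_B <- 100
y <- c(rbinom(n_A, 1, 0.2), rbinom(n_B, 1, 0.5))
g <- factor(rep(c("A", "B"), times = c(n_A, n_B)))

fit_lm  <- lm(y ~ g)
fit_glm <- glm(y ~ g, family = binomial(link = "identity"))

# Sample proportions per group
p_hat <- tapply(y, g, mean)
p_A <- p_hat[["A"]]
p_B <- p_hat[["B"]]

# Constant-variance SE: one residual variance shared by both groups
s2 <- sum(residuals(fit_lm)^2) / (n_A + n_B - 2)
se_constant <- sqrt(s2 * (1 / n_A + 1 / n_B))

# Binomial-variance SE: each group contributes its own p(1 - p) / n
se_binomial <- sqrt(p_A * (1 - p_A) / n_A + p_B * (1 - p_B) / n_B)

# Standard errors reported by the fitted models
se_lm  <- coef(summary(fit_lm))["gB", "Std. Error"]
se_glm <- coef(summary(fit_glm))["gB", "Std. Error"]

# Point estimates agree (both are the difference in sample proportions) ...
print(c(lm = unname(coef(fit_lm)["gB"]), glm = unname(coef(fit_glm)["gB"])))
# ... but each reported SE follows its own variance assumption
print(c(se_lm = se_lm, se_constant = se_constant))
print(c(se_glm = se_glm, se_binomial = se_binomial))
```

In this straight two-group comparison the point estimates coincide and only the standard errors differ. Separately, `summary.lm` refers its statistic to a t distribution while the binomial `summary.glm` uses a normal one, so the p-values can differ slightly even where the standard errors agree.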

— Glen_b -Reinstate Monica


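In terms of calculation, compare the standard error of the treatmentB coefficient for lm versus the binomial glm. You already have the formula for the standard error of the treatmentB coefficient in the binomial glm (the denominator of z_unpooled). The standard error of the treatmentB coefficient in the standard lm is (SE_lm):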

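    # Assumes df, n_A, and n_B from the question are in the workspace
    test <- lm(outcome ~ treatment, data = df)
    treat_B <- as.numeric(df$treatment == "B")  # 0/1 indicator for group B
    # Residual variance (divisor n - 2) over the sum of squares of the indicator
    SE_lm <- sqrt(sum(residuals(test)^2) / (n_A + n_B - 2) /
                  sum((treat_B - mean(treat_B))^2))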

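See this post for a derivation. The only difference is that here the sample error variance is used instead of $\sigma^2$ (i.e., 2 is subtracted from $n_A+n_B$ for the lost degrees of freedom). Without that $-2$, the lm and binomial glm standard errors actually seem to match when $n_A = n_B$.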
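A short sketch of why that match appears, assuming the treatment is coded as a 0/1 indicator $x_i$ and writing $\hat p_A$, $\hat p_B$ for the two sample proportions (notation introduced here for illustration, not taken from the linked post). Because the outcome is binary, the residual sum of squares within each group is $n\hat p(1-\hat p)$, so

$$
\text{RSS} = n_A\,\hat p_A(1-\hat p_A) + n_B\,\hat p_B(1-\hat p_B),
\qquad
\sum_i (x_i - \bar x)^2 = \frac{n_A n_B}{n_A + n_B}.
$$

Dividing the residual sum of squares by $n_A + n_B$ rather than $n_A + n_B - 2$ gives

$$
\frac{\text{RSS}}{n_A + n_B}\cdot\frac{n_A + n_B}{n_A n_B}
= \frac{\hat p_A(1-\hat p_A)}{n_B} + \frac{\hat p_B(1-\hat p_B)}{n_A}.
$$

The denominators are swapped relative to the unpooled variance $\hat p_A(1-\hat p_A)/n_A + \hat p_B(1-\hat p_B)/n_B$, so the two expressions coincide when $n_A = n_B$ and in general differ otherwise.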

— jac