How to interpret a QQ plot

172

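I am working with a small data set (21 observations) and have the following normal QQ plot in R:

[Figure: normal QQ plot of the 21 observations]

Since the plot does not support normality, what can I infer about the underlying distribution? It seems to me that a right-skewed distribution would be a better fit. Is that right? Also, what other conclusions can be drawn from the data?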

r data-visualization inference qq-plot

— John

9

You are correct that it indicates right skewness. I'll try to locate some of the posts on interpreting QQ plots.

— Glen_b

3

You don't have to reach a conclusion; you only need to decide what to try next. Here I would consider taking the square root or the logarithm of the data.

— Nick Cox

11

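Tukey's Three-Point Method works well for using QQ plots to identify a re-expression that makes a variable approximately normal. For example, picking points near the ends of the tails and a middle point in this figure (which I estimate to be $(-1.5,2)$, $(1.5,220)$, and $(0,70)$), you will easily find that the square root comes close to linearizing them. You can therefore infer that the underlying distribution is approximately square-root normal.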

— whuber
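The three-point check described in the comment above can be sketched in a few lines of R. This is only an illustration: the three points are the estimates quoted in the comment, while the set of candidate transforms and the "slope ratio" criterion are assumptions chosen here to make the idea concrete.

```r
# Three points read off the QQ plot (values estimated in the comment above):
# x = theoretical normal quantile, y = observed data value.
pts <- data.frame(x = c(-1.5, 0, 1.5), y = c(2, 70, 220))

# Candidate re-expressions (an assumed, illustrative selection).
transforms <- list(
  identity    = function(y) y,
  square_root = sqrt,
  cube_root   = function(y) y^(1 / 3),
  log         = log
)

# For each transform, divide the slope of the upper segment by the slope of
# the lower segment. A ratio near 1 means the three points are nearly collinear.
slope_ratio <- sapply(transforms, function(f) {
  ty <- f(pts$y)
  lower <- (ty[2] - ty[1]) / (pts$x[2] - pts$x[1])
  upper <- (ty[3] - ty[2]) / (pts$x[3] - pts$x[2])
  upper / lower
})

print(round(slope_ratio, 2))

# Transform whose ratio is closest to 1 (measured on the log scale).
best <- names(which.min(abs(log(slope_ratio))))
print(best)
```

With these three points the raw values give a ratio of about 2.2 (strong upward curvature), the square root gives about 0.93, and the logarithm overcorrects, which agrees with the comment's conclusion.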

3

@Glen_b The answer to my question here has some information: stats.stackexchange.com/questions/71065/… and the link in that answer has another good source: stats.stackexchange.com/questions/52212/qq-plot-does-not-match-histogram

— tpg2114

What about this one? Does this QQ plot show that the data are not normally distributed? [Figure: QQ plot]

— David

292

If the values lie along a line, the distribution has the same shape (up to location and scale) as the theoretical distribution we have assumed.
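To see what "up to location and scale" means in practice, here is a minimal sketch. The mean of 10, the standard deviation of 3, the sample size and the seed are arbitrary assumptions for illustration; `qqnorm()` with `plot.it = FALSE` is used only to get the plotted coordinates.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated normal data with an assumed mean of 10 and standard deviation of 3.
y <- rnorm(1000, mean = 10, sd = 3)

# qqnorm() returns the plotted coordinates: x = theoretical standard normal
# quantiles, y = the data values.
qq <- qqnorm(y, plot.it = FALSE)

# Least-squares line through the QQ points.
fit <- lm(qq$y ~ qq$x)
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])

cat("intercept (roughly the location):", round(intercept, 2), "\n")
cat("slope (roughly the scale):", round(slope, 2), "\n")
cat("correlation of the QQ points:", round(cor(qq$x, qq$y), 4), "\n")
```

The points stay close to a straight line even though the data are not standard normal; the location and scale only change the line's intercept and slope.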

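Local behaviour: looking at the sorted sample values on the y-axis and the (approximate) expected quantiles on the x-axis, we can see how the values in some section of the plot differ locally from the overall linear trend, by checking whether the values are more or less concentrated than the theoretical distribution would suggest in that section of the plot.

As we see, less concentrated points increase more rapidly, and more concentrated points increase less rapidly, than the overall linear relationship would suggest. In the extreme cases these correspond to a gap in the density of the sample (shown as a near-vertical jump) or a spike of constant values (values aligned horizontally). This lets us spot a heavy tail or a light tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.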
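The local-slope idea can be checked numerically. The sketch below is an illustration under stated assumptions: instead of random samples it uses exact quantiles of three distributions chosen here as examples (exponential, Student t with 3 degrees of freedom, uniform), and `local_slopes()` is a hypothetical helper, not part of any package.

```r
# Evenly spaced probabilities and the matching standard normal quantiles.
p <- ppoints(201)
x <- qnorm(p)

# "Idealized samples": exact quantiles, so no sampling noise is involved.
shapes <- list(
  right_skew  = qexp(p),        # exponential: long right tail
  heavy_tails = qt(p, df = 3),  # Student t with 3 df: both tails heavy
  light_tails = qunif(p)        # uniform: both tails light
)

# Hypothetical helper: average slope of the QQ curve over the 20 lowest
# segments, the 20 middle segments and the 20 highest segments.
local_slopes <- function(y, x) {
  s <- diff(y) / diff(x)
  k <- length(s)
  c(lower  = mean(s[1:20]),
    middle = mean(s[(k / 2 - 9):(k / 2 + 10)]),
    upper  = mean(s[(k - 19):k]))
}

slopes <- sapply(shapes, local_slopes, x = x)
print(round(slopes, 2))
```

A slope that is steeper than in the middle marks a region where the data are more spread out than the normal reference (a heavier tail); a flatter slope marks a region where they are more concentrated (a lighter tail).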

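Overall appearance:

Here is what QQ plots look like on average (for particular choices of distribution):

[Figure: typical QQ plot shapes for several distributions]

But randomness tends to obscure things, especially with small samples:

[Figure: QQ plots of small samples ($n=21$)]

[Figure: a further set of QQ plots of small samples ($n=21$)]

You may also find the suggestion here useful when trying to decide how much you should worry about a particular amount of curvature or wiggliness.

A more suitable guide for interpretation in general would also include displays at smaller and larger sample sizes.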
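One way to get a feel for how much wiggliness is ordinary at this sample size is to plot several samples that are normal by construction. This is a minimal sketch; the number of panels and the seed are arbitrary assumptions, and the correlation of the QQ points is used here only as a rough summary of straightness.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 21       # same sample size as in the question

# Draw six samples that really are normal and draw each one's QQ plot.
samples <- replicate(6, rnorm(n), simplify = FALSE)
op <- par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
for (i in seq_along(samples)) {
  qqnorm(samples[[i]], main = paste("Normal sample", i))
  qqline(samples[[i]])
}
par(op)

# A rough numerical summary of straightness: the correlation of the QQ points.
straightness <- sapply(samples, function(s) {
  qq <- qqnorm(s, plot.it = FALSE)
  cor(qq$x, qq$y)
})
print(round(straightness, 3))
```

Rerunning with different seeds shows that truly normal samples of size 21 can look mildly curved or heavy-tailed by chance alone, which is a useful baseline before reading too much into a single plot.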

— Glen_b

18

This is a very practical guide. Thank you for gathering all of this information.

— JohnK

4

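I understand that what matters is the shape and type of deviation from linearity, but it still looks odd that both axes are labelled "... Quantiles" while one axis goes 0.2, 0.4, 0.6 and the other goes -2, -1, 0, 1, 2. Again, it seems fine that some data points fall within the middle 40% of the theoretical distribution, but how can all of them lie within 3% of their own distribution, as the y-axis of the lower rightmost plot suggests?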

— Macond

2

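@Macond The y-axis shows the raw values of the data, not quantiles. I agree that standardizing the y-axis would make things much clearer, and I don't know why R doesn't do this by default. Could someone shed some light on this?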

— Gordon Gustafson

4

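@GordonGustafson Regarding your first comment to Macond, there is a very good reason not to standardize the data: a QQ plot is a display of the data! It is designed to show the information in the data you supply to the function (standardizing would make about as much sense as standardizing the data you supply to a boxplot or a histogram). If you transform the data, the plot no longer displays the data (the shape of the plot would be similar, but the plot would no longer show location or scale). I am not sure what you think would be clearer on a standardized plot. Can you clarify?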

— Glen_b
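The exchange above can be checked directly against what `qqnorm()` returns. The sketch below uses made-up gamma-distributed measurements (an assumption purely for illustration) and compares the default plot coordinates with those of a standardized copy of the same data.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical raw measurements on an arbitrary scale (assumed for illustration).
y <- rgamma(21, shape = 2, rate = 0.02)

qq_raw <- qqnorm(y, plot.it = FALSE)
qq_std <- qqnorm(as.numeric(scale(y)), plot.it = FALSE)

# The y coordinates of the default plot are simply the data values ...
same_values <- isTRUE(all.equal(qq_raw$y, y))

# ... and the x coordinates are standard normal quantiles at ppoints(n).
same_quantiles <- isTRUE(all.equal(sort(qq_raw$x), qnorm(ppoints(length(y)))))

# Standardizing relabels the y-axis but leaves the shape unchanged:
# identical x coordinates and identical correlation between x and y.
same_x <- isTRUE(all.equal(qq_raw$x, qq_std$x))
same_shape <- isTRUE(all.equal(cor(qq_raw$x, qq_raw$y), cor(qq_std$x, qq_std$y)))

print(c(same_values, same_quantiles, same_x, same_shape))
print(range(qq_raw$y))  # in data units
print(range(qq_std$y))  # in standard-deviation units
```

So the default plot keeps the data's own units on the y-axis, and standardizing would change only the axis labels, at the cost of hiding location and scale.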

2

@ZiyaoWei No, the uniform is actually very light-tailed. Everything is within 2 MADs of the center. The first paragraph of this answer gives a clear, general way to think about what 'heavy-tailed' means.

— Glen_b
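The "2 MADs" remark is easy to verify. In this small sketch an evenly spaced grid stands in for an idealized uniform sample (an assumption made to avoid sampling noise), and the MAD is taken without R's default rescaling constant.

```r
# An evenly spaced grid on [0, 1] stands in for an idealized uniform sample.
u <- seq(0, 1, length.out = 1001)

# Unscaled median absolute deviation (constant = 1 switches off the default
# rescaling factor of 1.4826 used by mad()).
center <- median(u)
mad_u <- mad(u, constant = 1)

# Largest distance from the center, measured in MADs.
max_dev_in_mads <- max(abs(u - center)) / mad_u
print(max_dev_in_mads)

# For comparison: the 0.999 quantile of a standard normal, in units of the
# normal's own unscaled MAD, which equals qnorm(0.75).
normal_dev_in_mads <- qnorm(0.999) / qnorm(0.75)
print(normal_dev_in_mads)
```

The uniform never gets further than 2 MADs from its center, whereas a normal distribution already places its 0.999 quantile more than 4 MADs out.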

63

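I made a Shiny app to help interpret normal QQ plots. Try this link.

In this app you can adjust the skewness, tailedness (kurtosis) and modality of the data and see how the histogram and the QQ plot change. Conversely, you can use it the other way round: start from a given QQ plot pattern and check what skewness, etc., produces it.

For further details, see the documentation in the app.

I realized that I don't have enough free capacity to keep this app available online. As requested, I provide all three code chunks here: sample.R, server.R and ui.R. Anyone interested in running the app can load these files into RStudio and run it on their own PC.

The sample.R file: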

# Compute the positive part of a real number x, which is $\max(x, 0)$.
positive_part <- function(x) {ifelse(x > 0, x, 0)}

# This function generates n data points from some unimodal population.
# Input: ----------------------------------------------------
# n: sample size;
# mu: the mode of the population, default value is 0.
# skewness: the parameter that reflects the skewness of the distribution, note it is not
#           the exact skewness defined in statistics textbook, the default value is 0.
# tailedness: the parameter that reflects the tailedness of the distribution, note it is
#             not the exact kurtosis defined in textbook, the default value is 0.

# When all arguments take their default values, the data will be generated from standard 
# normal distribution.

random_sample <- function(n, mu = 0, skewness = 0, tailedness = 0){
  sigma = 1

  # The sampling scheme resembles the rejection sampling. For each step, an initial data point
  # was proposed, and it will be rejected or accepted based on the weights determined by the
  # skewness and tailedness of input. 
  reject_skewness <- function(x){
      scale = 1
      # if `skewness` > 0 (means data are right-skewed), then small values of x will be rejected
      # with higher probability.
      l <- exp(-scale * skewness * x)
      l/(1 + l)
  }

  reject_tailedness <- function(x){
      scale = 1
      # if `tailedness` < 0 (means data are lightly-tailed), then big values of x will be rejected with
      # higher probability.
      l <- exp(-scale * tailedness * abs(x))
      l/(1 + l)
  }

  # w is a second control on tailedness: the larger w is, the more weight the proposal puts
  # on the t-distributed component, so the data become more heavy-tailed.
  w = positive_part((1 - exp(-0.5 * tailedness)))/(1 + exp(-0.5 * tailedness))

  filter <- function(x){
    # A proposed data point is accepted only if it satisfies the following condition,
    # which is how the skewness and tailedness of the data are controlled. (For example,
    # a proposed data point is rejected more frequently where the rejection weights for
    # skewness and tailedness are larger.)
    accept <- runif(length(x)) > reject_tailedness(x) * reject_skewness(x)
    x[accept]
  }

  result <- filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5)))
  # Keep generating data points until the length of data vector reaches n.
  while (length(result) < n) {
    result <- c(result, filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5))))
  }
  result[1:n]
}

multimodal <- function(n, Mu, skewness = 0, tailedness = 0) {
  # Deal with the bimodal case.
  mumu <- as.numeric(Mu %*% rmultinom(n, 1, rep(1, length(Mu))))
  mumu + random_sample(n, skewness = skewness, tailedness = tailedness)
}

The server.R file:

library(shiny)
# Need 'ggplot2' package to get a better aesthetic effect.
library(ggplot2)

# The 'sample.R' source code is used to generate data to be plotted, based on the input skewness, 
# tailedness and modality. For more information, see the source code in 'sample.R' code.
source("sample.R")

shinyServer(function(input, output) {
  # We generate 10000 data points from the distribution which reflects the specification of skewness,
  # tailedness and modality. 
  n = 10000

  # 'scale' is a parameter that controls the skewness and tailedness.
  scale = 1000

  # The `reactive` function speeds up the app: the data are generated only once and reused
  # for both plots. The generated sample is stored in the `data` object to be called later.
  data <- reactive({
    # For `Unimodal` choice, we fix the mode at 0.
    if (input$modality == "Unimodal") {mu = 0}

    # For `Bimodal` choice, we fix the two modes at -2 and 2.
    if (input$modality == "Bimodal") {mu = c(-2, 2)}

    # Details will be explained in `sample.R` file.
    sample1 <- multimodal(n, mu, skewness = scale * input$skewness, tailedness = scale * input$kurtosis)
    data.frame(x = sample1)})

  output$histogram <- renderPlot({
    # Plot the histogram.
    ggplot(data(), aes(x = x)) + 
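      # after_stat(density) replaces the deprecated ..density.. notation (needs ggplot2 >= 3.3.0).
      geom_histogram(aes(y = after_stat(density)), binwidth = .5, colour = "black", fill = "white") + 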
      xlim(-6, 6) +
      # Overlay the density curve.
      geom_density(alpha = .5, fill = "blue") + ggtitle("Histogram of Data") + 
      theme(plot.title = element_text(lineheight = .8, face = "bold"))
  })

  output$qqplot <- renderPlot({
    # Plot the QQ plot.
    ggplot(data(), aes(sample = x)) + stat_qq() + ggtitle("QQplot of Data") + 
      theme(plot.title = element_text(lineheight=.8, face = "bold"))
    })
})

Finally, the ui.R file:

library(shiny)

# Define UI for application that helps students interpret the pattern of (normal) QQ plots. 
# By using this app, we can show students the different patterns of QQ plots (and the histograms,
# for completeness) for different type of data distributions. For example, left skewed heavy tailed
# data, etc. 

# This app can be (and is encouraged to be) used in a reversed way, namely, show the QQ plot to the 
# students first, then tell them based on the pattern of the QQ plot, the data is right skewed, bimodal,
# heavy-tailed, etc.


shinyUI(fluidPage(
  # Application title
  titlePanel("Interpreting Normal QQ Plots"),

  sidebarLayout(
    sidebarPanel(
      # The first slider can control the skewness of input data. "-1" indicates the most left-skewed 
      # case while "1" indicates the most right-skewed case.
      sliderInput("skewness", "Skewness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # The second slider controls the tailedness of the input data. "-1" indicates the most
      # light-tailed case while "1" indicates the most heavy-tailed case.
      sliderInput("kurtosis", "Tailedness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # This selectbox allows user to choose the number of modes of data, two options are provided:
      # "Unimodal" and "Bimodal".
      selectInput("modality", label = "Modality", 
                  choices = c("Unimodal" = "Unimodal", "Bimodal" = "Bimodal"),
                  selected = "Unimodal"),
      br(),
      # The following helper information will be shown on the user interface to give necessary
      # information to help users understand sliders.
      helpText(p("The skewness of data is controlled by moving the", strong("Skewness"), "slider,", 
               "the left side means left skewed while the right side means right skewed."), 
               p("The tailedness of data is controlled by moving the", strong("Tailedness"), "slider,", 
                 "the left side means light tailed while the right side means heavy tailed."),
               p("The modality of data is controlled by selecting the modality from", strong("Modality"),
                 "select box.")
               )
  ),

  # The main panel outputs two plots. One plot is the histogram of the data (with the nonparametric
  # density curve overlaid); for a better visualization, the x-axis is restricted to the range -6 to 6,
  # so part of the data will not be shown when heavy-tailed input is chosen. The other plot is the
  # QQ plot of the data; by convention, the x-axis shows the theoretical quantiles of the standard
  # normal distribution and the y-axis shows the sample quantiles of the data.
  mainPanel(
    plotOutput("histogram"),
    plotOutput("qqplot")
  )
)
)
)

— Zhanxiong

1

It looks like the Shiny app has reached its capacity limit. Maybe you could just provide the code?

— rsoren

1

@rsoren I have added it; I hope it helps, and I look forward to your suggestions.

— Zhanxiong

Very nice! I would suggest also adding options to change the sample size and the degree of randomness.

— Itamar

The link is unavailable! @Zhanxiong

— Alireza Sanaee

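The link seems to stop responding because only a limited number of clicks is allowed each month. That is why I pasted the source code here (at the request of other users who ran into the same problem as you). You can paste it into RStudio and run it on your own PC, after loading the required packages.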

— Zhanxiong

6

A very helpful and intuitive explanation is given by Prof. Philippe Rigollet in the MIT MOOC course 18.650 Statistics for Applications, Fall 2016; see the video at 45 minutes:

https://www.youtube.com/watch?v=vMaKx9fmJHE

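I have crudely copied his diagram, which I keep in my notes because I find it very useful.

In Example 1, in the top-left diagram, we can see that in the right tail the empirical (sample) quantile is smaller than the theoretical quantile at the same level $\alpha$:

$Q_e < Q_t$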

— Xavier Bourret Sicotte

3

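Since this thread has been deemed the definitive "how to interpret the normal q-q plot" StackExchange post, I would like to point readers to a nice, precise mathematical relationship between the normal q-q plot and the excess kurtosis statistic.

Here it is: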

https://stats.stackexchange.com/a/354076/102879

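A brief (and oversimplified) summary follows; see the link for more precise mathematical statements. You can actually see excess kurtosis in the normal q-q plot as the average distance between the data quantiles and the corresponding theoretical normal quantiles, weighted by the distance of the data from the mean. Thus, when the absolute values in the tails of the q-q plot generally deviate greatly, in the extreme direction, from the expected normal values, you have positive excess kurtosis.

Because kurtosis is the average of these deviations weighted by distance from the mean, values near the center of the q-q plot have little impact on kurtosis. Hence excess kurtosis is not related to the center of the distribution, where the "peak" is. Rather, excess kurtosis is determined almost entirely by how the tails of the data distribution compare with those of the normal distribution.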
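The claim that the center contributes little can be illustrated numerically. This is a rough sketch, not the exact relationship from the linked answer: it uses the plain moment definition of kurtosis, and the heavy-tailed example (exact quantiles of a Student t distribution with 5 degrees of freedom) and the cut-offs at 1 and 2 standard deviations are assumptions chosen for illustration.

```r
# Idealized heavy-tailed "sample": exact quantiles of a Student t distribution
# with 5 degrees of freedom (an assumed example), standardized to mean 0, sd 1.
p <- ppoints(10001)
y <- qt(p, df = 5)
z <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))

# Moment-based excess kurtosis: the average of z^4 minus 3 (the normal value).
excess_kurtosis <- mean(z^4) - 3

# Split the total of z^4 into the part from the center and the part from the tails.
share_center <- sum(z[abs(z) <= 1]^4) / sum(z^4)
share_tails  <- sum(z[abs(z) > 2]^4) / sum(z^4)

cat("excess kurtosis:", round(excess_kurtosis, 2), "\n")
cat("share of sum(z^4) from |z| <= 1:", round(share_center, 3), "\n")
cat("share of sum(z^4) from |z| > 2: ", round(share_tails, 3), "\n")
```

Most observations lie within one standard deviation of the mean, yet they account for only a small fraction of the fourth-moment sum; the comparatively few points beyond two standard deviations account for most of it.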

— Peter Westfall