How to interpret a QQ plot

173

I am working with a small dataset (21 observations) and have the following normal QQ plot in R:

[image: normal QQ plot of the data]

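Seeing that the plot does not support normality, what can I infer about the underlying distribution? It seems to me that a distribution skewed to the right would be a better fit. Is that right? Also, what other conclusions can be drawn from the data?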

r data-visualization inference qq-plot

— JohnK

9

You are correct that it indicates right skewness. I will try to locate some of the posts on interpreting QQ plots.

— Glen_b '14

3

You don't have to draw conclusions; you need to decide what to try next. Here I would consider taking square roots or logarithms of the data.

— Nick Cox '14
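The suggestion above (try square roots or logs and look again) can be sketched as follows. This is only an illustration on simulated right-skewed data, since the asker's 21 observations are not available; the helper `qq_cor` and the lognormal parameters are assumptions made for the example.

```r
# Simulated right-skewed data (an assumption; NOT the asker's 21 observations).
set.seed(42)
y <- rlnorm(200, meanlog = 4, sdlog = 0.8)

# Correlation between the data and their normal scores:
# the closer to 1, the straighter the normal QQ plot.
qq_cor <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}

straightness <- c(raw = qq_cor(y), sqrt = qq_cor(sqrt(y)), log = qq_cor(log(y)))
print(round(straightness, 3))

# Draw the three QQ plots side by side when run interactively.
if (interactive()) {
  op <- par(mfrow = c(1, 3))
  qqnorm(y, main = "raw");        qqline(y)
  qqnorm(sqrt(y), main = "sqrt"); qqline(sqrt(y))
  qqnorm(log(y), main = "log");   qqline(log(y))
  par(op)
}
```

Whichever transformation gives the straightest plot is a candidate for the next step, not a conclusion about the data.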

11

$(-1.5,2)$, $(1.5,220)$, $(0,70)$

3

@Glen_b The answer to my question has some information: stats.stackexchange.com/questions/71065/… and the link in that answer points to another good source: stats.stackexchange.com/questions/52212/qq-plot-does-not-match-histogram

— tpg2114 '14

What about this one? Does this QQ plot show data that are not normally distributed?! [image: QQ plot]

— David

293

If the values lie along a line, the distribution has the same shape (up to location and scale) as the theoretical distribution we have supposed.
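A minimal sketch of the "up to location and scale" point, using simulated data (the mean of 50 and standard deviation of 10 are arbitrary choices for illustration): a normal sample that is neither centred nor standardized still falls on a straight line, with intercept near the mean and slope near the standard deviation.

```r
set.seed(1)
y <- rnorm(5000, mean = 50, sd = 10)   # normal, but not standard normal

# qqnorm() returns the theoretical quantiles (x) and the data (y).
qq <- qqnorm(y, plot.it = FALSE)

# Least-squares line through the QQ points.
fit <- lm(qq$y ~ qq$x)
coef_qq <- unname(coef(fit))           # c(intercept, slope)
r_qq <- cor(qq$x, qq$y)

print(round(c(intercept = coef_qq[1], slope = coef_qq[2], cor = r_qq), 3))
```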

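Local behaviour: with the sorted sample values on the y-axis and the (approximate) expected quantiles on the x-axis, we can see how the values in a given section of the plot depart locally from the overall linear trend by checking whether they are more or less concentrated than the theoretical distribution would suppose in that section.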

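As we see, less concentrated points increase more rapidly, and more concentrated points increase less rapidly, than the overall linear relation would suggest. In the extreme cases these correspond to a gap in the density of the sample (shown as a near-vertical jump) or a spike of constant values (values aligned horizontally). This lets us spot a heavy or a light tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.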
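The local reading described above can be checked numerically. The sketch below is an illustration on a simulated exponential (right-skewed) sample: it measures how far the QQ points sit from a reference line through the quartiles, which is how `qqline()` draws its line by default. The variable names are made up for the example.

```r
set.seed(2)
y <- rexp(2000)                        # right-skewed sample
qq <- qqnorm(y, plot.it = FALSE)

# Reference line through the first and third quartiles (the qqline() default).
slope <- unname(diff(quantile(y, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(y, 0.25)) - slope * qnorm(0.25)

# Vertical distance of each point from the reference line.
resid <- qq$y - (intercept + slope * qq$x)

lower_tail <- resid[which.min(qq$x)]       # smallest observation
middle     <- resid[which.min(abs(qq$x))]  # observation nearest the median
upper_tail <- resid[which.max(qq$x)]       # largest observation

print(round(c(lower = lower_tail, middle = middle, upper = upper_tail), 3))
```

Both ends lie above the line while the middle lies below it: the lower tail is more concentrated than a normal's and the upper tail is more spread out, which is the curved pattern associated with right skew.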

Overall appearance:

Here is what QQ plots look like (for particular choices of distribution) on average:

[image: typical QQ plots for several distributions]

But randomness tends to obscure things, especially with small samples:

[image: QQ plots of small samples, $n=21$]

[image: further QQ plots at $n=21$]

The suggestion here is also useful when deciding how much you should worry about a particular amount of curvature or wiggliness.
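One way to calibrate the eye (a sketch, and not necessarily what the linked suggestion proposes) is to draw several samples that really are normal, at the same sample size as the data, and look at how much their QQ plots wiggle. The six-panel layout and the seed are arbitrary choices.

```r
set.seed(3)
n <- 21                                # same size as the data in the question

# Six samples that are normal by construction.
samples <- replicate(6, rnorm(n), simplify = FALSE)

# Straightness of each QQ plot, measured by the correlation of its points.
qq_cor <- sapply(samples, function(y) {
  qq <- qqnorm(y, plot.it = FALSE)
  cor(qq$x, qq$y)
})
print(round(qq_cor, 3))

# Draw the six reference plots when run interactively.
if (interactive()) {
  op <- par(mfrow = c(2, 3))
  for (y in samples) { qqnorm(y); qqline(y) }
  par(op)
}
```

If the plot of the real data looks no stranger than these reference plots, its curvature may well be sampling noise.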

A guide better suited to interpretation in general would also include displays at both smaller and larger sample sizes.

— Glen_b

18

This is a very practical guide. Thank you for gathering all of this information.

— JohnK '14

4

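I understand that what matters here is the shape and type of deviation from linearity, but it still looks strange that both axes are labelled "... quantiles" while one axis runs 0.2, 0.4, 0.6 and the other runs -2, -1, 0, 1, 2. Also, it seems fine that some data points fall within the middle 40% of the theoretical distribution, but how can they be spread over only 3% of their own distribution, as the y-axis of the bottom-right plot shows?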

— Macond '14

2

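@Macond The y-axis shows the raw values of the data, not quantiles. I agree that standardizing the y-axis would make things clearer, and I don't know why R doesn't do this by default. Can someone shed light on this?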

— Gordon Gustafson

4

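@GordonGustafson Regarding your first comment to Macond, there is a very good reason not to standardize the data: a QQ plot is a display of the data! It is designed to show the information in the data you supply to the function (standardizing would make about as much sense as standardizing the data you supply to a boxplot or a histogram). Once you transform the data, the plot is no longer a display of the data: the shape in the plot may be similar, but the location and scale are no longer shown. I'm not sure what you think would be clearer in a standardized plot. Can you clarify?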

— Glen_b
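The point in the comment above, that standardizing keeps the shape but discards location and scale, can be seen in a short sketch. The gamma sample is simulated purely for illustration.

```r
set.seed(5)
y <- rgamma(21, shape = 2, rate = 0.1)     # simulated data on its own scale

qq_raw <- qqnorm(y, plot.it = FALSE)
qq_std <- qqnorm(as.numeric(scale(y)), plot.it = FALSE)

# Standardizing is a monotone linear change, so the theoretical quantiles and
# the straightness of the plot are unchanged ...
same_x  <- isTRUE(all.equal(qq_raw$x, qq_std$x))
cor_raw <- cor(qq_raw$x, qq_raw$y)
cor_std <- cor(qq_std$x, qq_std$y)

# ... but the y-axis no longer carries the data's location and scale.
range_raw <- range(qq_raw$y)
range_std <- range(qq_std$y)

print(c(cor_raw = cor_raw, cor_std = cor_std))
print(rbind(raw = range_raw, standardized = range_std))
```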

2

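@ZiyaoWei No, the uniform is so light-tailed that it effectively has no tails at all: everything lies within 2 MADs of the center. The first paragraph of this answer gives a clear, general way to think about what "heavier tails" means.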

— Glen_b

63

I made a Shiny app to help with interpreting normal QQ plots. Try this link.

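In this app you can adjust the skewness, tailedness (kurtosis) and modality of the data and see how the histogram and the QQ plot change. Conversely, you can use it the other way round: start from the pattern in a QQ plot and work out what the skewness and so on should be.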

For further details, see its documentation.

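I realized that I don't have enough free capacity to host this app online. As requested, I am providing all three code chunks here: sample.R, server.R and ui.R. Anyone interested in running the app can load these files into RStudio and run it on their own PC.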

The sample.R file:

# Compute the positive part of a real number x, which is $\max(x, 0)$.
positive_part <- function(x) {ifelse(x > 0, x, 0)}

# This function generates n data points from some unimodal population.
# Input: ----------------------------------------------------
# n: sample size;
# mu: the mode of the population, default value is 0.
# skewness: the parameter that reflects the skewness of the distribution, note it is not
#           the exact skewness defined in statistics textbook, the default value is 0.
# tailedness: the parameter that reflects the tailedness of the distribution, note it is
#             not the exact kurtosis defined in textbook, the default value is 0.

# When all arguments take their default values, the data will be generated from standard 
# normal distribution.

random_sample <- function(n, mu = 0, skewness = 0, tailedness = 0){
  sigma = 1

  # The sampling scheme resembles rejection sampling. At each step a batch of candidate points
  # is proposed, and each point is accepted or rejected with a probability determined by the
  # `skewness` and `tailedness` inputs.
  reject_skewness <- function(x){
      scale = 1
      # if `skewness` > 0 (means data are right-skewed), then small values of x will be rejected
      # with higher probability.
      l <- exp(-scale * skewness * x)
      l/(1 + l)
  }

  reject_tailedness <- function(x){
      scale = 1
      # if `tailedness` < 0 (means data are lightly-tailed), then big values of x will be rejected with
      # higher probability.
      l <- exp(-scale * tailedness * abs(x))
      l/(1 + l)
  }

  # w is a second mechanism for controlling tailedness: the larger w is, the more weight the
  # t(5) draw gets in the proposal and the heavier the tails. w is 0 when tailedness <= 0.
  w = positive_part((1 - exp(-0.5 * tailedness)))/(1 + exp(-0.5 * tailedness))

  filter <- function(x){
    # A proposed point is kept only if it satisfies the condition below, which is how the
    # skewness and tailedness of the data are controlled: the larger the product of the two
    # rejection weights at a point, the more often that point is rejected.
    accept <- runif(length(x)) > reject_tailedness(x) * reject_skewness(x)
    x[accept]
  }

  result <- filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5)))
  # Keep generating data points until the length of data vector reaches n.
  while (length(result) < n) {
    result <- c(result, filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5))))
  }
  result[1:n]
}

multimodal <- function(n, Mu, skewness = 0, tailedness = 0) {
  # Deal with the bimodal case.
  mumu <- as.numeric(Mu %*% rmultinom(n, 1, rep(1, length(Mu))))
  mumu + random_sample(n, skewness = skewness, tailedness = tailedness)
}

The server.R file:

library(shiny)
# Need 'ggplot2' package to get a better aesthetic effect.
library(ggplot2)

# The 'sample.R' source code is used to generate data to be plotted, based on the input skewness, 
# tailedness and modality. For more information, see the source code in 'sample.R' code.
source("sample.R")

shinyServer(function(input, output) {
  # We generate 10000 data points from the distribution which reflects the specification of skewness,
  # tailedness and modality. 
  n = 10000

  # 'scale' is a parameter that controls the skewness and tailedness.
  scale = 1000

  # `reactive` caches the generated sample, so the data are generated once and shared by both
  # plots. The sample is stored in `data` and retrieved later by calling `data()`.
  data <- reactive({
    # For `Unimodal` choice, we fix the mode at 0.
    if (input$modality == "Unimodal") {mu = 0}

    # For `Bimodal` choice, we fix the two modes at -2 and 2.
    if (input$modality == "Bimodal") {mu = c(-2, 2)}

    # Details will be explained in `sample.R` file.
    sample1 <- multimodal(n, mu, skewness = scale * input$skewness, tailedness = scale * input$kurtosis)
    data.frame(x = sample1)})

  output$histogram <- renderPlot({
    # Plot the histogram.
    ggplot(data(), aes(x = x)) + 
      # `after_stat(density)` replaces the deprecated `..density..` notation (ggplot2 >= 3.3.0).
      geom_histogram(aes(y = after_stat(density)), binwidth = .5, colour = "black", fill = "white") + 
      xlim(-6, 6) +
      # Overlay the density curve.
      geom_density(alpha = .5, fill = "blue") + ggtitle("Histogram of Data") + 
      theme(plot.title = element_text(lineheight = .8, face = "bold"))
  })

  output$qqplot <- renderPlot({
    # Plot the QQ plot.
    ggplot(data(), aes(sample = x)) + stat_qq() + ggtitle("QQplot of Data") + 
      theme(plot.title = element_text(lineheight=.8, face = "bold"))
    })
})

Finally, the ui.R file:

library(shiny)

# Define UI for application that helps students interpret the pattern of (normal) QQ plots. 
# By using this app, we can show students the different patterns of QQ plots (and the histograms,
# for completeness) for different type of data distributions. For example, left skewed heavy tailed
# data, etc. 

# This app can be (and is encouraged to be) used in a reversed way, namely, show the QQ plot to the 
# students first, then tell them based on the pattern of the QQ plot, the data is right skewed, bimodal,
# heavy-tailed, etc.


shinyUI(fluidPage(
  # Application title
  titlePanel("Interpreting Normal QQ Plots"),

  sidebarLayout(
    sidebarPanel(
      # The first slider can control the skewness of input data. "-1" indicates the most left-skewed 
      # case while "1" indicates the most right-skewed case.
      sliderInput("skewness", "Skewness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # The second slider controls the tailedness of the input data. "-1" indicates the
      # lightest-tailed case while "1" indicates the heaviest-tailed case.
      sliderInput("kurtosis", "Tailedness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # This selectbox allows user to choose the number of modes of data, two options are provided:
      # "Unimodal" and "Bimodal".
      selectInput("modality", label = "Modality", 
                  choices = c("Unimodal" = "Unimodal", "Bimodal" = "Bimodal"),
                  selected = "Unimodal"),
      br(),
      # The following helper information will be shown on the user interface to give necessary
      # information to help users understand sliders.
      helpText(p("The skewness of data is controlled by moving the", strong("Skewness"), "slider,", 
               "the left side means left skewed while the right side means right skewed."), 
               p("The tailedness of data is controlled by moving the", strong("Tailedness"), "slider,", 
                 "the left side means light tailed while the right side means heavy tailed."),
               p("The modality of data is controlled by selecting the modality from", strong("Modality"),
                 "select box.")
               )
  ),

  # The main panel outputs two plots. One is the histogram of the data (with a nonparametric
  # density curve overlaid). For better visualization the x-axis is restricted to the range
  # -6 to 6, so part of the data is not shown when heavy-tailed input is chosen. The other is
  # the QQ plot of the data; by convention, the x-axis shows the theoretical quantiles of the
  # standard normal distribution and the y-axis shows the sample quantiles of the data.
  mainPanel(
    plotOutput("histogram"),
    plotOutput("qqplot")
  )
)
)
)

— Zhanxiong

1

The Shiny app seems to have hit its capacity limit. Maybe you could just provide the code.

— rsoren

1

@rsoren Added. I hope it helps, and I look forward to hearing any suggestions.

— Zhanxiong

Very nice! I would also suggest adding options to change the sample size and the degree of randomness.

— Itamar

The link is not available!!!! @Zhanxiong

— Alireza Sanaee

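It seems the link stops responding after a limited number of clicks each month. That is why I pasted the source code here (at the request of other users who ran into the same problem as you). You can paste the files into RStudio and run them on your own PC (after loading the necessary packages).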

— Zhanxiong

6

A very helpful (and intuitive) explanation is given by Prof. Philippe Rigollet in the MIT MOOC course 18.650 Statistics for Applications, Fall 2016. See the video at around the 45-minute mark:

https://www.youtube.com/watch?v=vMaKx9fmJHE

I have roughly copied his diagram, which I keep in my notes because I find it very useful.

In example 1, in the top-left diagram, we can see that at the right tail the empirical (or sample) quantile is smaller than the theoretical quantile:

$Q_e < Q_t$ (for the same level $\alpha$)

— Xavier Bourret Sicotte

3

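Since this thread has come to be regarded as the definitive "how to interpret the normal q-q plot" StackExchange post, I would like to point readers to a nice, precise mathematical relationship between the normal q-q plot and the excess kurtosis statistic.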

Here it is:

https://stats.stackexchange.com/a/354076/102879

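A brief (and oversimplified) summary follows; see the link for more precise mathematical statements. You can actually see excess kurtosis in the normal q-q plot as the average distance between the data quantiles and the corresponding theoretical normal quantiles, weighted by the distance from the data to the mean. Thus, when the absolute values in the tails of the q-q plot generally deviate greatly from the expected normal values in the extreme directions, you have positive excess kurtosis.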

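Because kurtosis is the average of these deviations weighted by distance from the mean, values near the center of the q-q plot have little impact on kurtosis. Hence excess kurtosis is not related to the center of the distribution, where the "peak" is. Rather, excess kurtosis is almost entirely determined by how the tails of the data distribution compare with those of the normal distribution.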
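The sketch below does not reproduce the weighted-distance identity from the linked answer. Under that caveat, it only illustrates the two claims above on simulated data: a heavier-tailed sample has larger excess kurtosis, and observations near the center contribute very little to the statistic. The helper `excess_kurtosis` is written for this example.

```r
set.seed(4)

# Sample excess kurtosis: mean fourth power of the standardized data, minus 3.
excess_kurtosis <- function(y) {
  z <- (y - mean(y)) / sqrt(mean((y - mean(y))^2))
  mean(z^4) - 3
}

y_norm <- rnorm(10000)          # normal tails
y_t    <- rt(10000, df = 5)     # heavier tails than the normal

k_norm <- excess_kurtosis(y_norm)
k_t    <- excess_kurtosis(y_t)

# Share of the fourth-power sum that comes from points within one
# standard deviation of the mean (roughly two thirds of the data).
z <- (y_norm - mean(y_norm)) / sd(y_norm)
central_share <- sum(z[abs(z) < 1]^4) / sum(z^4)

print(round(c(normal = k_norm, t5 = k_t, central_share = central_share), 3))
```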

— Peter Westfall