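How to find GBM prediction intervals

I am working with GBM models using the caret package and am looking for a way to get prediction intervals for my predicted data. I have searched extensively, but have only come up with a few ideas for finding prediction intervals for random forests. Any help or R code is greatly appreciated!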

caret prediction-interval gbm

— CooperBuckeye05

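EDIT: As pointed out in the comments below, this gives confidence intervals for the predictions, not strictly prediction intervals. I was a bit trigger-happy with my reply and should have given this some extra thought.

Feel free to disregard this answer, or try to build on the code to get prediction intervals.

I have used the simple bootstrap to create prediction intervals a few times, but there may be other (better) ways.

Consider the oil data in the caret package, and suppose we want to generate partial dependencies and 95% intervals for the effect of Stearic on Palmitic. Below is just a simple example, but you can play around with it to suit your needs. Make sure your version of the gbm package supports the grid.levels argument of plot.gbm, which the code below relies on.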

library(caret)
data(oil)
#train the gbm using just the defaults.
tr <- train(Palmitic ~ ., method = "gbm", data = fattyAcids, verbose = FALSE)

#Points to be used for prediction. Use the quartiles here just for illustration
x.pt <- quantile(fattyAcids$Stearic, c(0.25, 0.5, 0.75))

#Generate the predictions, or in this case, the partial dependencies at the selected points. Replace plot() with predict() to get predictions
p <- plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)

#Bootstrap the process to get prediction intervals
library(boot)

bootfun <- function(data, indices) {
  data <- data[indices,]

  #As before, just the defaults in this example. Palmitic is the first variable, hence data[,1]
  tr <- train(data[,-1], data[,1], method = "gbm", verbose=FALSE)

  # ... other steps, e.g. using the oneSE rule etc ...
  #Return partial dependencies (or predictions)

  plot(tr$finalModel, "Stearic", grid.levels = x.pt, return.grid = TRUE)$y
  #or predict(tr, newdata = ...)
}

#Perform the bootstrap, this can be very time consuming. Just 99 replicates here but we usually want to do more, e.g. 500. Consider using the parallel option
b <- boot(data = fattyAcids, statistic = bootfun, R = 99)

#Get the 95% intervals from the boot object as the 2.5th and 97.5th percentiles
lims <- t(apply(b$t, 2, FUN = function(x) quantile(x, c(0.025, 0.975))))
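
The comment above suggests the parallel option of boot(). As a minimal sketch of what that could look like, the example below uses a cheap stand-in statistic on simulated data so that it runs quickly. The names toy, med_fun, par_type, b_par and lims_toy are hypothetical and not part of the original answer; in the GBM case the statistic would be bootfun.

```r
library(boot)

set.seed(42)
toy <- data.frame(x = rnorm(200))  # simulated stand-in data (assumption)

# Cheap statistic for illustration only; swap in bootfun for the GBM case
med_fun <- function(data, indices) median(data$x[indices])

# "multicore" relies on forking and is not available on Windows,
# so fall back to "snow" there
par_type <- if (.Platform$OS.type == "windows") "snow" else "multicore"

b_par <- boot(data = toy, statistic = med_fun, R = 500,
              parallel = par_type, ncpus = 2)

# Percentile limits, computed the same way as lims above
lims_toy <- quantile(b_par$t[, 1], c(0.025, 0.975))
lims_toy
```

With parallel = "snow" the workers are fresh R processes, so as far as I know a statistic that needs caret or gbm would have to load them itself, for example by calling library(caret) inside the function.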

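This is one way to do it that at least tries to account for the uncertainty arising from tuning the gbm. A similar approach has been used in http://onlinelibrary.wiley.com/doi/10.2193/2006-503/abstract

Sometimes the point estimate falls outside the interval, but modifying the tuning grid (i.e., increasing the number of trees and/or the depth) usually solves that.

Hope this helps!

— ErikL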
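
As a rough sketch of what widening the tuning grid might look like in caret, the example below passes an explicit tuneGrid with more trees and deeper trees than the defaults. The grid values, the 5-fold cross-validation, and the names gbm_grid, ctrl and tr_big are illustrative assumptions, not settings from the original answer.

```r
library(caret)
data(oil)  # loads fattyAcids

# Candidate values are illustrative only; caret's gbm method tunes these four parameters
gbm_grid <- expand.grid(
  n.trees = c(100, 500, 1000),
  interaction.depth = c(1, 3, 5),
  shrinkage = 0.1,
  n.minobsinnode = 10
)

# 5-fold CV keeps the example quick; the original code used caret's default resampling
ctrl <- trainControl(method = "cv", number = 5)

set.seed(1)
tr_big <- train(Palmitic ~ ., data = fattyAcids, method = "gbm",
                tuneGrid = gbm_grid, trControl = ctrl, verbose = FALSE)

tr_big$bestTune  # the selected combination
```

To have the intervals reflect the same search, the same tuneGrid and trControl would presumably be passed to the train() call inside bootfun as well.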

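If I understand your code correctly, what you have is a 95% confidence interval for the prediction. That is not the same as a 95% prediction interval, which adds in the residual (random) error.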

— 大井紅14
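
To illustrate the distinction made in this comment, here is one possible sketch that adds residual error on top of the bootstrap: each replicate refits the model, predicts at the new points, and then adds a residual drawn from that replicate's out-of-bootstrap rows. This is a heuristic that assumes roughly homoscedastic errors, not an established recipe from this thread. It calls gbm directly rather than through caret, uses simulated data, and the names dat, new_pts, fit_gbm, sims and pred_int are hypothetical.

```r
library(gbm)

set.seed(1)
# Simulated data (assumption; stands in for your own data set)
n <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 2 * dat$x1 + sin(6 * dat$x2) + rnorm(n, sd = 0.3)

# Points at which intervals are wanted
new_pts <- data.frame(x1 = c(0.25, 0.5, 0.75), x2 = 0.5)

# Fixed hyperparameters for brevity; tuning is skipped in this sketch
fit_gbm <- function(d) {
  gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
      n.trees = 200, interaction.depth = 2, shrinkage = 0.05,
      n.minobsinnode = 5, verbose = FALSE)
}

R <- 100
sims <- matrix(NA_real_, nrow = R, ncol = nrow(new_pts))
for (r in seq_len(R)) {
  idx <- sample(n, replace = TRUE)
  fit <- fit_gbm(dat[idx, ])

  # Residuals on rows left out of this bootstrap sample
  oob <- dat[-unique(idx), ]
  res <- oob$y - predict(fit, newdata = oob, n.trees = 200)

  # Model uncertainty (refit) plus residual noise (resampled residual)
  pred <- predict(fit, newdata = new_pts, n.trees = 200)
  sims[r, ] <- pred + sample(res, length(pred), replace = TRUE)
}

# Approximate 95% prediction limits, one row per new point
pred_int <- t(apply(sims, 2, quantile, probs = c(0.025, 0.975)))
pred_int
```

Dropping the resampled residual term from sims leaves only the spread of the refitted predictions, which corresponds to the confidence-style interval the comment describes.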

D'oh! You are right. I was a bit too quick with my reply. Thanks, I will edit my answer.

— ErikL 14

Thanks for the help! I am having trouble with the bootstrap function. I posted the problem at stats.stackexchange.com/questions/117329/…. I am not sure exactly how to set up the bootstrap function properly with my dataset.

— CooperBuckeye05 14

I don't think this is quite what I am looking for at this point, so I am still searching for an answer!

— CooperBuckeye05 14