New factor levels not present in the training data

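I am getting the error "New factor levels not present in the training data". However, I have checked the nlevels and class of every column in both the development data and the test data, and they are the same. Is there a plausible explanation?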

r machine-learning random-forest many-categories

— Alex
source

See also: stats.stackexchange.com/questions/298137/...

— kjetil b halvorsen

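RF handles factors by one-hot encoding them: it creates one new dummy column for every level of the factor variable. When the scoring data frame contains new or different factor levels, bad things happen.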

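If train and test lived in the same data structure at the time the factor was defined, there is no problem. The problem arises when the factor in the test data is defined separately.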

library("randomForest")

# Fit an RF on a few numerics and a factor. Give test set a new level.
N <- 100
df <- data.frame(num1 = rnorm(N), 
                 num2 = rnorm(N), 
                 fac = sample(letters[1:4], N, TRUE),
                 y = rnorm(N),
                 stringsAsFactors = FALSE)
df[100, "fac"] <- "a suffusion of yellow"
df$fac <- as.factor(df$fac)

train <- df[1:50, ]
test <- df[51:100, ]

rf <- randomForest(y ~ ., data=train)

# This is fine, even though the "yellow" level doesn't exist in train, RF
# is aware that it is a valid factor level
predict(rf, test)

# This is not fine. The factor level is introduced and RF can't know of it
test$fac <- as.character(test$fac)
test[50, "fac"] <- "toyota corolla"
test$fac <- as.factor(test$fac)
predict(rf, test)
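
The check described in the question (same class, same nlevels) can pass even when the sets of levels differ. The following is a minimal base R sketch of that situation; the two factors are made-up illustrative data, not the asker's columns.

```r
# Two factors with the same class and the same number of levels,
# but different level sets (illustrative data only).
train_fac <- factor(c("a", "b", "c", "d"))
test_fac  <- factor(c("a", "b", "c", "toyota corolla"))

same_class   <- identical(class(train_fac), class(test_fac))    # both "factor"
same_nlevels <- nlevels(train_fac) == nlevels(test_fac)         # both have 4 levels
same_levels  <- identical(levels(train_fac), levels(test_fac))  # the level sets differ

# Levels present in the scoring data but unknown to the training data
unseen <- setdiff(levels(test_fac), levels(train_fac))
print(unseen)
```

Comparing levels() directly, or using setdiff() as above, is therefore a more informative check than nlevels() alone.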

To work around this, change the levels of the scoring factor so that they match the training data.

# Can get around by relevelling the new factor. "toyota corolla" becomes NA
test$fac <- factor(test$fac, levels = levels(train$fac))
predict(rf, test)
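
Because releveling silently turns unseen values into NA, it may be worth counting how many values are affected before predicting. The sketch below assumes a hypothetical helper, align_factor_levels (not part of randomForest or base R), that applies the same releveling to every factor column and reports the number of values lost per column. The data frames are made up for illustration.

```r
# Hypothetical helper: align every factor column of `newdata` to the levels
# seen in `train`, and count how many values became NA as a result.
align_factor_levels <- function(train, newdata) {
  lost <- integer(0)
  for (col in names(train)) {
    if (is.factor(train[[col]]) && col %in% names(newdata)) {
      before <- is.na(newdata[[col]])
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
      # Values that were not NA before but are NA after releveling
      lost[col] <- sum(is.na(newdata[[col]]) & !before)
    }
  }
  list(data = newdata, lost = lost)
}

# Illustrative data: the scoring frame arrives separately, with one unseen value
train_demo <- data.frame(fac = factor(c("a", "b", "c")), x = 1:3)
score_demo <- data.frame(fac = c("a", "toyota corolla", "c"), x = 4:6,
                         stringsAsFactors = FALSE)

res <- align_factor_levels(train_demo, score_demo)
print(res$lost)
```

A non-zero count signals that some rows carry categories the model never saw, so their predictions deserve separate handling.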

— Dex Groves
source

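This is certainly a workaround, but I have reservations about the soundness of this approach, given that training and test data are supposed to exist completely separately from each other.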

— Tommyixi