Is this interpretation of sparsity accurate?

According to the documentation of the removeSparseTerms function from the tm package, this is what sparsity entails:

A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty (i.e., terms occurring 0 times in a document) elements. I.e., the resulting matrix contains only terms with a sparse factor of less than sparse.

So, is it a correct interpretation to say that if sparse is equal to .99, then we are removing terms that appear in at most 1% of the data?

r text-mining natural-language

— zthomas.nc

This question is better suited to Stackoverflow, where there are tags for tm and text mining.

— Ken Benoit

Yes, although your confusion here is understandable, since the term "sparsity" is hard to define clearly in this context.

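In the sense of the sparse argument to removeSparseTerms(), sparsity is a threshold on the proportion of documents in which a term does not occur, that is, one minus the term's relative document frequency. A term whose proportion of empty documents reaches that threshold is removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), fewer terms are removed as sparse approaches 1.0. (Note that sparse cannot take the values 0 or 1.0, only values in between.)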

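For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), only terms that are more sparse than 0.99 are removed. The exact interpretation for sparse = 0.99 is that, for term $j$, you retain every term for which $df_j > N * (1 - 0.99)$, where $N$ is the number of documents. In this case probably all terms will be retained (see the examples below).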

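Near the other extreme, if sparse = .01, then only terms that appear in (nearly) every document will be retained. (Of course this depends on the number of terms and the number of documents. In natural language, common words like "the" are likely to occur in every document and hence are never "sparse".)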
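
Stated compactly, and assuming the rule as described above (with $s$ used here only as shorthand for the sparse argument), term $j$ is kept exactly when

$$df_j > N\,(1 - s) \quad\Longleftrightarrow\quad 1 - \frac{df_j}{N} < s .$$

With $N = 101$ and $s = 0.99$ the cutoff is $101 \times 0.01 = 1.01$, so a term found in only one document is dropped, while a term found in two documents is kept.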

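Two examples with a sparsity threshold of 0.99, where the second term occurs in (first example) just under 0.01 of the documents and (second example) just over 0.01 of the documents: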

> # second term occurs in just 1 of 101 documents
> myTdm1 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,1), rep(0, 100)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm1, .99)
<<DocumentTermMatrix (documents: 101, terms: 1)>>
Non-/sparse entries: 101/0
Sparsity           : 0%
Maximal term length: 2
Weighting          : term frequency (tf)
> 
> # second term occurs in 2 of 101 documents
> myTdm2 <- as.DocumentTermMatrix(slam::as.simple_triplet_matrix(matrix(c(rep(1, 101), rep(1,2), rep(0, 99)), ncol=2)), 
+                                weighting = weightTf)
> removeSparseTerms(myTdm2, .99)
<<DocumentTermMatrix (documents: 101, terms: 2)>>
Non-/sparse entries: 103/99
Sparsity           : 49%
Maximal term length: 2
Weighting          : term frequency (tf)

Here are a few additional examples with actual text and terms:

> myText <- c("the quick brown furry fox jumped over a second furry brown fox",
              "the sparse brown furry matrix",
              "the quick matrix")

> require(tm)
> myVCorpus <- VCorpus(VectorSource(myText))
> myTdm <- DocumentTermMatrix(myVCorpus)
> as.matrix(myTdm)
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .01))
    Terms
Docs the
   1   1
   2   1
   3   1
> as.matrix(removeSparseTerms(myTdm, .99))
    Terms
Docs brown fox furry jumped matrix over quick second sparse the
   1     2   2     2      1      0    1     1      1      0   1
   2     1   0     1      0      1    0     0      0      1   1
   3     0   0     0      0      1    0     1      0      0   1
> as.matrix(removeSparseTerms(myTdm, .5))
    Terms
Docs brown furry matrix quick the
   1     2     2      0     1   1
   2     1     1      1     0   1
   3     0     0      1     1   1

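In the last example, with sparse = 0.5, a term is retained only if it occurs in more than 3 * (1 - 0.5) = 1.5 documents, that is, in at least two of the three documents.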
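
The three results above can be checked by hand. The following is a minimal sketch in base R that applies the rule $df_j > N * (1 - sparse)$ to a plain count matrix. Note that keep_terms is a hypothetical helper written for this illustration, not a tm function, and it is only assumed to mirror what removeSparseTerms() does.

```r
# Hypothetical helper (not part of tm): apply the retention rule
# df_j > N * (1 - sparse) to a documents-by-terms count matrix.
keep_terms <- function(m, sparse) {
  stopifnot(sparse > 0, sparse < 1)
  doc_freq <- colSums(m > 0)  # number of documents containing each term
  colnames(m)[doc_freq > nrow(m) * (1 - sparse)]
}

# Counts copied from the as.matrix(myTdm) output above
terms <- c("brown", "fox", "furry", "jumped", "matrix",
           "over", "quick", "second", "sparse", "the")
m <- matrix(c(2, 2, 2, 1, 0, 1, 1, 1, 0, 1,
              1, 0, 1, 0, 1, 0, 0, 0, 1, 1,
              0, 0, 0, 0, 1, 0, 1, 0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs = as.character(1:3), Terms = terms))

keep_terms(m, 0.01)  # "the"
keep_terms(m, 0.99)  # all ten terms
keep_terms(m, 0.5)   # "brown" "furry" "matrix" "quick" "the"
```

Working from document counts rather than proportions makes the strict inequality easy to see: with three documents and sparse = 0.5 the cutoff is 1.5, so a term needs to appear in two documents to survive.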

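Another way to trim terms from document-term matrices based on document frequency is the text analysis package quanteda. The same functionality there refers not to sparsity but directly to the document frequency of terms (as in tf-idf).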

> require(quanteda)
> myDfm <- dfm(myText, verbose = FALSE)
> docfreq(myDfm)
     a  brown    fox  furry jumped matrix   over  quick second sparse    the 
     1      2      1      2      1      2      1      2      1      1      3 
> trim(myDfm, minDoc = 2)
Features occurring in fewer than 2 documents: 6 
Document-feature matrix of: 3 documents, 5 features.
3 x 5 sparse Matrix of class "dfmSparse"
       features
docs    brown furry the matrix quick
  text1     2     2   1      0     1
  text2     1     1   1      1     0
  text3     0     0   1      1     1
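
The calls above reflect the quanteda interface at the time of writing. As a hedged sketch, and assuming a more recent quanteda release in which a dfm is built from tokens and trimming is done with dfm_trim() and its min_docfreq argument, the same trimming might look like this:

```r
library(quanteda)

myText <- c("the quick brown furry fox jumped over a second furry brown fox",
            "the sparse brown furry matrix",
            "the quick matrix")

# Tokenize first, then build the document-feature matrix
myDfm <- dfm(tokens(myText))

# Keep only features that occur in at least 2 documents
myTrimmed <- dfm_trim(myDfm, min_docfreq = 2)
featnames(myTrimmed)
```

The surviving features should be the same five terms as in the trimmed output above, although their column order may differ.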

This usage seems much more straightforward to me.

— Ken Benoit

Welcome to the site, Ken. Thanks for the great answer. I hope we'll get to see more of you.

— Glen_b-2015