バッファー内の単語の頻度の統計を取得する方法

言語学者や他の多くの科学者にとって、テキストに現れる単語の頻度を分析することは素晴らしいツールです。一部の商用テキストエディタと一部のWebサイトは、このツールを提供しています。

単語の頻度分析。単語を頻度の降順で並べ替えます。このテキストの例

Emacs Stack Exchange is a question and answer site for those using, extending, or developing the emacs text editor. It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help, we're working together to build a library of detailed answers to every question about emacs.

我々は持っています：

56 words
9: punctuation marks

3: ,
3: .
3: a
3: emacs
3: of
2: '
2: and
2: exchange
2: question
2: stack
2: the
2: to
1: &
1: about
1: answer
1: answers
1: as
1: build
1: built
1: by
1: detailed
1: developing
1: editor
1: every
1: extending
1: for
1: help
1: is
1: it
1: library
1: network
1: or
1: part
1: q
1: re
1: run
1: s
1: site
1: sites
1: text
1: those
1: together
1: using
1: we
1: with
1: working
1: you
1: your

そのような統計を提供するために使用できるパッケージがすでに存在するかどうか疑問に思っています。

PS私はすでに同じエスプリで別の質問をしてきましたが、すばらしい答えが得られました（できればもっと賛成したいと思います）。

elisp text

— 名前
ソース

さらに一歩進んで逆インデックスを作成するのが望ましいでしょうか？（逆インデックスを指定して単語の頻度を評価することは簡単ですが、インデックスを使用して、2つの単語がドキュメント内で接近して表示されるかどうかなど、他の興味深いものを見つけることができます）。私は最近Sphinxで遊んでいるので、Emacsに接続すると検索オプションが増えるでしょうか？

— wvxvw

@wvxvwはい、それは素晴らしいアイデアでしょう。私はSphinxの経験はありませんが、emacsで使用する方法を知っている場合は、それについて知っていただければ幸いです。

— 名前

回答:

出力の形式（org-mode table）は、質問のリンクから発想を得ています。

(require 'cl-lib)

(defvar punctuation-marks '(","
                            "."
                            "'"
                            "&"
                            "\"")
  "List of Punctuation Marks that you want to count.")

(defun count-raw-word-list (raw-word-list)
  (cl-loop with result = nil
           for elt in raw-word-list
           do (cl-incf (cdr (or (assoc elt result)
                             (first (push (cons elt 0) result)))))
           finally return (sort result
                                (lambda (a b) (string< (car a) (car b))))))

(defun word-stats ()
  (interactive)
  (let* ((words (split-string
                 (downcase (buffer-string))
                 (format "[ %s\f\t\n\r\v]+"
                         (mapconcat #'identity punctuation-marks ""))
                 t))
         (punctuation-marks (cl-remove-if-not
                             (lambda (elt) (member elt punctuation-marks))
                             (split-string (buffer-string) "" t )))
         (raw-word-list (append punctuation-marks words))
         (word-list (count-raw-word-list raw-word-list)))
    (with-current-buffer (get-buffer-create "*word-statistics*")
      (erase-buffer)
      (insert "| word | occurences |
               |-----------+------------|\n")

      (dolist (elt word-list)
        (insert (format "| '%s' | %d |\n" (car elt) (cdr elt))))

      (org-mode)
      (indent-region (point-min) (point-max))
      (goto-char 100)
      (org-cycle)
      (goto-char 79)
      (org-table-sort-lines nil ?N)))
  (pop-to-buffer "*word-statistics*"))

— Xuchunyang
ソース

だけword-statsでなく3つの形式があり、それらすべてを評価する必要があります。

— xuchunyang 2015年

なぜうまくいかないのか、私には手がかりがありません。ご覧のとおりpunctuation-marks、最初のフォームで既に定義されています。

— xuchunyang 2015年

動作することを確認しました。最初の段落をコピーしなかった可能性があります。

— 名前

ミニマリストのemacs環境（v24.5）では(require 'cl)、incrコマンドを使用するために追加する必要がありました。

— チャドウィック

@チャドウィック私はcl-incf代わりに使用するように答えを更新し、明示的にincf要求しましたcl-lib。

— xuchunyang 2015年

次のコードを評価し、テキストを含むバッファーにMx word-frequencyと入力します。単語の出現数とパーセント値を含むバッファーが表示されます。

(defvar word-frequency-table (make-hash-table :test 'equal :size 128))

(defvar word-frequency-buffer "*frequencies*"
  "Buffer where frequencies are displayed.")

(defun word-frequency-incr (word)
  (puthash word (1+ (gethash word word-frequency-table 0)) word-frequency-table))

(defun word-frequency-list (&optional reverse limit)
  "Returns a cons which car is sum of times any word was used
and cdr is a list of (word . count) pairs.  If REVERSE is nil
sorts it starting from the most used word; if it is 'no-sort
the list is not sorted; if it is non-nil and not 'no-sort sorts
it from the least used words.  If LIMIT is positive number
only words which were used more then LIMIT times will be
added.  If it is negative number only words which were used
less then -LIMIT times will be added."
  (let (l (sum 0))
    (maphash
     (cond
      ((or (not (numberp limit)) (= limit 0))
       (lambda (k v) (setq l (cons (cons k v) l) sum (+ sum v))))
      ((= limit -1) (lambda (k v) (setq sum (+ sum v))))
      ((< limit 0)
       (setq limit (- limit))
       (lambda (k v) (setq sum (+ sum v))
         (if (< v limit) (setq l (cons (cons k v) l)))))
      (t
       (lambda (k v) (setq sum (+ sum v))
         (if (> v limit) (setq l (cons (cons k v) l))))))
     word-frequency-table)
    (cons sum
          (cond
           ((equal reverse 'no-sort) l)
           (reverse (sort l (lambda (a b) (< (cdr a) (cdr b)))))
           (t       (sort l (lambda (a b) (> (cdr a) (cdr b)))))))))

(defun word-frequency-string (&optional reverse limit func)
  "Returns formatted string with word usage statistics.

If FUNC is nil each line contains number of times word was
called and the word; if it is t percentage usage is added in
the middle; if it is 'raw each line will contain number an
word separated by single line (with no formatting) otherwise
FUNC must be a function returning a string which will be called
for each entry with three arguments: number of times word was
called, percentage usage and the word.

See `word-frequency-list' for description of REVERSE and LIMIT
arguments."
  (let* ((list (word-frequency-list reverse)) (sum (car list)))
    (mapconcat
     (cond
      ((not func) (lambda (e) (format "%7d  %s\n" (cdr e) (car e))))
      ((equal func t)
       (lambda (e) (format "%7d  %6.2f%%  %03d %s\n"
                           (cdr e) 
               (/ (* 1e2 (cdr e)) sum) 
               (length (car e))
               (car e))))
      ((equal func 'raw) (lambda (e) (format "%d %s\n" (cdr e) (car e))))
      (t (lambda (e) (funcall func (cdr e) (/ (* 1e2 (cdr e)) sum) (car e)))))
     (cdr list) "")))

(defun word-frequency (&optional where reverse limit func)
  "Formats word usage statistics using
`word-frequency-string' function (see for description of
REVERSE, LIMIT and FUNC arguments) and:
- if WHERE is nil inserts it in th e
  or displays it in echo area if possible; else
- if WHERE is t inserts it in the current buffer; else
- if WHERE is an empty string inserts it into
  `word-frequency-buffer' buffer; else
- inserts it into buffer WHERE.

When called interactively behaves as if WHERE and LIMIT were nil,
FUNC was t and:
- with no prefix argument - REVERSE was nil;
- with universal or positive prefix arument - REVERSE was t;
- with negative prefix argument - REVERSE was 'no-sort."

  (interactive (list nil
                     (cond
                      ((not current-prefix-arg) nil)
                      ((> (prefix-numeric-value current-prefix-arg) 0))
                      (t 'no-sort))
                     nil t))
  (clrhash word-frequency-table)
  (word-frequency-process-buffer)
  (cond
   ((not where)
    (display-message-or-buffer (word-frequency-string reverse limit func)
                               word-frequency-buffer))
   ((equal where t)
    (insert (word-frequency-string reverse limit func)))
   (t
    (display-buffer
     (if (and (stringp where) (string= where ""))
         word-frequency-buffer where)
     (word-frequency-string reverse limit func)))))

(defun word-frequency-process-buffer ()
  (interactive)
  (let ((buffer (current-buffer))
        bounds
        beg
        end
        word)
    (save-excursion
      (goto-char (point-min))
      (while (re-search-forward "\\<[[:word:]]+\\>" nil t)
;;    (while (forward-word 1)
        (word-frequency-incr (downcase (match-string 0)))
;;      (setq bounds (bounds-of-thing-at-point 'word))
;;      (setq beg (car bounds))
;;      (setq end (cdr bounds))
;;      (setq word (downcase (buffer-substring-no-properties beg end)))
;;      (word-frequency-incr word)
        ))))

— セウェリン
ソース

私はあなたのコードがうまく機能することを確認します。句読点も数えるように変更することはできますか？

— 名前