Similarity String Comparison in Java

111

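I want to compare several strings to each other and find the ones that are the most similar. I was wondering whether there is any library, method, or best practice that would tell me which strings are more similar to other strings. For example:

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would return that the first pair is more similar than the second.

I guess I need some method such as: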

double similarityIndex(String s1, String s2)

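Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of a legacy system that handles tasks. Because the legacy system has a very limited field width, descriptions are abbreviated when values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries in the legacy system, so that I can get the generated keys. It has drawbacks, since the matches still have to be checked manually, but it would save a lot of work.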

java string-comparison

— Mario Ortegón

82

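Yes, there are many well-documented algorithms, such as:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
etc.

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so this links to the Internet Archive).

Also check these projects: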
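As a rough illustration of one of these measures, the Jaccard similarity can be computed over the sets of character bigrams of each string. The choice of bigrams, the class name `BigramJaccard`, and the handling of strings too short to have a bigram are assumptions made for this sketch, not part of any library mentioned here.

```java
import java.util.HashSet;
import java.util.Set;

final class BigramJaccard {

    // Collects the distinct two-character substrings of s.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard index: size of the intersection divided by size of the union (0..1).
    static double similarity(String s1, String s2) {
        Set<String> a = bigrams(s1.toLowerCase());
        Set<String> b = bigrams(s2.toLowerCase());
        if (a.isEmpty() && b.isEmpty()) {
            // Assumption: strings too short to have bigrams are compared directly.
            return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return intersection.size() / (double) unionSize;
    }

    public static void main(String[] args) {
        System.out.println(similarity("The quick fox jumped", "The fox jumped"));
        System.out.println(similarity("The quick fox jumped", "The fox"));
    }
}
```

With the strings from the question, the first pair scores higher than the second, which is the ordering the question asks for. Set-based measures like this ignore character order beyond the bigram level, so they behave differently from edit-distance measures.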

— DFA

18

+1 The simmetrics site doesn't seem to be active anymore. However, I found the code on sourceforge: sourceforge.net/projects/simmetrics. Thanks for the pointer.

— Michael Merchant

7

The "You can check this" link is broken.

— Kiril

1

That's why Michael Merchant posted the correct link above.

— emilyk 2014

2

The simmetrics jar on sourceforge is a bit outdated. github.com/mpkorstanje/simmetrics is the updated github page, with Maven artifacts.

— tom91136

To add to @MichaelMerchant's comment, the project is also available on github. It is not very active there either, but it is a bit more recent than the sourceforge version.

— Ghurdyl 2018

163

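The common way of calculating the similarity between two strings on a 0%-100% scale, as used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: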

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

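Computing the `editDistance()`:

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, and we'll use it in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

You can use Apache Commons Text's implementation of Levenshtein distance: apply(CharSequence left, CharSequence right)
Implement it on your own. Below you'll find an example implementation.

Working example:

See the online demo here.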

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
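For the use case in the question, where each abbreviated legacy description has to be matched to its closest MS Project entry, a score like the one above can be used to rank candidates. The following is only a sketch: the class `BestMatch` and the helper `bestMatch` are hypothetical names, the distance is a plain case-sensitive Levenshtein (unlike the lower-casing version above), and ties go to the first candidate seen.

```java
import java.util.Arrays;
import java.util.List;

final class BestMatch {

    // Two-row Levenshtein distance (case-sensitive).
    static int editDistance(String s1, String s2) {
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s1
        }
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i; // distance to the empty prefix of s2
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    // Same normalization as above: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String s1, String s2) {
        int longerLength = Math.max(s1.length(), s2.length());
        if (longerLength == 0) {
            return 1.0; // both strings are empty
        }
        return (longerLength - editDistance(s1, s2)) / (double) longerLength;
    }

    // Returns the candidate most similar to query, or null if there are no candidates.
    static String bestMatch(String query, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidates) {
            double score = similarity(query, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("The fox", "The fox jumped");
        // prints "The fox jumped"
        System.out.println(bestMatch("The quick fox jumped", candidates));
    }
}
```

Since the result still has to be checked manually, it may be more useful in practice to return the top few candidates with their scores rather than a single winner.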

— acdcjunior

11

The Levenshtein distance method is available in org.apache.commons.lang3.StringUtils.

— Cleankod 2014

@Cleankod It is now part of commons-text: commons.apache.org/proper/commons-text/javadocs/api-release/org/...

— Luiz

15

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

— user493744

11

You could use the Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

— Florian Fankhauser

2

Levenshtein is great for a few strings, but it will not scale to comparisons between a large number of strings.

— 2009

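I have used Levenshtein in Java with some success. I haven't compared huge lists, so there may be a performance hit. It is also a bit simplistic and could use some tweaking to raise the threshold for shorter words (3 or 4 characters, say), which tend to be seen as more similar than they should be (it's only 3 edits from cat to dog). Note that the edit distances suggested below are pretty much the same thing: Levenshtein is a particular implementation of edit distance.

— Rhubarb

Here is an article showing how to combine Levenshtein with an efficient SQL query: literatejava.com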

— Thomas W

10

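There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...

You can find explanations and Java implementations of these here: https://github.com/tdebatty/java-string-similarity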
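To show how one of these measures differs from plain Levenshtein, here is a minimal sketch of the restricted "optimal string alignment" variant of Damerau-Levenshtein, which counts a swap of two adjacent characters as a single edit. The class name `OsaDistance` is hypothetical and the code is written from the textbook recurrence; it is not the API of the library linked above.

```java
final class OsaDistance {

    // Optimal string alignment distance: insertions, deletions, substitutions,
    // and transpositions of two adjacent characters each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Adjacent transposition: "ab" versus "ba".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // prints 1: the swapped "89"/"98" counts as a single transposition
        System.out.println(distance("1234567890", "1234567980"));
    }
}
```

Plain Levenshtein gives 2 for the same pair, because it has to treat the swap as two substitutions. For data where swapped characters are a common typing error, the transposition-aware variant may rank candidates more sensibly.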

— Thibault Debatty

8

You can achieve this using the Apache Commons Java library. Take a look at these two functions within it:
- getLevenshteinDistance
- getFuzzyDistance

— noelicus

3

As of October 2017, the linked methods are deprecated. Use the LevenshteinDistance and FuzzyScore classes from the Commons Text library instead.

— vatbub

3

Theoretically, you can compare edit distances.

— Anton Gogolev

3

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

— Laurence Gonsalves

3

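Sounds like a plagiarism finder to me, if your string turns into a document. Maybe searching with that term will turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.

— duffymo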

3

Thanks to the first answerer. I think computeEditDistance(s1, s2) ends up being calculated twice. Because it is expensive, I decided to improve the performance of the code. So:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }
}
}

— Mohsen Abasi

0

You can also use the Z algorithm to find the similarity in a string. Click here: https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
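Assuming the linked post covers the HackerRank "String Similarity" challenge, as its URL suggests, "similarity" there means something different from the edit-distance scores above: it is the sum, over every suffix of a string, of the length of the prefix that suffix shares with the whole string. The Z algorithm computes those lengths in linear time. The class and method names below are hypothetical, and this is a sketch of the standard Z-function rather than the code from the linked post.

```java
final class ZAlgorithm {

    // z[i] is the length of the longest common prefix of s and s.substring(i).
    // By convention here, z[0] is the full length of s.
    static int[] zArray(String s) {
        int n = s.length();
        int[] z = new int[n];
        if (n == 0) {
            return z;
        }
        z[0] = n;
        int left = 0;
        int right = 0; // [left, right) is the rightmost segment known to match a prefix
        for (int i = 1; i < n; i++) {
            if (i < right) {
                // Reuse the value computed for the mirrored position inside the segment.
                z[i] = Math.min(right - i, z[i - left]);
            }
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++;
            }
            if (i + z[i] > right) {
                left = i;
                right = i + z[i];
            }
        }
        return z;
    }

    // Sum of the common-prefix lengths between s and each of its suffixes.
    static long suffixSimilaritySum(String s) {
        long sum = 0;
        for (int value : zArray(s)) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // prints 11 (6 + 0 + 3 + 0 + 1 + 1)
        System.out.println(suffixSimilaritySum("ababaa"));
    }
}
```

Because it only compares a string against its own suffixes, this does not directly answer the question of which of several strings is closest to another.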

— Asur Samuel
