Similarity string comparison in Java

111

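I want to compare several strings to each other and find the ones that are the most similar. I was wondering if there is any library, method or best practice that would tell me which strings are more similar to other strings. For example:

"The quick fox jumped" -> "The fox jumped"
"The quick fox jumped" -> "The fox"

This comparison would return that the first is more similar than the second.

I guess I need some method such as: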

double similarityIndex(String s1, String s2)

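Does something like this exist somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of an MS Project file to the output of some legacy system that handles tasks. Because the legacy system has a very limited field width, the descriptions are abbreviated when the values are added. I want a semi-automated way to find which entries from MS Project are similar to the entries on the system, so that I can get the generated keys. It has the drawback that the matches still have to be checked manually, but it would save a lot of work.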

java string-comparison

— Mario Ortegón

82

Yes, there are many well-documented algorithms, such as:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
etc.
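As a rough illustration of two of the measures listed above, here is a minimal sketch of the Jaccard similarity and Dice's coefficient. It assumes the strings are tokenized into lower-cased character bigrams, which is one common choice but not something the list itself specifies; the class and method names are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

final class BigramSimilarity {

  // Collects the distinct two-character substrings of s (lower-cased).
  static Set<String> bigrams(String s) {
    String t = s.toLowerCase();
    Set<String> result = new HashSet<>();
    for (int i = 0; i + 1 < t.length(); i++) {
      result.add(t.substring(i, i + 2));
    }
    return result;
  }

  // Number of bigrams that occur in both sets.
  static int intersectionSize(Set<String> a, Set<String> b) {
    Set<String> common = new HashSet<>(a);
    common.retainAll(b);
    return common.size();
  }

  // Jaccard similarity: size of the intersection divided by size of the union.
  static double jaccard(String s1, String s2) {
    Set<String> a = bigrams(s1), b = bigrams(s2);
    // Strings shorter than 2 characters have no bigrams; fall back to equality.
    if (a.isEmpty() && b.isEmpty()) return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
    int common = intersectionSize(a, b);
    return common / (double) (a.size() + b.size() - common);
  }

  // Dice's coefficient: twice the intersection divided by the sum of both set sizes.
  static double dice(String s1, String s2) {
    Set<String> a = bigrams(s1), b = bigrams(s2);
    if (a.isEmpty() && b.isEmpty()) return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
    return 2.0 * intersectionSize(a, b) / (a.size() + b.size());
  }

  public static void main(String[] args) {
    String q = "The quick fox jumped";
    System.out.println(dice(q, "The fox jumped")); // 0.8125
    System.out.println(dice(q, "The fox"));        // 0.48
  }
}
```

With this tokenization the question's example is ranked as expected: "The fox jumped" scores higher than "The fox". Set-based measures like these ignore character order beyond the bigram level, so they can behave quite differently from edit-distance measures on short strings.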

A good summary ("Sam's String Metrics") can be found here (the original link is dead, so it points to the Internet Archive).

Also check these projects:

— dfa

18

+1 The simmetrics site does not seem to be active anymore. However, I found the code on SourceForge: sourceforge.net/projects/simmetrics Thanks for the pointer.

— Michael Merchant

7

The "you can check this" link is broken.

— Kiril

1

That is why Michael Merchant posted the correct link above.

— emilyk, 2014

2

The simmetrics jar on SourceForge is a bit old; github.com/mpkorstanje/simmetrics is the updated GitHub page, with Maven artifacts.

— tom91136

To add to @MichaelMerchant's comment, the project is also available on GitHub. It is not very active either, but it is a bit more recent than SourceForge.

— Ghurdyl

163

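A common way of calculating the similarity between two strings on a 0%-100% scale, as used in many libraries, is to measure how much (in %) you would have to change the longer string to turn it into the shorter one: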

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below

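Computing `editDistance()`:

The editDistance() function above is expected to calculate the edit distance between the two strings. There are several implementations of this step, and each may suit a specific scenario better. The most common is the Levenshtein distance algorithm, which is used in the example below (for very large strings, other algorithms are likely to perform better).

Here are two options for calculating the edit distance:

You can use Apache Commons Text's implementation of Levenshtein distance: apply(CharSequence left, CharSequence right)
Implement it on your own. An example implementation is given below.

Working example:

See the online demo here.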

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

Output:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"
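To connect this back to the question of picking the most similar string out of several, the same ratio can be evaluated against every candidate and the highest score kept. The following is only a sketch under a few assumptions: the names BestMatch and mostSimilar are hypothetical, the Levenshtein routine is a plain two-row variant rather than the one above, and the comparison is case-sensitive (unlike editDistance() above, which lower-cases its input).

```java
import java.util.Arrays;
import java.util.List;

final class BestMatch {

  // Two-row Levenshtein distance: insert, delete and substitute each cost 1.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) prev[j] = j;
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        curr[j] = Math.min(substitution, Math.min(prev[j], curr[j - 1]) + 1);
      }
      int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
    }
    return prev[b.length()];
  }

  // Same ratio as in the answer: 1.0 means identical, 0.0 means nothing in common.
  static double similarity(String a, String b) {
    int longest = Math.max(a.length(), b.length());
    return longest == 0 ? 1.0 : (longest - levenshtein(a, b)) / (double) longest;
  }

  // Returns the candidate with the highest similarity to target,
  // or null if the list is empty. Ties keep the earliest candidate.
  static String mostSimilar(String target, List<String> candidates) {
    String best = null;
    double bestScore = -1.0;
    for (String candidate : candidates) {
      double score = similarity(target, candidate);
      if (score > bestScore) {
        bestScore = score;
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<String> candidates = Arrays.asList("The fox", "The fox jumped", "A lazy dog");
    System.out.println(mostSimilar("The quick fox jumped", candidates)); // The fox jumped
  }
}
```

This compares the target against every candidate, so the cost grows with the number of candidates times the product of the string lengths; for the semi-automated matching described in the question, the best score could also be checked against a threshold before a match is accepted.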

— acdcjunior

11

The Levenshtein distance method is available in org.apache.commons.lang3.StringUtils.

— Cleankod 2014-12-05

@Cleankod It is now part of commons-text: commons.apache.org/proper/commons-text/javadocs/api-release/org/...

— Luiz

15

I translated the Levenshtein distance algorithm into JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

— user493744

11

You could use the Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

— Florian Fankhauser

2

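Levenshtein is great for a few strings, but it will not scale to comparisons between a large number of strings.

— spender 2009-06-05

I have used Levenshtein in Java with some success. I have not run comparisons over huge lists, so there may be a performance hit. It is also a bit simple and could use some tweaking to raise the threshold for shorter words (3 or 4 characters, say), which tend to be seen as more similar than they should be (it is only 3 edits from cat to dog). Note that the edit distances suggested below are pretty much the same thing: Levenshtein is a particular implementation of edit distance.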

— Rhubarb

Here is an article showing how to combine Levenshtein with an efficient SQL query: literatejava.com/sql/fuzzy-string-search-sql

— Thomas W

10

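There are indeed a lot of string similarity measures out there:

Levenshtein edit distance;
Damerau-Levenshtein distance;
Jaro-Winkler similarity;
Longest Common Subsequence edit distance;
Q-Gram (Ukkonen);
n-Gram distance (Kondrak);
Jaccard index;
Sorensen-Dice coefficient;
Cosine similarity;
...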

Explanations and Java implementations of these can be found here: https://github.com/tdebatty/java-string-similarity
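To give a feel for how one of these differs from plain Levenshtein, below is a hedged sketch of the restricted Damerau-Levenshtein variant (often called optimal string alignment), which additionally counts a swap of two adjacent characters as a single edit. It is written from the textbook recurrence and is not taken from the linked library; the class name OsaDistance is invented for the example.

```java
final class OsaDistance {

  // Optimal string alignment distance: Levenshtein operations plus
  // transposition of two adjacent characters, each with cost 1.
  // Note: this restricted variant never edits a substring twice, so it can
  // return a larger value than the unrestricted Damerau-Levenshtein distance.
  static int distance(String a, String b) {
    int n = a.length(), m = b.length();
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i; // delete all of a's prefix
    for (int j = 0; j <= m; j++) d[0][j] = j; // insert all of b's prefix
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
            d[i - 1][j - 1] + cost);                    // match or substitution
        if (i > 1 && j > 1
            && a.charAt(i - 1) == b.charAt(j - 2)
            && a.charAt(i - 2) == b.charAt(j - 1)) {
          d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent transposition
        }
      }
    }
    return d[n][m];
  }

  public static void main(String[] args) {
    // Swapping "89" into "98" is one transposition here; plain Levenshtein needs 2 edits.
    System.out.println(distance("1234567890", "1234567980")); // 1
  }
}
```

Counting transpositions as one edit tends to matter for typo-style differences; for abbreviation-style differences, such as the truncated descriptions in the question, it usually gives the same result as Levenshtein.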

— Thibault Debatty

8

You can achieve this using the Apache Commons Java library. Take a look at these two functions within it:
- getLevenshteinDistance
- getFuzzyDistance

— noelicus

3

As of October 2017, the linked methods are deprecated. Use the LevenshteinDistance and FuzzyScore classes from the Commons Text library instead.

— vatbub

3

Theoretically, you can compare edit distances.

— Anton Gogolev

3

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

— Laurence Gonsalves

3

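If your strings turn into documents, this sounds like a plagiarism finder to me. Searching with that term may turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it is clean and easy to port.

— duffymo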

3

Thanks to the first answerer. I think there are two calculations of computeEditDistance(s1, s2). Because it consumed a lot of time, I decided to improve the performance of the code. So:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}

— Mohsen Abasi

0

You can also use the Z algorithm to find similarity in strings. Click here: https://teakrunch.com/2020/05/09/string-similarity-hackerrank-challenge/
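For reference, here is a minimal sketch of the standard Z-function, where z[i] is the length of the longest common prefix of the string and its suffix starting at i. It is written from the textbook definition and is not copied from the linked article. Summing the values gives the "string similarity" of the challenge named in the link, which, as far as I can tell, scores a string against its own suffixes. The names ZSimilarity and suffixSimilaritySum are made up for this example.

```java
final class ZSimilarity {

  // z[i] = length of the longest common prefix of s and s.substring(i).
  // By convention z[0] is set to s.length() here.
  static int[] zFunction(String s) {
    int n = s.length();
    int[] z = new int[n];
    if (n == 0) return z;
    z[0] = n;
    int left = 0, right = 0; // [left, right) is the rightmost prefix match seen so far
    for (int i = 1; i < n; i++) {
      if (i < right) {
        // Reuse the value computed for the mirrored position inside the window.
        z[i] = Math.min(right - i, z[i - left]);
      }
      // Extend the match character by character.
      while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
        z[i]++;
      }
      if (i + z[i] > right) {
        left = i;
        right = i + z[i];
      }
    }
    return z;
  }

  // Sum of the prefix-match lengths of s against each of its suffixes.
  static long suffixSimilaritySum(String s) {
    long sum = 0;
    for (int value : zFunction(s)) sum += value;
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(suffixSimilaritySum("ababaa")); // 11
  }
}
```

Note that this is a prefix-matching measure that runs in linear time; it does not tolerate insertions or substitutions the way an edit distance does, so it answers a somewhat different question from the one asked above.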

— Atul Samuel
