Python: tf-idf-cosine: to find document similarity

Question 1

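I was following a tutorial that was available as Part 1 & Part 2. Unfortunately, the author did not have time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from stackoverflow. The code mentioned in that link is included below (just to make life easier):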

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."]  # Documents
test_set = ["The sun in the sky is bright."]  # Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)

# Fit the TF-IDF weights on the train counts and show the weighted rows
transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

# Note: this re-fits the transformer on the test counts
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

As a result of the above code I have the following matrices:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]

I am not sure how to use this output to calculate cosine similarity. I know how to implement cosine similarity for two vectors of similar length, but here I am not sure how to identify the two vectors.

Question 2

First off, if you want to extract count features and apply TF-IDF normalization and row-wise Euclidean normalization, you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()

>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 1787553 stored elements in Compressed Sparse Row format>
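To make the three steps named above concrete (counts, TF-IDF weighting, row-wise normalization), here is a minimal plain-Python sketch applied to the two-document count matrix from the question. It assumes the smoothed idf formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is scikit-learn's default, followed by L2 normalization of each row. The helper name tfidf_rows is made up for illustration.

```python
import math

def tfidf_rows(counts):
    """Weight raw term counts by smoothed idf, then L2-normalize each row.

    Assumes idf = ln((1 + n) / (1 + df)) + 1 (scikit-learn's default);
    tfidf_rows is a hypothetical helper, not a library function.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Count matrix of the train set from the question
train_counts = [[1, 0, 1, 0],
                [0, 1, 0, 1]]
rows = tfidf_rows(train_counts)
for row in rows:
    print([round(v, 8) for v in row])
```

Under these assumptions the rows come out as 0.70710678 in the non-zero positions, matching the transformed train matrix printed in the question.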

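Now, to find the cosine distances between one document (e.g. the first in the dataset) and all of the others, you just need to compute the dot products of the first vector with all of the others, because the tfidf vectors are already row-normalized.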

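As explained by Chris Clark in the comments and here, cosine similarity does not take the magnitude of the vectors into account. Row-normalized vectors have a magnitude of 1, so the linear kernel is sufficient to calculate the similarity values.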
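A small plain-Python sketch of why the dot product is enough once the rows have unit length. The two vectors are the count vectors from the question; the helper names dot, cosine_sim and l2_normalize are made up for this illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_sim(a, b):
    # Full formula: dot product divided by the product of the norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [1, 0, 1, 0]   # "The sky is blue." as counts
b = [0, 1, 1, 1]   # "The sun in the sky is bright." as counts

full = cosine_sim(a, b)
# After normalization both norms are 1, so the denominator disappears.
shortcut = dot(l2_normalize(a), l2_normalize(b))
print(round(full, 3), round(shortcut, 3))
```

Both values agree, which is the reason a linear kernel on row-normalized tf-idf vectors yields cosine similarities.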

The scipy sparse matrix API is a bit weird (not as flexible as dense N-dimensional numpy arrays). To get the first vector you need to slice the matrix row-wise to get a submatrix with a single row:

>>> tfidf[0:1]
<1x130088 sparse matrix of type '<type 'numpy.float64'>'
    with 89 stored elements in Compressed Sparse Row format>

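scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. In this case we need a dot product, which is also known as the linear kernel: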

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

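Hence, to find the top related documents, we can use argsort and some negative array slicing (the most related documents have the highest cosine similarity values, so they sit at the end of the sorted indices array). Note that the slice [:-5:-1] used here returns the top four indices; [:-6:-1] would return five.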

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
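The negative slice is easy to get off by one, so here is a small sketch of how it behaves. It uses a plain Python list as a stand-in for the numpy array (list slicing follows the same rules here), and the six scores are made up for illustration.

```python
# Hypothetical similarity scores for six documents (made up for illustration).
scores = [1.0, 0.04, 0.55, 0.11, 0.33, 0.28]

# Plain-Python stand-in for numpy's argsort: indices ordered by ascending score.
order = sorted(range(len(scores)), key=lambda i: scores[i])

top4 = order[:-5:-1]   # walks back from the end and stops before position -5
top5 = order[:-6:-1]   # one step further yields five indices
print(top4)
print(top5)
```

The stop position is exclusive, which is why a stop of -5 yields only four indices.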

The first result is a sanity check: we find the query document itself as the most similar document, with a cosine similarity score of 1. It has the following text:

>>> print(twenty.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----

The second most similar document is a reply that quotes the original message, hence it has many words in common:

>>> print(twenty.data[958])
From: rseymour@reed.edu (Robert Seymour)
Subject: Re: WHAT car is this!?
Article-I.D.: reed.1993Apr21.032905.29286
Reply-To: rseymour@reed.edu
Organization: Reed College, Portland, OR
Lines: 26

In article <1993Apr20.174246.14375@wam.umd.edu> lerxst@wam.umd.edu (where's my
thing) writes:
>
>  I was wondering if anyone out there could enlighten me on this car I saw
> the other day. It was a 2-door sports car, looked to be from the late 60s/
> early 70s. It was called a Bricklin. The doors were really small. In
addition,
> the front bumper was separate from the rest of the body. This is
> all I know. If anyone can tellme a model name, engine specs, years
> of production, where this car is made, history, or whatever info you
> have on this funky looking car, please e-mail.

Bricklins were manufactured in the 70s with engines from Ford. They are rather
odd looking with the encased front bumper. There aren't a lot of them around,
but Hemmings (Motor News) ususally has ten or so listed. Basically, they are a
performance Ford with new styling slapped on top.

>    ---- brought to you by your neighborhood Lerxst ----

Rush fan?

--
Robert Seymour              rseymour@reed.edu
Physics and Philosophy, Reed College    (NeXTmail accepted)
Artificial Life Project         Reed College
Reed Solar Energy Project (SolTrain)    Portland, OR

Question 3

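With the help of @excray's comment, I managed to figure out the answer. What we need to do is write a simple for loop that iterates over the two arrays representing the train data and the test data.

First, implement a simple lambda function to hold the formula for the cosine calculation: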

cosine_function = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

Then write a simple for loop to iterate over the two sets of vectors. The logic is: for each vector in trainVectorizerArray, find its cosine similarity with the vector in testVectorizerArray.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.corpus import stopwords
import numpy as np
import numpy.linalg as LA

train_set = ["The sky is blue.", "The sun is bright."] #Documents
test_set = ["The sun in the sky is bright."] #Query
stopWords = stopwords.words('english')

vectorizer = CountVectorizer(stop_words = stopWords)
#print vectorizer
transformer = TfidfTransformer()
#print transformer

trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
testVectorizerArray = vectorizer.transform(test_set).toarray()
print('Fit Vectorizer to train set', trainVectorizerArray)
print('Transform Vectorizer to test set', testVectorizerArray)
cx = lambda a, b : round(np.inner(a, b)/(LA.norm(a)*LA.norm(b)), 3)

# Compare every train vector with every test vector
for vector in trainVectorizerArray:
    print(vector)
    for testV in testVectorizerArray:
        print(testV)
        cosine = cx(vector, testV)
        print(cosine)

transformer.fit(trainVectorizerArray)
print()
print(transformer.transform(trainVectorizerArray).toarray())

transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())

Here is the output:

Fit Vectorizer to train set [[1 0 1 0]
 [0 1 0 1]]
Transform Vectorizer to test set [[0 1 1 1]]
[1 0 1 0]
[0 1 1 1]
0.408
[0 1 0 1]
[0 1 1 1]
0.816

[[ 0.70710678  0.          0.70710678  0.        ]
 [ 0.          0.70710678  0.          0.70710678]]

[[ 0.          0.57735027  0.57735027  0.57735027]]
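One caveat with the lambda above: if a query contains only words outside the vocabulary, its count vector is all zeros and the norm in the denominator is 0. The sketch below is a plain-Python variant that guards against that case. The name safe_cosine and the choice to return 0.0 for a zero vector are assumptions made for this illustration, not part of the original answer.

```python
import math

def safe_cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch, not a library rule
    return round(dot / (norm_a * norm_b), 3)

train = [[1, 0, 1, 0], [0, 1, 0, 1]]   # count vectors from the output above
query = [0, 1, 1, 1]

results = [safe_cosine(v, query) for v in train]
print(results)                            # [0.408, 0.816]
print(safe_cosine([0, 0, 0, 0], query))   # 0.0
```

For the two train vectors this reproduces the 0.408 and 0.816 shown in the output.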

Question 4

I know this is an old post, but I tried the http://scikit-learn.sourceforge.net/stable/ package. The question was how to calculate cosine similarity with this package, and here is my code for that:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Read each document as UTF-8 text, ignoring undecodable bytes
with open("/root/Myfolder/scoringDocuments/doc1", encoding="utf-8", errors="ignore") as f:
    doc1 = f.read()
with open("/root/Myfolder/scoringDocuments/doc2", encoding="utf-8", errors="ignore") as f:
    doc2 = f.read()
with open("/root/Myfolder/scoringDocuments/doc3", encoding="utf-8", errors="ignore") as f:
    doc3 = f.read()

train_set = ["president of India",doc1, doc2, doc3]

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_set)  #finds the tfidf score with normalization
print("cosine scores ==> ", cosine_similarity(tfidf_matrix_train[0:1], tfidf_matrix_train))  # the first row of tfidf_matrix_train (the query) is compared with every row, itself included

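Here, suppose the query is the first element of train_set, and doc1, doc2 and doc3 are the documents I want to rank with the help of cosine similarity. Then I can use this code.

The tutorials provided in the question were also very useful. Here are all the parts: part-I, part-II, part-III.

The output will be as follows: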

[[ 1.          0.07102631  0.02731343  0.06348799]]

Here, 1 indicates that the query is matched with itself, and the other three values are the scores for matching the query with the respective documents.
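Turning those scores into a ranking is a matter of dropping the query's own entry and sorting the rest. A minimal sketch, using the scores copied from the output above; the list of names is only a label for each row of train_set.

```python
# Scores copied from the output above; index 0 is the query itself.
scores = [1.0, 0.07102631, 0.02731343, 0.06348799]
names = ["query", "doc1", "doc2", "doc3"]

# Drop the query (index 0), then sort the rest by descending score.
ranked = sorted(zip(names[1:], scores[1:]), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, score)
```

With these numbers the order is doc1, doc3, doc2.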

Question 5

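Let me give you another tutorial, written by me. It answers your question, but also explains why we are doing some of the things. I also tried to make it concise.

So you have a list_of_documents, which is just an array of strings, and another document, which is just a string. You need to find the document in list_of_documents that is most similar to document.

Let's combine them: documents = list_of_documents + [document]

Let's start with the dependencies. It will become clear why we use each of them.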

from nltk.corpus import stopwords
import string
from nltk.tokenize import wordpunct_tokenize as tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine

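One of the approaches that can be used is the bag-of-words approach, where we treat each word in a document independently of the others and just throw all of them together into one big bag. From one point of view, it loses a lot of information (such as how the words are connected), but from another point of view it makes the model simple.

In English, and in any other human language, there are a lot of "useless" words like 'a', 'the' and 'in', which are so common that they do not carry much meaning. They are called stop words, and it is a good idea to remove them. Another thing one can notice is that words like 'analyze', 'analyzer' and 'analysis' are really similar. They have a common root and can all be converted to a single word. This process is called stemming, and there are different stemmers that differ in speed, aggressiveness and so on. So we transform each document into a list of word stems without stop words. We also discard all punctuation.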

porter = PorterStemmer()
stop_words = set(stopwords.words('english'))

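# Strip punctuation, tokenize, drop stop words, and stem what remains
punctuation_table = str.maketrans('', '', string.punctuation)
modified_arr = [[porter.stem(i.lower()) for i in tokenize(d.translate(punctuation_table)) if i.lower() not in stop_words] for d in documents]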

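So how will this bag of words help us? Imagine we have 3 bags: [a, b, c], [a, c, a] and [b, c, d]. We can convert them to vectors in the basis [a, b, c, d]. So we end up with the vectors [1, 1, 1, 0], [2, 0, 1, 0] and [0, 1, 1, 1]. The same thing happens with our documents (only the vectors will be much longer). Now we can see why we removed a lot of words and stemmed the others: to reduce the dimensions of the vectors. There is one interesting observation here. Longer documents will have many more positive elements than shorter ones, which is why it is a good idea to normalize the vector. This is called term frequency (TF). People also use additional information about how often the word occurs in other documents: inverse document frequency (IDF). Together they give the metric TF-IDF, which comes in a couple of flavors. This can be achieved with one line in sklearn :-)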

modified_doc = [' '.join(i) for i in modified_arr] # this is only to convert our list of lists to list of strings that vectorizer uses.
tf_idf = TfidfVectorizer().fit_transform(modified_doc)
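The bag-to-vector step described above, with the bags [a, b, c], [a, c, a] and [b, c, d], can be sketched in plain Python as follows. This only illustrates the counting; it is not what TfidfVectorizer does internally, and the variable names are chosen for this example.

```python
from collections import Counter

bags = [["a", "b", "c"], ["a", "c", "a"], ["b", "c", "d"]]

# The basis (vocabulary) is every distinct word, in a fixed sorted order.
vocabulary = sorted({word for bag in bags for word in bag})

# Each bag becomes a vector of per-word counts over that vocabulary.
vectors = []
for bag in bags:
    counts = Counter(bag)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['a', 'b', 'c', 'd']
print(vectors)      # [[1, 1, 1, 0], [2, 0, 1, 0], [0, 1, 1, 1]]
```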

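Actually, the vectorizer lets you do a lot of things, like removing stop words and lowercasing. I did them in a separate step only because sklearn does not have non-English stop words, but nltk does.

So we have all the vectors calculated. The last step is to find which one is most similar to the last one. There are various ways to achieve that. One of them is Euclidean distance, which is not so great for the reason discussed here. Another approach is cosine similarity. We iterate over all the documents and calculate the cosine similarity between each document and the last one: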

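l = len(documents) - 1  # index of the last row, i.e. the query document
minimum = (1, None)     # (cosine distance, index) of the best match so far
for i in range(l):
    # scipy's cosine() is a distance and expects 1-D arrays
    distance = cosine(tf_idf[i].toarray().ravel(), tf_idf[l].toarray().ravel())
    if distance < minimum[0]:
        minimum = (distance, i)
print(minimum)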

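Now minimum holds the information about the best document and its score. Because scipy's cosine is a distance rather than a similarity, a lower score means a closer match.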
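As a rough illustration of the remark above that Euclidean distance is not ideal, the sketch below compares a short document with the same text repeated three times, using raw (unnormalized) count vectors. The vectors and helper names are made up for this example.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 minus the cosine similarity, the same convention scipy's cosine() uses.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_doc = [1, 1, 1, 0]   # word counts of a short text
long_doc = [3, 3, 3, 0]    # the same text repeated three times

print(euclidean_distance(short_doc, long_doc))
print(cosine_distance(short_doc, long_doc))
```

The Euclidean distance is clearly non-zero even though both documents use the same words in the same proportions, while the cosine distance is zero up to floating-point error, because it only looks at the direction of the vectors.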

Question 6

This should help you.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  

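# train_set is assumed to be a list of document strings, with the query last
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(train_set)
print(tfidf_matrix)
# Compare the last document against every document (itself included)
length = tfidf_matrix.shape[0]
cosine = cosine_similarity(tfidf_matrix[length-1], tfidf_matrix)
print(cosine)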

The output will be as follows:

[[ 0.34949812  0.81649658  1.        ]]

Question 7

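Here is a function that compares your test data against the training data, with the Tf-Idf transformer fitted on the training data. The advantage is that you can quickly pivot or group by to find the n closest elements, and that the calculations are done matrix-wise.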

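import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def create_tokenizer_score(new_series, train_series, tokenizer):
    """
    Return the cosine similarity of the tf-idf vectors for each possible pair of documents.
    Args:
        new_series (pd.Series): new data (to compare against train data)
        train_series (pd.Series): train data (to fit the tf-idf transformer)
        tokenizer: vectorizer with fit_transform/transform (e.g. TfidfVectorizer)
    Returns:
        pd.DataFrame
    """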

    train_tfidf = tokenizer.fit_transform(train_series)
    new_tfidf = tokenizer.transform(new_series)
    X = pd.DataFrame(cosine_similarity(new_tfidf, train_tfidf), columns=train_series.index)
    X['ix_new'] = new_series.index
    score = pd.melt(
        X,
        id_vars='ix_new',
        var_name='ix_train',
        value_name='score'
    )
    return score

train_set = pd.Series(["The sky is blue.", "The sun is bright."])
test_set = pd.Series(["The sun in the sky is bright."])
tokenizer = TfidfVectorizer() # initiate here your own tokenizer (TfidfVectorizer, CountVectorizer, with stopwords...)
score = create_tokenizer_score(train_series=train_set, new_series=test_set, tokenizer=tokenizer)
score

   ix_new  ix_train     score
0       0         0  0.617034
1       0         1  0.862012
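Picking the n closest train documents for each new document is then a group-and-sort over this long table. The sketch below does it in plain Python as a stand-in for a pandas group-by; the rows are copied from the table above and the helper name top_n_per_new is hypothetical.

```python
import heapq
from collections import defaultdict

# Rows of (ix_new, ix_train, score), copied from the table above.
rows = [(0, 0, 0.617034), (0, 1, 0.862012)]

def top_n_per_new(rows, n):
    """Group rows by ix_new and keep the n highest-scoring train indices."""
    grouped = defaultdict(list)
    for ix_new, ix_train, score in rows:
        grouped[ix_new].append((score, ix_train))
    return {ix_new: [ix for _, ix in heapq.nlargest(n, pairs)]
            for ix_new, pairs in grouped.items()}

closest = top_n_per_new(rows, n=1)
print(closest)   # {0: [1]}
```

Here the single new document is closest to train document 1 ("The sun is bright."), which has the higher score.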