Using scikit-learn to classify text into multiple categories

Question 1

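I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of every algorithm I have tried returns only one match.

For example, I have a piece of text: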

"Theaters in New York compared to those in London"

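And I have trained the algorithm to pick a place for every text snippet I feed it.

In the example above I would want it to return New York and London, but it only returns New York.

Is it possible to use scikit-learn to return multiple results? Or even to return the label with the next-highest probability?

Thanks for your help.

---Update

I tried using OneVsRestClassifier, but I still get only one option back per piece of text. Below is the sample code I am using: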

y_train = ('New York','London')


train_set = ("new york nyc big apple", "london uk great britain")
vocab = {'new york' :0,'nyc':1,'big apple':2,'london' : 3, 'uk': 4, 'great britain' : 5}
count = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=2),vocabulary=vocab)
test_set = ('nice day in nyc','london town','hello welcome to the big apple. enjoy it here and london too')

X_vectorized = count.transform(train_set).todense()
smatrix2  = count.transform(test_set).todense()


base_clf = MultinomialNB(alpha=1)

clf = OneVsRestClassifier(base_clf).fit(X_vectorized, y_train)
Y_pred = clf.predict(smatrix2)
print Y_pred

Result: ['New York' 'London' 'London']

Question 2

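What you want is called multi-label classification. scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html .

I'm not sure what is going wrong in your example; my version of sklearn apparently does not have WordNGramAnalyzer. Perhaps it is a matter of using more training examples or trying a different classifier? Note, though, that the multi-label classifier expects the target to be a list of tuples or a list of lists of labels.

The following works for me: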

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train = [[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[0,1],[0,1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])   
target_names = ['New York', 'London']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_n=1,max_n=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print '%s => %s' % (item, ', '.join(target_names[x] for x in labels))

For me, this produces the output:

nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London
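
As a side note on the vectorizer step: the `min_n`/`max_n` arguments used above come from an older scikit-learn API. The sketch below assumes a more recent release, where the same unigram-plus-bigram behaviour is requested through `ngram_range`. The sample sentence is only an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) asks for both single words and two-word sequences,
# which is what min_n=1, max_n=2 expressed in the older API.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(["new york is great"])

# vocabulary_ maps each extracted n-gram to its column index.
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)
```

Bigrams such as "new york" then become features of their own, which helps when a place name spans two words.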

Hope this helps.

Question 3

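EDIT: Updated for Python 3, scikit-learn 0.18.1, using MultiLabelBinarizer as suggested.

I have been working on this as well, and made a slight enhancement to mwv's excellent answer that may be useful. It takes text labels as input rather than binary labels and encodes them using MultiLabelBinarizer.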

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('{0} => {1}'.format(item, ', '.join(labels)))

This gives me the following output:

nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => london, new york
hello welcome to new york. enjoy it here and london too => london, new york
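
The original question also asked about getting the label with the next-highest score. One possible approach, sketched here under the assumption that the same kind of pipeline is used, is to call `decision_function` and rank the labels by their per-label SVM scores. These are margins, not probabilities, and the shortened test set is only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)

X_test = ["nice day in nyc", "welcome to london"]

# One column of scores per label, in the order given by mlb.classes_.
scores = classifier.decision_function(X_test)

for text, row in zip(X_test, scores):
    # Sort labels from highest to lowest score for this text.
    ranked = sorted(zip(mlb.classes_, row), key=lambda pair: pair[1], reverse=True)
    print(text, "=>", [(label, round(float(score), 3)) for label, score in ranked])
```

A positive score means the corresponding binary classifier would predict that label; the ranking also exposes the runner-up when only one label is predicted.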

Question 4

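I ran into this as well, and the problem for me was that my y_train was a sequence of strings rather than a sequence of sequences of strings. Apparently, OneVsRestClassifier decides whether to use multi-class or multi-label classification based on the format of the input labels. So change:

y_train = ('New York','London')

to

y_train = (['New York'],['London'])

Apparently this behaviour will go away in the future, since it breaks if all the labels are the same: https://github.com/scikit-learn/scikit-learn/pull/1987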
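
To see how the label format is interpreted, one option is the `type_of_target` helper from `sklearn.utils.multiclass`. This is a minimal sketch assuming a reasonably recent scikit-learn, where multi-label targets are passed as a binary indicator matrix (built here with MultiLabelBinarizer) rather than as a sequence of sequences. The variable names are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.multiclass import type_of_target

# A flat sequence of labels: one label per sample (single-label problem).
flat_labels = ['New York', 'London', 'New York']
flat_kind = type_of_target(flat_labels)
print(flat_kind)  # 'binary' here, because there are two distinct labels

# A sequence of label lists, converted to a 0/1 indicator matrix.
nested_labels = [['New York'], ['London'], ['New York', 'London']]
indicator = MultiLabelBinarizer().fit_transform(nested_labels)
indicator_kind = type_of_target(indicator)
print(indicator_kind)  # 'multilabel-indicator'
```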

Question 5

Change this line to make it work in newer versions of Python:

# lb = preprocessing.LabelBinarizer()
lb = preprocessing.MultiLabelBinarizer()
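
As a rough illustration of what MultiLabelBinarizer does with a list of label lists, here is a small self-contained sketch. The sample labels are assumptions chosen to match the earlier examples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = [['New York'], ['London'], ['New York', 'London']]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(labels)

# classes_ holds the sorted label names; each column of Y corresponds to one of them.
print(lb.classes_)
print(Y)

# inverse_transform maps indicator rows back to tuples of label names.
decoded = lb.inverse_transform(Y)
print(decoded)
```

Each row of `Y` can contain more than one 1, which is what allows a sample to carry several labels at once.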

Question 6

Here are a few multi-class classification examples:

Example 1:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array([1, 2, 3,4,5,6,7,8,9,10,11,12,13,14,1])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is:

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Example 2:

import numpy as np
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()

arr2d = np.array(['Leopard','Lion','Tiger', 'Lion'])
transfomed_label = encoder.fit_transform(arr2d)
print(transfomed_label)

The output is:

[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 1 0]]
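
To read such a matrix back, the column order can be checked through `classes_`, and `inverse_transform` recovers the original labels. A minimal sketch reusing the data from Example 2 (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
labels = np.array(['Leopard', 'Lion', 'Tiger', 'Lion'])
one_hot = encoder.fit_transform(labels)

# Column i of the one-hot matrix corresponds to encoder.classes_[i].
print(encoder.classes_)

# inverse_transform turns each one-hot row back into its label.
decoded = encoder.inverse_transform(one_hot)
print(decoded)
```

Note that LabelBinarizer assigns exactly one class per sample, so it suits multi-class targets; multi-label targets need MultiLabelBinarizer instead.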