CalibratedClassifierCV를 사용하여 분류자를 교정하는 올바른 Scikit 방법

16

Scikit에는 CalibratedClassifierCV 가있어 특정 X, y 쌍에서 모델을 교정 할 수 있습니다. 또한 명확하게data for fitting the classifier and for calibrating it must be disjoint.

그들이 분리되어 있어야한다면, 분류기를 다음과 같이 훈련시키는 것이 합법적인가?

model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)

나는 동일한 훈련 세트를 사용함으로써 disjoint data규칙을 어 기고 있다는 것을 두려워합니다 . 대안은 유효성 검사 세트를 가질 수 있습니다.

my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)

훈련을 위해 더 적은 데이터를 남기는 단점이 있습니다. 또한 CalibratedClassifierCV 가 다른 학습 세트에 맞는 모델에만 적합해야하는 경우 cv=3기본 추정값이 왜 기본 추정기에 적합합니까? 교차 유효성 검사는 분리 규칙을 자체적으로 처리합니까?

질문 : CalibratedClassifierCV를 사용하는 올바른 방법은 무엇입니까?

— sapo_cosmico
소스

18

CalibratedClassifierCV 문서 에는 두 가지 방법으로 사용할 수있는 방법이 있습니다.

base_estimator : cv = prefit 인 경우 분류자가 이미 데이터에 적합해야합니다.

cv : “prefit”이 통과되면 base_estimator가 이미 장착 된 것으로 가정하고 모든 데이터를 교정에 사용합니다.

분명히 이것을 잘못 해석했을 수도 있지만 CCCV (CalibratedClassifierCV의 약자)를 두 가지 방법으로 사용할 수 있습니다.

첫번째:

평소처럼 모델을 훈련시킵니다 your_model.fit(X_train, y_train).
그런 다음 CCCV 인스턴스를 작성하십시오 your_cccv = CalibratedClassifierCV(your_model, cv='prefit'). cv모델이 이미 적합했음을 표시하도록 설정 했습니다.
마지막으로을 호출 your_cccv.fit(X_validation, y_validation)합니다. 이 검증 데이터는 교정 목적으로 만 사용됩니다.

두 번째 :

훈련되지 않은 새로운 모델이 있습니다.
그런 다음을 만듭니다 your_cccv=CalibratedClassifierCV(your_untrained_model, cv=3). 공지 사항은 cv이제 주름의 수입니다.
마지막으로을 호출 your_cccv.fit(X, y)합니다. 모델이 훈련되지 않았으므로 훈련과 교정 모두에 X와 y를 사용해야합니다. 데이터가 '비 연속적'임을 확인하는 방법은 교차 검증입니다. 특정 접힘에 대해 CCCV는 X와 y를 훈련 및 교정 데이터로 분할하므로 중복되지 않습니다.

TLDR : 방법 1을 사용하면 교육 및 교정에 사용되는 것을 제어 할 수 있습니다. 방법 2는 교차 검증을 사용하여 두 가지 목적을 위해 데이터를 최대한 활용하려고합니다.

— 핀 타스
소스

14

이 질문에도 관심이 있으며 CalibratedClassifierCV (CCCV)를 더 잘 이해하기 위해 몇 가지 실험을 추가하고 싶었습니다.

이미 말했듯이, 그것을 사용하는 두 가지 방법이 있습니다.

#Method 1, train classifier within CCCV
model = CalibratedClassifierCV(my_clf)
model.fit(X_train_val, y_train_val)

#Method 2, train classifier and then use CCCV on DISJOINT set
my_clf.fit(X_train, y_train)
model = CalibratedClassifierCV(my_clf, cv='prefit')
model.fit(X_val, y_val)

또는 두 번째 방법을 시도해 볼 수 있지만 동일한 데이터를 사용하여 보정하면됩니다.

#Method 2 Non disjoint, train classifier on set, then use CCCV on SAME set used for training
my_clf.fit(X_train_val, y_train_val)
model = CalibratedClassifierCV(my_clf, cv='prefit')
model.fit(X_train_val, y_train_val)

문서에서 비 연속 세트를 사용하도록 경고하지만,이를 통해 검사 할 수 있기 때문에 유용 할 수 있습니다 my_clf(예 : coef_CalibratedClassifierCV 객체에서 사용할 수없는). (교정 된 분류 자에서 이것을 얻는 방법을 아는 사람이 있습니까?

나는이 3 가지 방법을 완전히 테스트 된 테스트 세트에서 교정 측면에서 비교하기로 결정했습니다.

데이터 세트는 다음과 같습니다.

X, y = datasets.make_classification(n_samples=500, n_features=200,
                                    n_informative=10, n_redundant=10,
                                    #random_state=42, 
                                    n_clusters_per_class=1, weights = [0.8,0.2])

나는 약간의 클래스 불균형을 던졌고 500 개의 샘플 만 제공하여 이것을 어려운 문제로 만들었습니다.

각 방법을 시도하고 교정 곡선을 그릴 때마다 100 번의 시험을 실행합니다.

Brier의 Boxplots는 모든 시험에서 점수를 매 깁니다.

샘플 수를 10,000으로 늘리기 :

분류자를 Naive Bayes로 변경하면 500 개 샘플로 돌아갑니다.

교정하기에 충분한 샘플이 아닌 것 같습니다. 샘플을 10,000으로 증가

전체 코드

print(__doc__)

# Based on code by Alexandre Gramfort <alexandre.gramfort@telecom-paristech.fr>
#         Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.model_selection import train_test_split


def plot_calibration_curve(clf, name, ax, X_test, y_test, title):

    y_pred = clf.predict(X_test)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # use decision function
        prob_pos = clf.decision_function(X_test)
        prob_pos = \
            (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

    clf_score = brier_score_loss(y_test, prob_pos, pos_label=y.max())

    fraction_of_positives, mean_predicted_value = \
        calibration_curve(y_test, prob_pos, n_bins=10, normalize=False)

    ax.plot(mean_predicted_value, fraction_of_positives, "s-",
             label="%s (%1.3f)" % (name, clf_score), alpha=0.5, color='k', marker=None)

    ax.set_ylabel("Fraction of positives")
    ax.set_ylim([-0.05, 1.05])
    ax.set_title(title)

    ax.set_xlabel("Mean predicted value")

    plt.tight_layout()
    return clf_score

    fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, figsize=(6,12))

    ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated",)
    ax2.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    ax3.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

    scores = {'Method 1':[],'Method 2':[],'Method 3':[]}


fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, ncols=1, figsize=(6,12))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated",)
ax2.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
ax3.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

scores = {'Method 1':[],'Method 2':[],'Method 3':[]}

for i in range(0,100):

    X, y = datasets.make_classification(n_samples=10000, n_features=200,
                                        n_informative=10, n_redundant=10,
                                        #random_state=42, 
                                        n_clusters_per_class=1, weights = [0.8,0.2])

    X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.80,
                                                        #random_state=42
                                                               )

    X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.80,
                                                      #random_state=42
                                                     )

    #my_clf = GaussianNB()
    my_clf = LogisticRegression()

    #Method 1, train classifier within CCCV
    model = CalibratedClassifierCV(my_clf)
    model.fit(X_train_val, y_train_val)
    r = plot_calibration_curve(model, "all_cal", ax1, X_test, y_test, "Method 1")
    scores['Method 1'].append(r)

    #Method 2, train classifier and then use CCCV on DISJOINT set
    my_clf.fit(X_train, y_train)
    model = CalibratedClassifierCV(my_clf, cv='prefit')
    model.fit(X_val, y_val)
    r = plot_calibration_curve(model, "all_cal", ax2, X_test, y_test, "Method 2")
    scores['Method 2'].append(r)

    #Method 3, train classifier on set, then use CCCV on SAME set used for training
    my_clf.fit(X_train_val, y_train_val)
    model = CalibratedClassifierCV(my_clf, cv='prefit')
    model.fit(X_train_val, y_train_val)
    r = plot_calibration_curve(model, "all_cal", ax3, X_test, y_test, "Method 2 non Dis")
    scores['Method 3'].append(r)

import pandas
b = pandas.DataFrame(scores).boxplot()
plt.suptitle('Brier score')

따라서 Brier 점수 결과는 결정적이지 않지만 곡선에 따르면 두 번째 방법을 사용하는 것이 가장 좋습니다.

— 사용자 0
소스