Any training procedure that minimizes a loss function will, if that loss is sufficiently convex, find a solution that is a global minimum of the loss. I say 'sufficiently convex' because deep networks are not on the whole convex, but in practice they reach reasonable minima, given careful choices of learning rate and so on.
Therefore, the behavior of such models is defined by whatever we put in the loss function.
Imagine that we have a model, F, that assigns some arbitrary real scalar to each example, such that more negative values tend to indicate class A, and more positive numbers tend to indicate class B.
$$y_f = f(x)$$
We use F to create model G, which applies a threshold, b, to the output of F, implicitly or explicitly: when F outputs a value greater than b, model G predicts class B; otherwise it predicts class A.
$$y_g = \begin{cases} B & \text{if } f(x) > b \\ A & \text{otherwise} \end{cases}$$
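As a minimal sketch of this setup (the encoding of class A as 0 and class B as 1, and the scores standing in for F's outputs, are assumptions for illustration):

```python
import numpy as np

# Minimal sketch of model G: apply a threshold b to F's scalar scores.
# Encoding 0 = class A, 1 = class B is an assumption for illustration.
def g(scores: np.ndarray, b: float) -> np.ndarray:
    return (scores > b).astype(int)  # 1 (class B) when f(x) > b, else 0 (class A)

scores = np.array([-2.1, -0.3, 0.4, 1.7])  # hypothetical f(x) outputs
print(g(scores, b=0.0))                    # [0 0 1 1]
```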
By varying the threshold b that model G learns, we can vary the proportion of examples classified as class A or class B, moving along a precision/recall curve for each class. A higher threshold gives lower recall for class B, but typically higher precision.
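To make that concrete, here is a sketch using scikit-learn's `precision_recall_curve` on made-up labels and scores (1 encodes class B); each distinct score is evaluated as a candidate threshold:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels (1 = class B) and scores from F.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
scores = np.array([-1.2, -0.8, -0.1, 0.2, 0.3, 0.5, 0.9, 1.1, 1.4, 2.0])

# Precision and recall here are for the positive class, B.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:+.1f}  precision(B)={p:.2f}  recall(B)={r:.2f}")
```

Raising the threshold walks toward fewer, more confident B predictions: recall for B drops while precision tends to rise.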
Imagine that model F is such that, if we choose the threshold that gives equal precision and recall for each class, the accuracy of model G is 90% for either class (by symmetry). So, given a training example, G would get it right 90% of the time, whatever the ground truth, A or B. This is presumably where we want to get to. Let's call this our 'ideal threshold', or 'ideal model G', or perhaps $G^*$.
Now, let's say we have a loss function which is:
$$L = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}_{y_i \neq g(x_i)}$$
where $\mathbb{I}_c$ is an indicator variable that is 1 when $c$ is true and 0 otherwise, $y_i$ is the true class for example $i$, and $g(x_i)$ is the class predicted for example $i$ by model G.
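Written directly in code, this loss is just the fraction of examples where prediction and ground truth disagree (the labels below are hypothetical):

```python
import numpy as np

# The 0/1 loss above: the mean of the indicator y_i != g(x_i) over the dataset.
def zero_one_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(y_true != y_pred))

y_true = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0])
print(zero_one_loss(y_true, y_pred))  # 0.4
```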
Imagine that we have a dataset with 99 times as many training examples of class A as of class B, and we feed examples through. For every 99 examples of A, we expect $99 \times 0.9 = 89.1$ examples correct, and $99 \times 0.1 = 9.9$ examples incorrect. Similarly, for every 1 example of B, we expect $1 \times 0.9 = 0.9$ examples correct, and $1 \times 0.1 = 0.1$ examples incorrect. The expected loss will be:
$$L = \frac{9.9 + 0.1}{100} = 0.1$$
Now, let's look at a model G whose threshold is set such that class A is always chosen. For every 99 examples of A, all 99 will be correct: zero loss. But every example of B will be systematically misclassified, contributing 1 error per 100 examples, so the expected loss over the training set will be:
$$L = 0.01$$
Ten times lower than the loss at the threshold that gave equal recall and precision to each class.
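A quick Monte-Carlo sanity check of this arithmetic, under the assumptions above (per-class accuracy of 0.9 for the ideal $G^*$, a 99:1 imbalance, 0 encoding A and 1 encoding B):

```python
import numpy as np

rng = np.random.default_rng(0)

# 99:1 imbalanced dataset; 0 encodes class A, 1 encodes class B.
n_a, n_b = 99_000, 1_000
y = np.array([0] * n_a + [1] * n_b)

# Ideal G*: keeps each true label with probability 0.9, flips it otherwise.
pred_ideal = np.where(rng.random(y.size) < 0.9, y, 1 - y)
# Degenerate G: always predicts class A.
pred_always_a = np.zeros_like(y)

print("loss of ideal G*:  ", np.mean(pred_ideal != y))     # ~0.10
print("loss of always-A G:", np.mean(pred_always_a != y))  # 0.01
```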
Therefore, the loss function will drive model G to choose a threshold that picks class A with higher probability than class B, driving up the recall for class A but lowering it for class B. The resulting model no longer matches what we might hope for; it no longer matches our ideal model $G^*$.
To correct this, we could, for example, modify the loss function so that getting B wrong costs much more than getting A wrong. This moves the minimum of the loss function closer to the earlier ideal model $G^*$, which assigned equal precision and recall to each class.
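One possible form of such a loss (the weighting scheme and normalization here are my assumptions, for illustration) weights each error by its class, so that an error on B costs $w_B$ times as much as an error on A:

```python
import numpy as np

# Class-weighted 0/1 loss: errors on class B (encoded 1) cost w_b times
# more than errors on class A, normalized by the total weight.
def weighted_loss(y_true, y_pred, w_b=99.0):
    weights = np.where(y_true == 1, w_b, 1.0)
    return float(np.sum(weights * (y_true != y_pred)) / np.sum(weights))

y = np.array([0] * 99 + [1])
always_a = np.zeros_like(y)
print(weighted_loss(y, always_a))  # 0.5
```

With $w_B = 99$, always predicting A now costs 0.5, while the ideal $G^*$ costs $(9.9 \times 1 + 0.1 \times 99)/198 = 0.1$, so the minimum moves back toward the balanced threshold.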
Alternatively, we can modify the dataset by replicating every B example 99 times, so that the classes are balanced; this also moves the minimum of the loss function back to our earlier ideal threshold.
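A minimal sketch of that oversampling fix (`X`, `y`, and the helper name are hypothetical):

```python
import numpy as np

# Repeat each class-B row (encoded 1) `factor` times so that both classes
# contribute equally to the unweighted loss.
def balance_by_oversampling(X, y, factor=99):
    b_mask = (y == 1)
    X_b = np.repeat(X[b_mask], factor, axis=0)
    y_b = np.repeat(y[b_mask], factor)
    return np.concatenate([X[~b_mask], X_b]), np.concatenate([y[~b_mask], y_b])

X = np.random.default_rng(1).normal(size=(100, 2))
y = np.array([0] * 99 + [1])
X_bal, y_bal = balance_by_oversampling(X, y)
print(np.bincount(y_bal))  # [99 99]
```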