K는 팔꿈치 방법, BIC, 분산 설명 및 실루엣으로 K를 선택하는 일관성없는 동작을 의미합니다.

K- 평균으로 90 개의 특징을 가진 일부 벡터를 클러스터하려고합니다. 이 알고리즘은 클러스터의 수를 묻기 때문에 좋은 수학으로 내 선택을 확인하고 싶습니다. 8 개에서 10 개의 클러스터가있을 것으로 예상합니다. 기능은 Z- 점수 스케일입니다.

팔꿈치 방법 및 분산 설명

from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

K = range(1,50)
KM = [KMeans(n_clusters=k).fit(dt_trans) for k in K]
centroids = [k.cluster_centers_ for k in KM]

D_k = [cdist(dt_trans, cent, 'euclidean') for cent in centroids]
cIdx = [np.argmin(D,axis=1) for D in D_k]
dist = [np.min(D,axis=1) for D in D_k]
avgWithinSS = [sum(d)/dt_trans.shape[0] for d in dist]

# Total with-in sum of square
wcss = [sum(d**2) for d in dist]
tss = sum(pdist(dt_trans)**2)/dt_trans.shape[0]
bss = tss-wcss

kIdx = 10-1

# elbow curve
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, avgWithinSS, 'b*-')
ax.plot(K[kIdx], avgWithinSS[kIdx], marker='o', markersize=12, 
markeredgewidth=2, markeredgecolor='r', markerfacecolor='None')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Average within-cluster sum of squares')
plt.title('Elbow for KMeans clustering')

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(K, bss/tss*100, 'b*-')
plt.grid(True)
plt.xlabel('Number of clusters')
plt.ylabel('Percentage of variance explained')
plt.title('Elbow for KMeans clustering')

팔꿈치 방법

이 두 그림에서 클러스터의 수는 절대 멈추지 않는 것 같습니다 : D. 이상한! 팔꿈치는 어디에 있습니까? K를 어떻게 선택할 수 있습니까?

베이지안 정보 기준

이 방법은 X- 평균에서 직접 제공되며 BIC 를 사용하여 군집 수를 선택합니다. 다른 심판

    from sklearn.metrics import euclidean_distances
from sklearn.cluster import KMeans

def bic(clusters, centroids):
    num_points = sum(len(cluster) for cluster in clusters)
    num_dims = clusters[0][0].shape[0]
    log_likelihood = _loglikelihood(num_points, num_dims, clusters, centroids)
    num_params = _free_params(len(clusters), num_dims)
    return log_likelihood - num_params / 2.0 * np.log(num_points)


def _free_params(num_clusters, num_dims):
    return num_clusters * (num_dims + 1)


def _loglikelihood(num_points, num_dims, clusters, centroids):
    ll = 0
    for cluster in clusters:
        fRn = len(cluster)
        t1 = fRn * np.log(fRn)
        t2 = fRn * np.log(num_points)
        variance = _cluster_variance(num_points, clusters, centroids) or np.nextafter(0, 1)
        t3 = ((fRn * num_dims) / 2.0) * np.log((2.0 * np.pi) * variance)
        t4 = (fRn - 1.0) / 2.0
        ll += t1 - t2 - t3 - t4
    return ll

def _cluster_variance(num_points, clusters, centroids):
    s = 0
    denom = float(num_points - len(centroids))
    for cluster, centroid in zip(clusters, centroids):
        distances = euclidean_distances(cluster, centroid)
        s += (distances*distances).sum()
    return s / denom

from scipy.spatial import distance
def compute_bic(kmeans,X):
    """
    Computes the BIC metric for a given clusters

    Parameters:
    -----------------------------------------
    kmeans:  List of clustering object from scikit learn

    X     :  multidimension np array of data points

    Returns:
    -----------------------------------------
    BIC value
    """
    # assign centers and labels
    centers = [kmeans.cluster_centers_]
    labels  = kmeans.labels_
    #number of clusters
    m = kmeans.n_clusters
    # size of the clusters
    n = np.bincount(labels)
    #size of data set
    N, d = X.shape

    #compute variance for all clusters beforehand
    cl_var = (1.0 / (N - m) / d) * sum([sum(distance.cdist(X[np.where(labels == i)], [centers[0][i]], 'euclidean')**2) for i in range(m)])

    const_term = 0.5 * m * np.log(N) * (d+1)

    BIC = np.sum([n[i] * np.log(n[i]) -
               n[i] * np.log(N) -
             ((n[i] * d) / 2) * np.log(2*np.pi*cl_var) -
             ((n[i] - 1) * d/ 2) for i in range(m)]) - const_term

    return(BIC)



sns.set_style("ticks")
sns.set_palette(sns.color_palette("Blues_r"))
bics = []
for n_clusters in range(2,50):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(dt_trans)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    clusters = {}
    for i,d in enumerate(kmeans.labels_):
        if d not in clusters:
            clusters[d] = []
        clusters[d].append(dt_trans[i])

    bics.append(compute_bic(kmeans,dt_trans))#-bic(clusters.values(), centroids))

plt.plot(bics)
plt.ylabel("BIC score")
plt.xlabel("k")
plt.title("BIC scoring for K-means cell's behaviour")
sns.despine()
#plt.savefig('figures/K-means-BIC.pdf', format='pdf', dpi=330,bbox_inches='tight')

여기에 이미지 설명을 입력하십시오

여기서도 같은 문제입니다 ... K는 무엇입니까?

실루엣

    from sklearn.metrics import silhouette_score

s = []
for n_clusters in range(2,30):
    kmeans = KMeans(n_clusters=n_clusters)
    kmeans.fit(dt_trans)

    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    s.append(silhouette_score(dt_trans, labels, metric='euclidean'))

plt.plot(s)
plt.ylabel("Silouette")
plt.xlabel("k")
plt.title("Silouette for K-means cell's behaviour")
sns.despine()

여기에 이미지 설명을 입력하십시오

알렐루야! 여기에 의미가있는 것 같고 이것이 내가 기대하는 것입니다. 그러나 왜 이것이 다른 것들과 다른가?

clustering k-means

— 마르코 데나
소스

분산 사례에서 무릎에 대한 질문에 대답하기 위해 약 6 또는 7 인 것처럼 보입니다. 곡선에 대한 두 개의 선형 근사 세그먼트 사이의 중단 점으로 상상할 수 있습니다. 그래프의 모양은 드문 일이 아닙니다. % 분산은 종종 무증상 100 %에 접근합니다. 나는 당신의 BIC 그래프에 k를 조금 더 낮게, 약 5로

— 넣었습니다

그러나 모든 방법에서 동일한 결과를 가져야합니다.

— marcodena

할 말이 충분하지 않다고 생각합니다. 세 가지 방법이 수학적으로 모든 데이터와 동등하다는 것을 의심합니다. 그렇지 않으면 별도의 기술로 존재하지 않으므로 비교 결과는 데이터에 따라 다릅니다. 두 가지 방법은 서로 가까운 수의 군집을 제공하며, 세 번째는 더 높지만 막대한 수는 없습니다. 실제 클러스터 수에 대한 사전 정보가 있습니까?

— image_doctor

100 % 확신 할 수는 없지만 8 개에서 10 개의 클러스터를 가질 것으로 예상됩니다

— marcodena

당신은 이미 "차원의 저주"의 블랙홀에 있습니다. 차원 축소 전에는 아무것도 작동하지 않습니다.

— Kasra Manshaei

답변:

위의 의견과 몇 가지 생각에 대한 요약을 게시하면이 질문이 "답변되지 않은 질문"에서 제거됩니다.

Image_doctor의 의견에 따르면 이러한 그래프는 k- 평균에 일반적입니다. (그러나 "Silhouette"측정 값에는 익숙하지 않습니다.) 클러스터 내 분산은 k가 증가함에 따라 계속 감소 할 것으로 예상됩니다. 팔꿈치는 커브가 가장 많이 구부러지는 곳입니다. (수학적인 것을 원한다면 "2 차 미분"이라고 생각할 수도 있습니다.)

일반적으로 최종 작업을 사용하여 k를 선택하는 것이 가장 좋습니다. 결정을 내리기 위해 클러스터의 통계적 측정 값을 사용하지 말고 시스템의 엔드 투 엔드 성능을 사용하여 선택을 안내하십시오. 통계는 시작점으로 만 사용하십시오.

— 요아킴 바그너
소스

연속 세그먼트 사이의 각도를 계산하여 팔꿈치를 찾는 것이 더 쉬워 질 수 있습니다.

교체하십시오 :

kIdx = 10-1

와:

seg_threshold = 0.95 #Set this to your desired target

#The angle between three points
def segments_gain(p1, v, p2):
    vp1 = np.linalg.norm(p1 - v)
    vp2 = np.linalg.norm(p2 - v)
    p1p2 = np.linalg.norm(p1 - p2)
    return np.arccos((vp1**2 + vp2**2 - p1p2**2) / (2 * vp1 * vp2)) / np.pi

#Normalize the data
criterion = np.array(avgWithinSS)
criterion = (criterion - criterion.min()) / (criterion.max() - criterion.min())

#Compute the angles
seg_gains = np.array([0, ] + [segments_gain(*
        [np.array([K[j], criterion[j]]) for j in range(i-1, i+2)]
    ) for i in range(len(K) - 2)] + [np.nan, ])

#Get the first index satisfying the threshold
kIdx = np.argmax(seg_gains > seg_threshold)

그리고 당신은 다음과 같은 것을 보게 될 것입니다 :

seg_gains를 시각화하면 다음과 같은 내용이 표시됩니다.

까다로운 팔꿈치를 찾을 수 있기를 바랍니다. :)

— 살루 울
소스

나는 Kneedle algorithim 을 구현하여 이와 같은 함수에서 최대 곡률 점을 감지 하는 Python 라이브러리 를 만들었습니다 . 로 설치할 수 있습니다 .pip install kneed

네 가지 형태의 함수를위한 코드 및 출력 :

from kneed.data_generator import DataGenerator
from kneed.knee_locator import KneeLocator

import numpy as np

import matplotlib.pyplot as plt

# sample x and y
x = np.arange(0,10)
y_convex_inc = np.array([1,2,3,4,5,10,15,20,40,100])
y_convex_dec = y_convex_inc[::-1]
y_concave_dec = 100 - y_convex_inc
y_concave_inc = 100 - y_convex_dec

# find the knee points
kn = KneeLocator(x, y_convex_inc, curve='convex', direction='increasing')
knee_yconvinc = kn.knee

kn = KneeLocator(x, y_convex_dec, curve='convex', direction='decreasing')
knee_yconvdec = kn.knee

kn = KneeLocator(x, y_concave_inc, curve='concave', direction='increasing')
knee_yconcinc = kn.knee

kn = KneeLocator(x, y_concave_dec, curve='concave', direction='decreasing')
knee_yconcdec = kn.knee

# plot
f, axes = plt.subplots(2, 2, figsize=(10,10));
yconvinc = axes[0][0]
yconvdec = axes[0][1]
yconcinc = axes[1][0]
yconcdec = axes[1][1]

yconvinc.plot(x, y_convex_inc)
yconvinc.vlines(x=knee_yconvinc, ymin=0, ymax=100, linestyle='--')
yconvinc.set_title("curve='convex', direction='increasing'")

yconvdec.plot(x, y_convex_dec)
yconvdec.vlines(x=knee_yconvdec, ymin=0, ymax=100, linestyle='--')
yconvdec.set_title("curve='convex', direction='decreasing'")

yconcinc.plot(x, y_concave_inc)
yconcinc.vlines(x=knee_yconcinc, ymin=0, ymax=100, linestyle='--')
yconcinc.set_title("curve='concave', direction='increasing'")

yconcdec.plot(x, y_concave_dec)
yconcdec.vlines(x=knee_yconcdec, ymin=0, ymax=100, linestyle='--')
yconcdec.set_title("curve='concave', direction='decreasing'");

— 케빈
소스