Besides LDA, you can also use Latent Semantic Analysis together with K-Means. It's not neural networks, just "classic" clustering, but it works quite well.

An example with sklearn (here):
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

dataset = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
# TfidfVectorizer (not TfidfTransformer) works directly on raw documents
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(dataset.data)
# LSA = truncated SVD of the TF-IDF matrix, followed by row normalization
svd = TruncatedSVD(n_components=true_k)
lsa = make_pipeline(svd, Normalizer(copy=False))
X = lsa.fit_transform(X)
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100)
km.fit(X)
Now you can use the cluster assignment labels in km.labels_.

For example, here are the topics extracted from the 20 newsgroups with LSA:
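If you want a quick sanity check against the ground-truth newsgroup labels, something like this works (a minimal sketch; labels and km are the variables from the snippet above):

from sklearn import metrics

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels, km.labels_))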
Cluster 0: space shuttle alaska edu nasa moon launch orbit henry sci
Cluster 1: edu game team games year ca university players hockey baseball
Cluster 2: sale 00 edu 10 offer new distribution subject lines shipping
Cluster 3: israel israeli jews arab jewish arabs edu jake peace israelis
Cluster 4: cmu andrew org com stratus edu mellon carnegie pittsburgh pa
Cluster 5: god jesus christian bible church christ christians people edu believe
Cluster 6: drive scsi card edu mac disk ide bus pc apple
Cluster 7: com ca hp subject edu lines organization writes article like
Cluster 8: car cars com edu engine ford new dealer just oil
Cluster 9: sun monitor com video edu vga east card monitors microsystems
Cluster 10: nasa gov jpl larc gsfc jsc center fnal article writes
Cluster 11: windows dos file edu ms files program os com use
Cluster 12: netcom com edu cramer fbi sandvik 408 writes article people
Cluster 13: armenian turkish armenians armenia serdar argic turks turkey genocide soviet
Cluster 14: uiuc cso edu illinois urbana uxa university writes news cobb
Cluster 15: edu cs university posting host nntp state subject organization lines
Cluster 16: uk ac window mit server lines subject university com edu
Cluster 17: caltech edu keith gatech technology institute prism morality sgi livesey
Cluster 18: key clipper chip encryption com keys escrow government algorithm des
Cluster 19: people edu gun com government don like think just access
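A listing like this can be produced by mapping the k-means centroids back from the LSA space to the term space and printing the highest-weighted terms per cluster. Here is a sketch, assuming svd, km, vectorizer, and true_k from the snippet above:

# Map the centroids back to the original TF-IDF term space
original_space_centroids = svd.inverse_transform(km.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, " ".join(terms[ind] for ind in order_centroids[i, :10]))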
You can also apply Non-negative Matrix Factorization, which can be interpreted as clustering. Just take the largest component of each document in the transformed space and use it as the cluster assignment. Note that NMF requires a non-negative input, so it should be fit on the TF-IDF matrix rather than the LSA output.

In sklearn:
from sklearn.decomposition import NMF
# X must be non-negative here, i.e. the TF-IDF matrix rather than the LSA output
nmf = NMF(n_components=true_k, random_state=1).fit_transform(X)
labels = nmf.argmax(axis=1)
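If you also want the topic words from NMF, keep the fitted model around and read off its components; a minimal sketch, assuming vectorizer and true_k from above:

nmf_model = NMF(n_components=true_k, random_state=1)
W = nmf_model.fit_transform(X)
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(nmf_model.components_):
    top = comp.argsort()[::-1][:10]
    print("Topic %d:" % i, " ".join(terms[t] for t in top))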