그래프 구조화 된 데이터에 Scikit-Learn 레이블 전파를 사용하는 방법은 무엇입니까?

연구의 일환으로 그래프에서 레이블 전파를 수행하는 데 관심이 있습니다. 특히이 두 가지 방법에 관심이 있습니다.

Xiaojin Zhu와 Zoubin Ghahramani. 레이블 전파를 통해 레이블이있는 데이터와 레이블이없는 데이터로부터 학습 기술 보고서 CMU-CALD-02-107, 2002 년 카네기 멜론 대학교 http://pages.cs.wisc.edu/~jerryzhu/pub/CMU-CALD-02-107.pdf
Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schoelkopf. 지역 및 글로벌 일관성을 갖춘 학습 (2004) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3219

나는 scikit-learn이 그것을 할 모델 을 제공하는 것을 보았습니다 . 그러나이 모델은 벡터 구조화 된 데이터 ( 즉, 데이터 포인트) 에 적용되어야합니다 .

이 모델은 커널을 사용하여 데이터 포인트에서 선호도 매트릭스를 작성한 다음 구성된 매트릭스에서 알고리즘을 실행합니다. 유사성 행렬 대신 그래프의 인접 행렬을 직접 입력하고 싶습니다.

그것을 달성하는 방법에 대한 아이디어가 있습니까? 또는 앞서 언급 한 두 가지 방법에 대해 그래프 구조화 된 데이터에서 레이블 전파를 직접 실행할 수있는 Python 라이브러리를 알고 있습니까?

도움을 주셔서 감사합니다.

scikit-learn graphs

— 티바 우드 마르티네즈
소스

선호도 행렬을 계산 한 후 Scikit-learn 소스 코드를 검사하여 소스 코드의 기능을 확인 했습니까? 해당 부분 뒤에있는 코드를 "복사"하여 인접 매트릭스에 직접 적용 할 수 있습니다.

— Tasos

귀하의 의견에 감사드립니다! 실제로 이것은 현재 내가하고있는 일이지만 내 요구에 맞게 수정 해야하는 코드의 일부 부분은 다소 암호입니다. 해당 부분을 다시 쓰면 오류가 발생하기를 두려워합니다. 더 간단한 방법이 있기를 바랐습니다.

— Thibaud Martinez

github.com/scikit-learn/scikit-learn/blob/7389dba/sklearn/… 의 소스 코드는 구현이 _build_graph 메소드를 재정의해야한다고 말합니다. 따라서 사전 계산 된 행렬을 허용하는 파생 클래스를 만들어야합니다.

— mikalai

일부 독자에게 도움이되기를 바랍니다.

Scikit-learn은 주로 벡터 구조화 된 데이터를 처리하도록 설계되었습니다. 따라서 그래프 구조의 데이터에 대해 레이블 전파 / 라벨 확산을 수행하려면 Scikit 인터페이스를 사용하는 대신 방법을 직접 구현하는 것이 좋습니다.

다음은 PyTorch에서 레이블 전파 및 레이블 확산을 구현 한 것입니다.

두 가지 방법은 전체적으로 동일한 알고리즘 단계를 따르며 인접 행렬이 정규화되는 방식과 각 단계에서 레이블이 전파되는 방식이 다양합니다. 따라서 두 모델의 기본 클래스를 만들어 봅시다.

from abc import abstractmethod
import torch

class BaseLabelPropagation:
    """Base class for label propagation models.

    Parameters
    ----------
    adj_matrix: torch.FloatTensor
        Adjacency matrix of the graph.
    """
    def __init__(self, adj_matrix):
        self.norm_adj_matrix = self._normalize(adj_matrix)
        self.n_nodes = adj_matrix.size(0)
        self.one_hot_labels = None 
        self.n_classes = None
        self.labeled_mask = None
        self.predictions = None

    @staticmethod
    @abstractmethod
    def _normalize(adj_matrix):
        raise NotImplementedError("_normalize must be implemented")

    @abstractmethod
    def _propagate(self):
        raise NotImplementedError("_propagate must be implemented")

    def _one_hot_encode(self, labels):
        # Get the number of classes
        classes = torch.unique(labels)
        classes = classes[classes != -1]
        self.n_classes = classes.size(0)

        # One-hot encode labeled data instances and zero rows corresponding to unlabeled instances
        unlabeled_mask = (labels == -1)
        labels = labels.clone()  # defensive copying
        labels[unlabeled_mask] = 0
        self.one_hot_labels = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)
        self.one_hot_labels = self.one_hot_labels.scatter(1, labels.unsqueeze(1), 1)
        self.one_hot_labels[unlabeled_mask, 0] = 0

        self.labeled_mask = ~unlabeled_mask

    def fit(self, labels, max_iter, tol):
        """Fits a semi-supervised learning label propagation model.

        labels: torch.LongTensor
            Tensor of size n_nodes indicating the class number of each node.
            Unlabeled nodes are denoted with -1.
        max_iter: int
            Maximum number of iterations allowed.
        tol: float
            Convergence tolerance: threshold to consider the system at steady state.
        """
        self._one_hot_encode(labels)

        self.predictions = self.one_hot_labels.clone()
        prev_predictions = torch.zeros((self.n_nodes, self.n_classes), dtype=torch.float)

        for i in range(max_iter):
            # Stop iterations if the system is considered at a steady state
            variation = torch.abs(self.predictions - prev_predictions).sum().item()

            if variation < tol:
                print(f"The method stopped after {i} iterations, variation={variation:.4f}.")
                break

            prev_predictions = self.predictions
            self._propagate()

    def predict(self):
        return self.predictions

    def predict_classes(self):
        return self.predictions.max(dim=1).indices

모델은 노드의 레이블뿐만 아니라 그래프의 인접 행렬을 입력으로 사용합니다. 레이블은 레이블이없는 노드의 위치에 -1이있는 각 노드의 클래스 번호를 나타내는 정수 벡터 형식입니다.

라벨 전파 알고리즘은 다음과 같습니다.

\begin{array}{l} 여 : 그래프의 인접 행렬 대각도 행렬 계산 디 으로 디_{나는 나는} \leftarrow \sum_{제이} 여_{나는 제이} \\ 초기화 {\hat{와이}}^{(0)} \leftarrow ({와이}_{1}, \dots, {와이}_{엘}, 0, 0, \dots, 0) \\ 반복 \\ 1. {\hat{와이}}^{(티 + 1)} \leftarrow 디^{- 1} 여 {\hat{와이}}^{(티)} \\ 2. {\hat{와이}}_{엘}^{(티 + 1)} \leftarrow {와이}_{엘} \\ 수렴 할 때까지 {\hat{와이}}^{(\infty)} \\ 라벨 포인트 {엑스}_{나는} 의 표시로 {\hat{와이}}_{나는}^{(\infty)} \end{array}

$\begin{array}{l}{ \mathbf{W} \text {: adjacency matrix of the graph} \\ \text { Compute the diagonal degree matrix } \mathbf{D} \text { by } \mathbf{D}_{i i} \leftarrow \sum_{j} W_{i j}} \\ {\text { Initialize } \hat{Y}^{(0)} \leftarrow\left(y_{1}, \ldots, y_{l}, 0,0, \ldots, 0\right)} \\ {\text { Iterate }} \\ {\text { 1. } \hat{Y}^{(t+1)} \leftarrow \mathbf{D}^{-1} \mathbf{W} \hat{Y}^{(t)}} \\ {\text { 2. } \hat{Y}_{l}^{(t+1)} \leftarrow Y_{l}} \\ {\text { until convergence to } \hat{Y}^{(\infty)}} \\ {\text { Label point } x_{i} \text { by the sign of } \hat{y}_{i}^{(\infty)}}\end{array}$

에서 Xiaojin Zhu의와 주빈 가아라 마니. 레이블 전파를 통해 레이블이있는 데이터와 레이블이없는 데이터로부터 학습 기술 보고서 CMU-CALD-02-107, Carnegie Mellon University, 2002

우리는 다음과 같은 구현을 얻습니다.

class LabelPropagation(BaseLabelPropagation):
    def __init__(self, adj_matrix):
        super().__init__(adj_matrix)

    @staticmethod
    def _normalize(adj_matrix):
        """Computes D^-1 * W"""
        degs = adj_matrix.sum(dim=1)
        degs[degs == 0] = 1  # avoid division by 0 error
        return adj_matrix / degs[:, None]

    def _propagate(self):
        self.predictions = torch.matmul(self.norm_adj_matrix, self.predictions)

        # Put back already known labels
        self.predictions[self.labeled_mask] = self.one_hot_labels[self.labeled_mask]

    def fit(self, labels, max_iter=1000, tol=1e-3):
        super().fit(labels, max_iter, tol)

라벨 확산 알고리즘은 다음과 같습니다.

\begin{array}{l} 여 : 그래프의 인접 행렬 대각도 행렬 계산 디 으로 디_{나는 나는} \leftarrow \sum_{제이} 여_{나는 제이} \\ 정규화 된 그래프 라플라시안 계산 엘 \leftarrow 디^{- 1 / 2} 여 디^{- 1 / 2} \\ 초기화 {\hat{와이}}^{(0)} \leftarrow ({와이}_{1}, \dots, {와이}_{엘}, 0, 0, \dots, 0) \\ 매개 변수를 선택하십시오 α \in [0, 1) \\ 반복 \hat{와이} (티 + 1) \leftarrow α 엘 {\hat{와이}}^{(티)} + (1 - α) {\hat{와이}}^{(0)} 수렴 할 때까지 {\hat{와이}}^{(\infty)} \\ 라벨 포인트 {엑스}_{나는} 의 표시로 {\hat{와이}}_{나는}^{(\infty)} \end{array}

$\begin{array}{l}{ \mathbf{W} \text {: adjacency matrix of the graph} \\ \text { Compute the diagonal degree matrix } \mathbf{D} \text { by } \mathbf{D}_{i i} \leftarrow \sum_{j} W_{i j}} \\ {\text { Compute the normalized graph Laplacian } \\ \mathcal{L} \leftarrow \mathbf{D}^{-1 / 2} \mathbf{W} \mathbf{D}^{-1 / 2}} \\ {\text { Initialize } \hat{Y}^{(0)} \leftarrow\left(y_{1}, \ldots, y_{l}, 0,0, \ldots, 0\right)} \\ {\text { Choose a parameter } \alpha \in[0,1)} \\ {\text { Iterate } \hat{Y}(t+1) \leftarrow \alpha \mathcal{L} \hat{Y}^{(t)}+(1-\alpha) \hat{Y}^{(0)} \text { until convergence to } \hat{Y}^{(\infty)}} \\ {\text { Label point } x_{i} \text { by the sign of } \hat{y}_{i}^{(\infty)}} \end{array}$

에서 Dengyong 저우, 올리비에 보스 켓, 토마스 나빈 랄, 제이슨 웨스턴, 베른하르트 Schoelkopf. 지역적 및 세계적 일관성을 갖춘 학습 (2004)

따라서 구현은 다음과 같습니다.

class LabelSpreading(BaseLabelPropagation):
    def __init__(self, adj_matrix):
        super().__init__(adj_matrix)
        self.alpha = None

    @staticmethod
    def _normalize(adj_matrix):
        """Computes D^-1/2 * W * D^-1/2"""
        degs = adj_matrix.sum(dim=1)
        norm = torch.pow(degs, -0.5)
        norm[torch.isinf(norm)] = 1
        return adj_matrix * norm[:, None] * norm[None, :]

    def _propagate(self):
        self.predictions = (
            self.alpha * torch.matmul(self.norm_adj_matrix, self.predictions)
            + (1 - self.alpha) * self.one_hot_labels
        )

    def fit(self, labels, max_iter=1000, tol=1e-3, alpha=0.5):
        """
        Parameters
        ----------
        alpha: float
            Clamping factor.
        """
        self.alpha = alpha
        super().fit(labels, max_iter, tol)

합성 데이터에 대한 전파 모델을 테스트 해 봅시다. 이를 위해 원시인 그래프 를 사용하기로 결정했습니다 .

import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

# Create caveman graph
n_cliques = 4
size_cliques = 10
caveman_graph = nx.connected_caveman_graph(n_cliques, size_cliques)
adj_matrix = nx.adjacency_matrix(caveman_graph).toarray()

# Create labels
labels = np.full(n_cliques * size_cliques, -1.)

# Only one node per clique is labeled. Each clique belongs to a different class.
labels[0] = 0
labels[size_cliques] = 1
labels[size_cliques * 2] = 2
labels[size_cliques * 3] = 3

# Create input tensors
adj_matrix_t = torch.FloatTensor(adj_matrix)
labels_t = torch.LongTensor(labels)

# Learn with Label Propagation
label_propagation = LabelPropagation(adj_matrix_t)
label_propagation.fit(labels_t)
label_propagation_output_labels = label_propagation.predict_classes()

# Learn with Label Spreading
label_spreading = LabelSpreading(adj_matrix_t)
label_spreading.fit(labels_t, alpha=0.8)
label_spreading_output_labels = label_spreading.predict_classes()

# Plot graphs
color_map = {-1: "orange", 0: "blue", 1: "green", 2: "red", 3: "cyan"}
input_labels_colors = [color_map[l] for l in labels]
lprop_labels_colors = [color_map[l] for l in label_propagation_output_labels.numpy()]
lspread_labels_colors = [color_map[l] for l in label_spreading_output_labels.numpy()]

plt.figure(figsize=(14, 6))
ax1 = plt.subplot(1, 4, 1)
ax2 = plt.subplot(1, 4, 2)
ax3 = plt.subplot(1, 4, 3)

ax1.title.set_text("Raw data (4 classes)")
ax2.title.set_text("Label Propagation")
ax3.title.set_text("Label Spreading")

pos = nx.spring_layout(caveman_graph)
nx.draw(caveman_graph, ax=ax1, pos=pos, node_color=input_labels_colors, node_size=50)
nx.draw(caveman_graph, ax=ax2, pos=pos, node_color=lprop_labels_colors, node_size=50)
nx.draw(caveman_graph, ax=ax3, pos=pos, node_color=lspread_labels_colors, node_size=50)

# Legend
ax4 = plt.subplot(1, 4, 4)
ax4.axis("off")
legend_colors = ["orange", "blue", "green", "red", "cyan"]
legend_labels = ["unlabeled", "class 0", "class 1", "class 2", "class 3"]
dummy_legend = [ax4.plot([], [], ls='-', c=c)[0] for c in legend_colors]
plt.legend(dummy_legend, legend_labels)

plt.show()

구현 된 모델은 올바르게 작동하며 그래프에서 커뮤니티를 감지 할 수 있습니다.

참고 : 제시된 전파 방법은 무 방향 그래프에 사용됩니다.

코드는 대화식 Jupyter 노트북으로 제공 됩니다 .

— 티바 우드 M
소스