Visualizing items frequently purchased together

10

I have a data set in a CSV file with the following structure:

Banana  Water   Rice
Rice    Water
Bread   Banana  Juice

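Each row represents a set of items that were purchased together. For example, the first row indicates that Banana, Water, and Rice were purchased together.

I would like to create a visualization like the following:

This is basically a grid chart, but I need a tool (Python or R) that can read the input structure and produce a chart like the one above as output.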

— João_testeSW

6

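I think what you probably want is a discrete version of a heat map. For example, see below. Red cells mark the items most commonly purchased together, while green cells mark items that are never purchased together.

This is actually quite easy to put together with Pandas DataFrames and matplotlib.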

import numpy as np
from pandas import DataFrame
import matplotlib
matplotlib.use('agg') # Write figure to disk instead of displaying (for Windows Subsystem for Linux)
import matplotlib.pyplot as plt

####
# Get data into a data frame
####
data = [
  ['Banana', 'Water', 'Rice'],
  ['Rice', 'Water'],
  ['Bread', 'Banana', 'Juice'],
]

# Convert the input into a 2D dictionary
freqMap = {}
for line in data:
  for item in line:
    if item not in freqMap:
      freqMap[item] = {}

    for other_item in line:
      # The nested loops visit every ordered pair (item, other_item) once,
      # so a single increment keeps the matrix symmetric without doubling
      # the counts. The diagonal holds how many baskets contain each item.
      freqMap[item][other_item] = freqMap[item].get(other_item, 0) + 1

df = DataFrame(freqMap).T.fillna(0)
print(df)

#####
# Create the plot
#####
plt.pcolormesh(df, edgecolors='black')
plt.yticks(np.arange(0.5, len(df.index), 1), df.index)
plt.xticks(np.arange(0.5, len(df.columns), 1), df.columns)
plt.savefig('plot.png')
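
The data above is hard-coded. To load it from the file described in the question instead, something like the following might work. This is only a sketch: it assumes the items on each row are separated by tabs (the question calls the file a CSV but shows whitespace-separated columns), and the function name `read_baskets` is made up for illustration.

```python
import csv


def read_baskets(path, delimiter="\t"):
    """Return one list of item names per non-empty row of the file.

    Assumption: items on a row are separated by `delimiter` (a tab by
    default). Pass delimiter="," for a comma-separated file.
    """
    baskets = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            # Drop empty cells left by trailing or doubled delimiters.
            items = [cell.strip() for cell in row if cell.strip()]
            if items:
                baskets.append(items)
    return baskets
```

With a helper like this, the hard-coded list could be replaced by `data = read_baskets("baskets.tsv")`, where the file name is again only a placeholder.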

— apnorton

Many thanks :) Is it possible to create this using Spark MLlib?

— João_testeSW

@João_testeSW Probably, but I'm not familiar with Spark.

— apnorton

Do you recommend an IDE for running this code?

— João_testeSW

@João_testeSW If you save it to a file as "somescript.py", you can run it from a terminal with "python3 somescript.py". You don't need an IDE, but it will also run if you load it into any IDE with Python support.

— apnorton

Thanks ;) I'll see if I can use it in PySpark; if so, I can edit the post with the solution ;)

— João_testeSW

3

For R, you can use the arulesViz library. It has excellent documentation, and page 12 has an example of how to create this kind of visualization.

The code for it is as simple as this:

plot(rules, method="grouped")

— 혼자 브

Not exactly what the OP is looking for, but here is a great visualization example that uses this library: algobeans.com/2016/04/01/…

— user35581

0

With the Wolfram Language in Mathematica.

data = {{"Banana", "Water", "Rice"},
        {"Rice", "Water"},
        {"Bread", "Banana", "Juice"}};

Get the pair counts.

counts = Sort /@ Flatten[Subsets[#, {2}] & /@ data, 1] // Tally

{{{"Banana", "Water"}, 1}, {{"Banana", "Rice"}, 1}, 
 {{"Rice", "Water"}, 2}, {{"Banana", "Bread"}, 1}, 
 {{"Bread", "Juice"}, 1}, {{"Banana", "Juice"}, 1}}
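
For readers without Mathematica, the same tally can be sketched in Python. This is only a rough equivalent of the `Subsets`/`Tally` step above, not the author's code, and the name `pair_counts` is made up for illustration.

```python
from collections import Counter
from itertools import combinations

data = [
    ["Banana", "Water", "Rice"],
    ["Rice", "Water"],
    ["Bread", "Banana", "Juice"],
]

# Count each unordered pair once per basket. Sorting the pair makes
# ("Water", "Rice") and ("Rice", "Water") land on the same key, and
# set(basket) guards against an item listed twice in one basket.
pair_counts = Counter(
    tuple(sorted(pair))
    for basket in data
    for pair in combinations(set(basket), 2)
)

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

As in the Mathematica output, Rice and Water is the only pair that occurs twice; the other five pairs occur once each.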

Get the indices for the named ticks.

indices = Thread[# -> Range[Length@#]] &@Sort@DeleteDuplicates@Flatten[data]

{"Banana" -> 1, "Bread" -> 2, "Juice" -> 3, "Rice" -> 4, "Water" -> 5}

Plot using MatrixPlot with a SparseArray. ArrayPlot could also be used.

MatrixPlot[
 SparseArray[Rule @@@ counts /. indices, ConstantArray[Length@indices, 2]],
 FrameTicks -> With[{t = {#2, #1} & @@@ indices}, {{t, None}, {t, None}}],
 PlotLegends -> Automatic
 ]

Note that it is upper triangular.
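
The matrix comes out upper triangular because each pair is stored only once, with its two names in sorted order, so the row index is always smaller than the column index. The Python sketch below illustrates this and shows one way to mirror the counts into a symmetric matrix. It is an illustration under my own naming (`index`, `upper`, `symmetric`), not a translation of the plotting code, and it uses 0-based indices where Mathematica uses 1-based ones.

```python
from collections import Counter
from itertools import combinations

data = [
    ["Banana", "Water", "Rice"],
    ["Rice", "Water"],
    ["Bread", "Banana", "Juice"],
]

# Map each item name to a row/column position (alphabetical, 0-based).
items = sorted({item for basket in data for item in basket})
index = {name: position for position, name in enumerate(items)}

# Tally unordered pairs, keyed by the alphabetically sorted pair.
counts = Counter(
    tuple(sorted(pair))
    for basket in data
    for pair in combinations(set(basket), 2)
)

# Because every key is sorted, index[a] < index[b] always holds,
# so only cells above the diagonal are ever filled.
size = len(items)
upper = [[0] * size for _ in range(size)]
for (a, b), count in counts.items():
    upper[index[a]][index[b]] = count

# Mirror across the diagonal to get a symmetric co-occurrence matrix.
symmetric = [
    [upper[i][j] + upper[j][i] for j in range(size)]
    for i in range(size)
]

for name, row in zip(items, symmetric):
    print(f"{name:<8}", row)
```

Plotting `symmetric` instead of `upper` would fill both halves of the grid, which may be closer to the chart in the question.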

Hope this helps.

— Edmund

0

You can do this in Python with the seaborn visualization library (built on top of matplotlib).

data = [
  ['Banana', 'Water', 'Rice'],
  ['Rice', 'Water'],
  ['Bread', 'Banana', 'Juice'],
]

# Pull out combinations
from itertools import combinations
data_pairs = []
for d in data:
    data_pairs += [list(sorted(x)) + [1] for x in combinations(d, 2)]
    # Add reverse as well (this will mirror the heatmap)
    data_pairs += [list(sorted(x))[::-1] + [1] for x in combinations(d, 2)]

# Shape into dataframe
import pandas as pd
df = pd.DataFrame(data_pairs)
df_zeros = pd.DataFrame([list(x) + [0] for x in combinations(df[[0, 1]].values.flatten(), 2)])
df = pd.concat((df, df_zeros))
# pivot() takes keyword arguments only in recent pandas versions
df = df.groupby([0, 1])[2].sum().reset_index().pivot(index=0, columns=1, values=2).fillna(0)

import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df, cmap='YlGnBu')
plt.show()

The final dataframe df looks like this:

The resulting visualization looks like this:

— AlexG