Transform multiple categorical columns

10

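My dataset has two categorical columns that I would like to encode as numbers. Both columns contain countries, and some countries overlap (they appear in both columns). I would like the same country to get the same number in column1 and column2.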

My data looks roughly like this:

import pandas as pd

d = {'col1': ['NL', 'BE', 'FR', 'BE'], 'col2': ['BE', 'NL', 'ES', 'ES']}
df = pd.DataFrame(data=d)
df

Currently I am transforming the data like this:

from sklearn.preprocessing import LabelEncoder
df.apply(LabelEncoder().fit_transform)

However, this makes no distinction between FR and ES. Is there another simple way to get output like the following?

o = {'col1': [2,0,1,0], 'col2': [0,2,4,4]}
output = pd.DataFrame(data=o)
output
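
To make the problem concrete, here is a small sketch of what per-column encoding does to this data. It uses pandas category codes as a stand-in for a per-column LabelEncoder (an assumption made to keep the example dependency-free; both number the sorted unique labels of each column separately). The names `per_column`, `fr_code` and `es_code` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['NL', 'BE', 'FR', 'BE'],
                   'col2': ['BE', 'NL', 'ES', 'ES']})

# Encode each column on its own: codes are only meaningful within one column.
per_column = df.apply(lambda col: col.astype('category').cat.codes)

# FR (row 2, col1) and ES (row 2, col2) end up with the same code,
# because each is the second label alphabetically in its own column.
fr_code = per_column.loc[2, 'col1']
es_code = per_column.loc[2, 'col2']
print(per_column)
print(fr_code == es_code)
```

Since FR only occurs in col1 and ES only in col2, nothing in a per-column encoding keeps their codes apart.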

— Tox

8

Here is one way:

df.stack().astype('category').cat.codes.unstack()
Out[190]: 
   col1  col2
0     3     0
1     0     3
2     2     1
3     0     1

Or:

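s = df.stack()              # long format: one value per (row, column) pair
s[:] = s.factorize()[0]     # codes are assigned in order of first appearance
s.unstack()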
Out[196]: 
   col1  col2
0     0     1
1     1     0
2     2     3
3     1     3
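
If the same numbering has to be reused later (for example on another frame), it may help to keep the label-to-code mapping explicit. The following is only a sketch along the lines of the factorize approach above; the names `mapping` and `encoded` are illustrative, and it flattens the values with `to_numpy().ravel()` rather than `stack()`, which visits the cells in the same row-by-row order.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['NL', 'BE', 'FR', 'BE'],
                   'col2': ['BE', 'NL', 'ES', 'ES']})

# Factorize all cells at once so both columns share one set of codes.
codes, uniques = pd.factorize(df.to_numpy().ravel())

# Explicit label -> code dictionary, in order of first appearance.
mapping = {label: code for code, label in enumerate(uniques)}

# Apply the shared mapping to every column.
encoded = df.apply(lambda col: col.map(mapping))
print(mapping)
print(encoded)
```

Labels missing from `mapping` would come out as NaN with `Series.map`, so unseen countries need to be handled separately.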

— YOBEN_S

5

You can first fit LabelEncoder() on the unique values in your dataframe and then transform each column with it.

le = LabelEncoder()
le.fit(pd.concat([df.col1, df.col2]).unique()) # or np.unique(df.values.reshape(-1,1))

df.apply(le.transform)
Out[28]: 
   col1  col2
0     3     0
1     0     3
2     2     1
3     0     1
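
One possible advantage of a single fitted encoder is that it can also map the codes back to labels. A minimal sketch, assuming scikit-learn is installed; here the encoder is fitted on `np.unique(df.to_numpy())`, and the names `encoded`, `classes` and `decoded` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col1': ['NL', 'BE', 'FR', 'BE'],
                   'col2': ['BE', 'NL', 'ES', 'ES']})

le = LabelEncoder()
le.fit(np.unique(df.to_numpy()))   # fit once on the labels of both columns

encoded = df.apply(le.transform)   # same country -> same code in every column
classes = list(le.classes_)        # position in this list is the code

# Round trip: turn the codes back into country labels.
decoded = encoded.apply(le.inverse_transform)
print(classes)
print(decoded)
```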

— Michael Gardner

2

Use np.unique with return_inverse, though you then need to reconstruct the DataFrame.

import numpy as np

pd.DataFrame(np.unique(df, return_inverse=True)[1].reshape(df.shape),
             index=df.index,
             columns=df.columns)

   col1  col2
0     3     0
1     0     3
2     2     1
3     0     1
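
The first array returned by np.unique holds the sorted labels, so it doubles as a lookup table for decoding. A short sketch under that reading; the names `uniques`, `inverse`, `codes` and `decoded` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['NL', 'BE', 'FR', 'BE'],
                   'col2': ['BE', 'NL', 'ES', 'ES']})

# uniques: sorted distinct labels; inverse: index into uniques for each cell.
uniques, inverse = np.unique(df.to_numpy(), return_inverse=True)

codes = pd.DataFrame(np.asarray(inverse).reshape(df.shape),
                     index=df.index, columns=df.columns)

# Indexing uniques with the codes recovers the original labels.
decoded = pd.DataFrame(uniques[codes.to_numpy()],
                       index=df.index, columns=df.columns)
print(list(uniques))
print(decoded)
```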

— ALollz