How to filter a Pandas DataFrame using 'in' and 'not in', as in SQL

432

How can I achieve the equivalents of SQL's IN and NOT IN?

I have a list with the required values. Here is the scenario:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

# pseudo-code:
df[df['countries'] not in countries]

My current way of doing this is as follows:

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = pd.DataFrame({'countries':['UK','China'], 'matched':True})

# IN
df.merge(countries,how='inner',on='countries')

# NOT IN
not_in = df.merge(countries,how='left',on='countries')
not_in = not_in[pd.isnull(not_in['matched'])]

But this seems like a horrible kludge. Can anyone improve on it?

— LondonRob

1

I think your solution is the best one. It can cover IN and NOT IN across multiple columns.
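As a rough sketch of the multi-column case mentioned here, the merge-based approach from the question can drop the helper `matched` column by passing `indicator=True` to `merge`. The `year` column and the `wanted` lookup table are hypothetical, added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'UK', 'Germany', 'China'],
    'year': [2020, 2020, 2021, 2021],  # hypothetical second key column
})
# Hypothetical lookup table of (country, year) pairs to match against.
wanted = pd.DataFrame({'country': ['UK', 'China'], 'year': [2020, 2019]})

# indicator=True adds a `_merge` column; for a left join it is
# 'both' (pair found in `wanted`) or 'left_only' (pair not found).
merged = df.merge(wanted.drop_duplicates(), how='left',
                  on=['country', 'year'], indicator=True)

in_rows = merged.loc[merged['_merge'] == 'both', df.columns]           # IN
not_in_rows = merged.loc[merged['_merge'] == 'left_only', df.columns]  # NOT IN
```

Dropping duplicates from the lookup table first keeps the left join from repeating rows of `df`.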

— Bruce Jung

Do you want to test on a single column or on multiple columns?

— smci

1

Related (performance / pandas internals): Pandas pd.Series.isin performance with set versus array

— jpp

820

You can use pd.Series.isin.

For "IN", use: something.isin(somewhere)

Or for "NOT IN": ~something.isin(somewhere)

As a worked example:

>>> df
  countries
0        US
1        UK
2   Germany
3     China
>>> countries
['UK', 'China']
>>> df.countries.isin(countries)
0    False
1     True
2    False
3     True
Name: countries, dtype: bool
>>> df[df.countries.isin(countries)]
  countries
1        UK
3     China
>>> df[~df.countries.isin(countries)]
  countries
0        US
2   Germany
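The same session can be written as a standalone script. This is a minimal sketch; the list is named `wanted` here (a hypothetical name) so that it does not share a name with the `countries` column.

```python
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
wanted = ['UK', 'China']  # hypothetical name, distinct from the column name

# isin returns a boolean Series aligned with df's index.
mask = df['countries'].isin(wanted)

in_df = df[mask]       # SQL: WHERE countries IN ('UK', 'China')
not_in_df = df[~mask]  # SQL: WHERE countries NOT IN ('UK', 'China')
```

Computing the mask once and negating it with `~` avoids evaluating `isin` twice.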

— DSM

1

FYI, @LondonRob had his as a DataFrame and yours is a Series. DataFrame.isin was added in 0.13.

— TomAugspurger

Any suggestions for how to do this with pandas 0.12.0? It is the currently released version. (Maybe I should just wait for 0.13?)

If you are really dealing with a 1-dimensional array (as in your example), use a Series instead of a DataFrame, as @DSM did: df = pd.Series({'countries':['US','UK','Germany','China']})

— TomAugspurger

2

@TomAugspurger: As usual, I'm probably missing something. df, both mine and his, is a DataFrame. countries is a list. df[~df.countries.isin(countries)] produces a DataFrame, not a Series, and seems to work even back in 0.11.0.dev-14a04dd.

— DSM

7

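This answer is confusing because you keep reusing the countries variable. Granted, the OP does it and it is inherited from there, but the fact that something was done badly before does not justify doing it badly now.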

— ifly6

63

An alternative solution that uses the .query() method:

In [5]: df.query("countries in @countries")
Out[5]:
  countries
1        UK
3     China

In [6]: df.query("countries not in @countries")
Out[6]:
  countries
0        US
2   Germany
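For use outside an interactive session, the two queries can be wrapped in a small helper. This is a sketch only; the function name `split_by_membership` and the parameter names are hypothetical. The `@` prefix tells `query` to look up a Python variable from the enclosing scope instead of a column.

```python
import pandas as pd

def split_by_membership(frame, wanted):
    """Return (rows IN wanted, rows NOT IN wanted) for the 'countries' column."""
    # `@wanted` refers to the local Python variable, not to a column.
    in_rows = frame.query("countries in @wanted")
    not_in_rows = frame.query("countries not in @wanted")
    return in_rows, not_in_rows

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
in_df, not_in_df = split_by_membership(df, ['UK', 'China'])
```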

— MaxU

10

@LondonRob query is no longer experimental.

— Paul Rougieux, 2014

38

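How to implement 'in' and 'not in' for a pandas DataFrame?

Pandas offers two methods: Series.isin and DataFrame.isin, for Series and DataFrames respectively.

Filter a DataFrame based on ONE column (also applies to Series)

The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.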

import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3       NaN
4     China

c1 = ['UK', 'China']             # list
c2 = {'Germany'}                 # set
c3 = pd.Series(['China', 'US'])  # Series
c4 = np.array(['US', 'UK'])      # array

Series.isin accepts various types as input. The following are all valid ways of getting what you want:

df['countries'].isin(c1)

0    False
1     True
2    False
3    False
4     True
Name: countries, dtype: bool

# `in` operation
df[df['countries'].isin(c1)]

  countries
1        UK
4     China

# `not in` operation
df[~df['countries'].isin(c1)]

  countries
0        US
2   Germany
3       NaN

# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]

  countries
2   Germany

# Filter with another Series
df[df['countries'].isin(c3)]

  countries
0        US
4     China

# Filter with array
df[df['countries'].isin(c4)]

  countries
0        US
1        UK
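Note that the `not in` result above keeps the NaN row, whereas SQL's NOT IN does not return rows where the column is NULL. If the SQL behaviour is what you want, one possible sketch is to combine the negated mask with `notna()`. The name `excluded` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
excluded = ['UK', 'China']  # hypothetical name for the values to exclude

col = df['countries']

# Plain negation: NaN is not in the list, so the NaN row is kept.
not_in_keep_nan = df[~col.isin(excluded)]

# SQL-like NOT IN: additionally require a non-missing value.
not_in_drop_nan = df[~col.isin(excluded) & col.notna()]
```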

Filter on MANY columns

Sometimes you will want to apply an 'in' membership check with some search terms over multiple columns.

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2

   A    B  C
0  x    w  0
1  y    a  1
2  z  NaN  2
3  q    x  3

c1 = ['x', 'w', 'p']

To apply the isin condition to both columns "A" and "B", use DataFrame.isin:

df2[['A', 'B']].isin(c1)

      A      B
0   True   True
1  False  False
2  False  False
3  False   True

From this, to retain rows where at least one column is True, we can use any along axis 1 (that is, across the columns of each row):

df2[['A', 'B']].isin(c1).any(axis=1)

0     True
1    False
2    False
3     True
dtype: bool

df2[df2[['A', 'B']].isin(c1).any(axis=1)]

   A  B  C
0  x  w  0
3  q  x  3

Note that if you want to search every column, you can simply omit the column selection step and do

df2.isin(c1).any(axis=1)

Similarly, to retain rows where ALL columns are True, use all in the same manner as before.

df2[df2[['A', 'B']].isin(c1).all(axis=1)]

   A  B  C
0  x  w  0
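When each column needs its own set of search terms, `DataFrame.isin` also accepts a dict that maps column names to values. A minimal sketch follows; the `per_column` mapping is a hypothetical example, and columns that are not listed in the dict come back as all False.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})

# Hypothetical per-column search terms: keys are column names.
per_column = {'A': ['x', 'q'], 'B': ['x']}

hits = df2.isin(per_column)  # same shape as df2; column 'C' is all False

# Keep rows where every listed column matched its own terms.
matched = df2[hits[list(per_column)].all(axis=1)]
```

Selecting only the listed columns before calling `all` matters here, because the unlisted column `C` would otherwise make every row fail.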

Notable mentions: numpy.isin, query, list comprehensions (string data)

In addition to the methods described above, you can also use the NumPy equivalent: numpy.isin.

# `in` operation
df[np.isin(df['countries'], c1)]

  countries
1        UK
4     China

# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]

  countries
0        US
2   Germany
3       NaN

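Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an element-wise operation that does not depend on index alignment, there are very few situations where this method is not an appropriate replacement for pandas' isin.

Pandas routines are usually iterative when working with strings, because string operations are hard to vectorize. There is a lot of evidence to suggest that a list comprehension will be faster here. We resort to an in check now.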
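As a quick sanity check of the claim that the two are interchangeable here, the following sketch compares the masks produced by `np.isin` and `Series.isin` on NaN-free string data. The variable names are hypothetical, and no timing claim is made.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', 'China']})
wanted = ['UK', 'China']

values = df['countries'].to_numpy()  # plain NumPy array, no index attached

in_mask = np.isin(values, wanted)                   # IN
not_in_mask = np.isin(values, wanted, invert=True)  # NOT IN

# Both approaches should yield the same boolean mask for this data.
same_as_pandas = bool((in_mask == df['countries'].isin(wanted).to_numpy()).all())

in_df = df[in_mask]
```

Because the NumPy mask carries no index, it is applied positionally, which is fine as long as it was computed from the same frame.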

c1_set = set(c1) # Using `in` with `sets` is a constant time operation... 
                 # This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]

  countries
1        UK
4     China

# `not in` operation
df[[x not in c1_set for x in df['countries']]]

  countries
0        US
2   Germany
3       NaN

However, it is a lot more unwieldy to specify, so don't use it unless you know what you are doing.

Lastly, there is also DataFrame.query, which has been covered in another answer. numexpr FTW!
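One part of that unwieldiness is that a list comprehension yields a plain Python list, which has no index. A possible sketch, assuming you want to combine the membership test with another condition, is to wrap the list in a Series that shares the frame's index. The `score` column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'countries': ['US', 'UK', 'Germany', 'China'],
    'score': [1, 2, 3, 4],  # hypothetical numeric column for illustration
})
wanted_set = {'UK', 'China'}

# Wrap the list in a Series aligned with df so it can be combined with `&`.
in_mask = pd.Series([x in wanted_set for x in df['countries']], index=df.index)

result = df[in_mask & (df['score'] > 2)]
```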

— cs95

I like it, but what if I want to check whether a column in df3 isin a column of df1? What would that look like?

— Arthur D. Howland

12

In general, I have been doing generic filtering over rows like this:

criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]

— Kos

10

FYI, this is much slower than @DSM's vectorized solution.

— Jeff

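@Jeff I'd expect that, but that is what I fall back to when I need to filter on something that is not available in pandas directly. (I was about to say "like .startswith or regex matching", but then I found out about Series.str!)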

— Kos

7

I wanted to filter out the dfbc rows whose BUSINESS_ID also appears in the BUSINESS_ID column of dfProfilesBusIds.

dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]

— Sam Henderson

5

False

— OneCricketeer

6

Collating the possible solutions from the answers:

For IN: df[df['A'].isin([3, 6])]

For NOT IN:

df[-df["A"].isin([3, 6])]
df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]

— Abhishek Gaur

3

This mostly repeats information from other answers. Using logical_not is a mouthful of an equivalent to the ~ operator.

— cs95

3

df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']

Implementing in:

df[df.countries.isin(countries)]

Implementing not in, expressed as in over the remaining countries:

df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]

— Ioannis Nasios
