How can I filter lines on load in the Pandas read_csv function?

Question 1

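How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?

Example: we have a CSV with a timestamp column and we would like to load only the rows whose timestamp is greater than a given constant.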

Question 2

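There is no option to filter the rows before the CSV file is loaded into a pandas object.

You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file. For example: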

import pandas as pd
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

You can vary chunksize to suit your available memory. See here for more details.
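The chunked approach above can be tried without a file on disk. The following is a minimal sketch, assuming a small in-memory CSV in place of 'file.csv'; the helper name load_filtered, the sample data, and the threshold value are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'file.csv'.
csv_text = "timestamp,value\n" + "\n".join(
    "{},{}".format(t, t * 10) for t in range(1, 11)
)
constant = 7  # hypothetical threshold


def load_filtered(source, column, threshold, chunksize=3):
    """Read `source` in chunks, keeping only rows where column > threshold."""
    chunks = pd.read_csv(source, chunksize=chunksize)
    # Each chunk is filtered before it is kept, so the unfiltered
    # file is never held in memory as a whole.
    kept = [chunk[chunk[column] > threshold] for chunk in chunks]
    return pd.concat(kept, ignore_index=True)


df = load_filtered(io.StringIO(csv_text), "timestamp", constant)
print(df)
```

A tiny chunksize is used here only so that the sample spans several chunks; for a real file a much larger value is usually more sensible.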

Question 3

I did not find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector, df[bool_vec]:

filtered = df[(df['timestamp'] > targettime)]

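This selects all rows in df for which the values in the timestamp column are greater than the value of targettime (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column named timestamp). Similar question.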
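For the comparison to work on real dates, the timestamp column has to be parsed as datetimes. The sketch below assumes a small in-memory CSV and uses the parse_dates argument of read_csv; the sample rows and the value of targettime are made up for the example.

```python
import io
import pandas as pd

# Hypothetical sample data with a datetime column named 'timestamp'.
csv_text = (
    "timestamp,value\n"
    "2020-01-01 00:00:00,1\n"
    "2020-01-02 00:00:00,2\n"
    "2020-01-03 00:00:00,3\n"
)

# parse_dates converts the column to datetime64 while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

targettime = pd.Timestamp("2020-01-01 12:00:00")  # hypothetical cutoff
bool_vec = df["timestamp"] > targettime           # one boolean per row
filtered = df[bool_vec]                           # keep rows where it is True
print(filtered)
```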

Question 4

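If the filtered range is contiguous (as it usually is with time(stamp) filters), then the fastest solution is to hard-code the range of rows. Simply combine skiprows=range(1, start_row) with the nrows=end_row parameter. The import then takes seconds, where the accepted solution would take minutes. A few experiments to find the initial start_row are not a big cost given the time saved on import. Note that the header row is kept by using range(1,..).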
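Here is a minimal sketch of the skiprows/nrows combination, assuming a small in-memory CSV. The names start_row and n_keep and the sample data are hypothetical. One detail worth keeping in mind is that nrows is the number of rows to read after the skipped ones, not an absolute line number in the file.

```python
import io
import pandas as pd

# Hypothetical sample: a header line followed by data lines 1..10.
csv_text = "timestamp,value\n" + "\n".join(
    "{},{}".format(i, i * i) for i in range(1, 11)
)

start_row = 4  # first data line to keep (the header is line 0)
n_keep = 3     # number of rows to read after skipping

df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, start_row),  # skip lines 1..3, keep the header
    nrows=n_keep,                  # then read only this many rows
)
print(df)
```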

Question 5

You can specify the nrows parameter.

import pandas as pd
df = pd.read_csv('file.csv', nrows=100)

This code works well in version 0.20.3.

Question 6

If you are on Linux, you can use grep.

# to import either on Python2 or Python3
import subprocess
import pandas as pd
from time import time # not needed just for timing
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO


def zgrep_data(f, string):
    '''grep multiple items f is filepath, string is what you are filtering for'''

    grep = 'grep' # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep,string,f))
    start_time = time()
    if string == '':
        # universal_newlines=True returns text rather than bytes on Python 3
        out = subprocess.check_output([grep, string, f], universal_newlines=True)
        grep_data = StringIO(out)
        data = pd.read_csv(grep_data, sep=',', header=0)

    else:
        # read only the first row to get the columns. May need to change depending on 
        # how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]    

        # universal_newlines=True returns text rather than bytes on Python 3
        out = subprocess.check_output([grep, string, f], universal_newlines=True)
        grep_data = StringIO(out)

        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

    print('{} finished for {} - {} seconds'.format(grep,f,time()-start_time))
    return data
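Where grep is not available, the same idea (discard non-matching lines before pandas parses anything) can be approximated in plain Python. This is a sketch of a substitute technique, not the grep-based method above: the function name read_csv_matching and the sample data are hypothetical, and it does simple substring matching rather than grep's regular expressions.

```python
import io
import pandas as pd


def read_csv_matching(lines, substring):
    """Keep the header plus every line containing `substring`, then parse."""
    lines = iter(lines)
    header = next(lines)  # the first line is assumed to be the header
    kept = [header] + [line for line in lines if substring in line]
    return pd.read_csv(io.StringIO("".join(kept)))


# Hypothetical sample; an open file object can be passed the same way.
sample = io.StringIO("name,value\nalpha,1\nbeta,2\nalphabet,3\n")
data = read_csv_matching(sample, "alpha")
print(data)
```

Like the grep version, this matches anywhere in the line, so a value in an unrelated column can also cause a row to be kept.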