Fast extraction of a time range from a syslog log file?

12

I have a log file in the standard syslog format. It looks like this, except with hundreds of lines per second:

Jan 11 07:48:46 blahblahblah...
Jan 11 07:49:00 blahblahblah...
Jan 11 07:50:13 blahblahblah...
Jan 11 07:51:22 blahblahblah...
Jan 11 07:58:04 blahblahblah...

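It doesn't roll over at exactly midnight, but it will never contain more than two days.

I often have to extract a time slice from this file. I'd like to write a general-purpose script that I can call like this: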

$ timegrep 22:30-02:00 /logs/something.log

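... to pull out the lines from 22:30, across the midnight boundary, until 02:00 the next day.

There are a few caveats:

I don't want to have to type the date(s) on the command line, only the times. The program should be smart enough to figure them out.
The log date format doesn't include the year, so the program should guess based on the current year, but still do the right thing around New Year's Day.
I want it to be fast: it should use the fact that the lines are in order to seek around in the file and use a binary search.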
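
One way to handle the second caveat is to try the current year first and fall back to the previous year whenever the result would lie in the future. The following is only a sketch under that assumption, written for Python 3; guess_year and its now parameter are hypothetical names, and month names are assumed to be parsed in an English locale.

```python
from datetime import datetime


def guess_year(stamp, now):
    """Parse a year-less syslog timestamp such as 'Dec 31 23:59:58'.

    Assumes that log entries are never dated in the future: if the
    current year would place the entry after `now`, the previous
    year is used instead.
    """
    for year in (now.year, now.year - 1):
        try:
            # Prepend the candidate year so strptime sees a full date.
            parsed = datetime.strptime("%d %s" % (year, stamp),
                                       "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # e.g. 'Feb 29' in a year that is not a leap year
        if parsed <= now:
            return parsed
    raise ValueError("cannot place %r at or before %s" % (stamp, now))
```

Called shortly after midnight on New Year's Day, a 'Dec 31' entry is therefore assigned to the previous year, while a 'Jan 01' entry stays in the current one.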

Before I spend a lot of time writing this, does it already exist?

linux log-files grep

— Mike

9

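Update: I've replaced the original code with an updated version that has numerous improvements. Let's call this (actual?) alpha quality.

This version includes:

command-line option handling
command-line date format validation
some try blocks
line reading moved into a function

Original text:

Well, what do you know? "Seek" and ye shall find! Here is a Python program that seeks around in the file and uses a more-or-less binary search. It's considerably faster than that AWK script the other guy wrote.

It's alpha quality. It should have try blocks, input validation and lots of testing, and it could no doubt be more Pythonic. But here it is for your amusement. Oh, and it's written for Python 2.6.

New code: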

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# timegrep.py by Dennis Williamson 20100113
# in response to http://serverfault.com/questions/101744/fast-extraction-of-a-time-range-from-syslog-logfile

# thanks to serverfault user http://serverfault.com/users/1545/mike
# for the inspiration

# Perform a binary search through a log file to find a range of times
# and print the corresponding lines

# tested with Python 2.6

# TODO: Make sure that it works if the seek falls in the middle of
#       the first or last line
# TODO: Make sure it's not blind to a line where the sync read falls
#       exactly at the beginning of the line being searched for and
#       then gets skipped by the second read
# TODO: accept arbitrary date

# done: add -l long and -s short options
# done: test time format

version = "0.01a"

import os, sys
from stat import *
from datetime import date, datetime
import re
from optparse import OptionParser

# Function to read lines from file and extract the date and time
def getdata():
    """Read a line from a file

    Return a tuple containing:
        the date/time in a format such as 'Jan 15 20:14:01'
        the line itself

    The last colon and seconds are optional and
    not handled specially

    """
    try:
        line = handle.readline(bufsize)
    except:
        print("File I/O Error")
        exit(1)
    if line == '':
        print("EOF reached")
        exit(1)
    if line[-1] == '\n':
        line = line.rstrip('\n')
    else:
        if len(line) >= bufsize:
            print("Line length exceeds buffer size")
        else:
            print("Missing newline")
        exit(1)
    words = line.split(' ')
    if len(words) >= 3:
        linedate = words[0] + " " + words[1] + " " + words[2]
    else:
        linedate = ''
    return (linedate, line)
# End function getdata()

# Set up option handling
parser = OptionParser(version = "%prog " + version)

parser.usage = "\n\t%prog [options] start-time end-time filename\n\n\
\twhere times are in the form hh:mm[:ss]"

parser.description = "Search a log file for a range of times occurring yesterday \
and/or today using the current time to intelligently select the start and end. \
A date may be specified instead. Seconds are optional in time arguments."

parser.add_option("-d", "--date", action = "store", dest = "date",
                default = "",
                help = "NOT YET IMPLEMENTED. Use the supplied date instead of today.")

parser.add_option("-l", "--long", action = "store_true", dest = "longout",
                default = False,
                help = "Span the longest possible time range.")

parser.add_option("-s", "--short", action = "store_true", dest = "shortout",
                default = False,
                help = "Span the shortest possible time range.")

parser.add_option("-D", "--debug", action = "store", dest = "debug",
                default = 0, type = "int",
                help = "Output debugging information.\t\t\t\t\tNone (default) = %default, Some = 1, More = 2")

(options, args) = parser.parse_args()

if not 0 <= options.debug <= 2:
    parser.error("debug level out of range")
else:
    debug = options.debug    # 1 = print some debug output, 2 = print a little more, 0 = none

if options.longout and options.shortout:
    parser.error("options -l and -s are mutually exclusive")

if options.date:
    parser.error("date option not yet implemented")

if len(args) != 3:
    parser.error("invalid number of arguments")

start = args[0]
end   = args[1]
file  = args[2]

# test for times to be properly formatted, allow hh:mm or hh:mm:ss
p = re.compile(r'^(2[0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')

if not p.match(start) or not p.match(end):
    print("Invalid time specification")
    exit(1)

# Determine Time Range
yesterday = date.fromordinal(date.today().toordinal()-1).strftime("%b %d")
today     = datetime.now().strftime("%b %d")
now       = datetime.now().strftime("%R")

if start > now or start > end or options.longout or options.shortout:
    searchstart = yesterday
else:
    searchstart = today

if (end > start > now and not options.longout) or options.shortout:
    searchend = yesterday
else:
    searchend = today

searchstart = searchstart + " " + start
searchend = searchend + " " + end

try:
    handle = open(file,'r')
except:
    print("File Open Error")
    exit(1)

# Set some initial values
bufsize = 4096  # handle long lines, but put a limit on them
rewind  =  100  # arbitrary, the optimal value is highly dependent on the structure of the file
limit   =   75  # arbitrary, allow for a VERY large file, but stop it if it runs away
count   =    0
size    =    os.stat(file)[ST_SIZE]
beginrange   = 0
midrange     = size // 2    # integer division keeps the seek offset an int
oldmidrange  = midrange
endrange     = size
linedate     = ''

pos1 = pos2  = 0

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstart, searchend))

# Seek using binary search
while pos1 != endrange and oldmidrange != 0 and linedate != searchstart:
    handle.seek(midrange)
    linedate, line = getdata()    # sync to line ending
    pos1 = handle.tell()
    if midrange > 0:             # if not BOF, discard first read
        if debug > 1: print("...partial: (len: {0}) '{1}'".format((len(line)), line))
        linedate, line = getdata()

    pos2 = handle.tell()
    count += 1
    if debug > 0: print("#{0} Beg: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".format(count, beginrange, midrange, endrange, pos1, pos2, linedate))
    if  searchstart > linedate:
        beginrange = midrange
    else:
        endrange = midrange
    oldmidrange = midrange
    midrange = (beginrange + endrange) // 2
    if count > limit:
        print("ERROR: ITERATION LIMIT EXCEEDED")
        exit(1)

if debug > 0: print("...stopping: '{0}'".format(line))

# Rewind a bit to make sure we didn't miss any
seek = oldmidrange
while linedate >= searchstart and seek > 0:
    if seek < rewind:
        seek = 0
    else:
        seek = seek - rewind
    if debug > 0: print("...rewinding")
    handle.seek(seek)

    linedate, line = getdata()    # sync to line ending
    if debug > 1: print("...junk: '{0}'".format(line))

    linedate, line = getdata()
    if debug > 0: print("...comparing: '{0}'".format(linedate))

# Scan forward
while linedate < searchstart:
    if debug > 0: print("...skipping: '{0}'".format(linedate))
    linedate, line = getdata()

if debug > 0: print("...found: '{0}'".format(line))

if debug > 0: print("Beg: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".format(beginrange, midrange, endrange, pos1, pos2, linedate))

# Now that the preliminaries are out of the way, we just loop,
#     reading lines and printing them until they are
#     beyond the end of the range we want

while linedate <= searchend:
    print(line)
    linedate, line = getdata()

if debug > 0: print("Start: '{0}' End: '{1}'".format(searchstart, searchend))
handle.close()

— Paused until further notice.

Wow. I really need to learn Python...

— Stefan Lasiewski

@Dennis Williamson: I see a line containing:

if debug > 0: print("File: '{0}' Size: {1} Today: '{2}' Now: {3} Start: '{4}' End: '{5}'".format(file, size, today, now, searchstar$

Is searchstar supposed to end with a $, or is that a typo? I get a syntax error on this line (line 159).

— Stefan Lasiewski

@Stefan I'd replace that with )).

— Bill Weiss

@Stefan: Thanks. It was a typo, which I've fixed. For quick reference, the $ should be t, searchend)) so that it reads ... searchstart, searchend))

— Paused until further notice.

@Stefan: Sorry about that. I think that's got it.

— Paused until further notice.

0

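A quick search on the net turns up tools that extract based on keywords (like FIRE and so on), but nothing that extracts a date range from a file.

It doesn't seem hard to do what you propose:

Search for the start time.
Print that line.
If the end time is < the start time and a line's time is > end and < start, stop.
If the end time is > the start time and a line's time is > end, stop.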
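
The four steps above could look roughly like this in Python (chosen to match the other code in this thread). It is a sketch only: in_window and time_slice are hypothetical names, times are passed as full HH:MM:SS strings, and the line is assumed to start with the fixed-width 'Mmm dd HH:MM:SS' prefix. Because it looks only at the time of day, it returns the first matching window it meets in a file that spans two days.

```python
def in_window(t, start, end):
    """True if the time-of-day string t (HH:MM:SS) lies in the window."""
    if end < start:
        # The window wraps past midnight, e.g. 22:30:00 to 02:00:00.
        return t >= start or t <= end
    return start <= t <= end


def time_slice(lines, start, end):
    """Yield the first contiguous run of lines whose time is in the window."""
    started = False
    for line in lines:
        # Characters 7..14 hold HH:MM:SS in 'Mmm dd HH:MM:SS ...'.
        if in_window(line[7:15], start, end):
            started = True
            yield line
        elif started:
            return  # we have left the window, so stop reading
```

This still reads every line before the start time, so it does nothing for speed on a large file.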

Seems straightforward. I could write it for you if you don't mind Ruby. :)

— Michael Graff

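I don't mind Ruby, but #1 isn't straightforward if you want to do it efficiently on a large file. You need to seek() to the halfway point, find the nearest line, see how it starts, and repeat with a new midpoint. Looking at every line is too inefficient.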

— Mike
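
The seek-and-halve procedure described in the comment above can be sketched in Python 3 as follows. This is an illustration rather than a tested tool: parse_stamp, line_start_at_or_after and find_first are hypothetical names, the year is passed in by the caller, and every line is assumed to begin with a valid 15-character 'Mmm dd HH:MM:SS' stamp in sorted order.

```python
import os
from datetime import datetime


def parse_stamp(line, year):
    """Turn the 'Mmm dd HH:MM:SS' prefix of a raw line into a datetime."""
    text = "%d %s" % (year, line[:15].decode("ascii"))
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


def line_start_at_or_after(f, pos):
    """Position f at the first line that begins at offset >= pos."""
    if pos == 0:
        f.seek(0)
    else:
        # Start one byte early so that a line beginning exactly
        # at pos is not skipped.
        f.seek(pos - 1)
        f.readline()
    return f.tell()


def find_first(path, target, year):
    """Return the offset of the first line stamped at or after target."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line_start_at_or_after(f, mid)
            line = f.readline()
            if line and parse_stamp(line, year) < target:
                lo = mid + 1   # the wanted line starts later
            else:
                hi = mid       # this line (or end of file) is late enough
        return line_start_at_or_after(f, lo)
```

The loop halves a range of byte offsets, so a file of about 30 GB needs on the order of 35 seeks. Printing the slice is then a matter of seeking to the returned offset and reading lines until the stamp passes the end time.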

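You said large, but didn't give an actual size. How large is large? Worse, if several days are involved, it would be very easy to find the wrong one using only the time. After all, if you cross a day boundary, the day the script runs will always differ from the day of the start time. Will the file fit in memory via mmap()?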

— Michael Graff

About 30 GB, on a network-mounted disk.

— mike

0

This will print the range of entries between a start time and an end time, based on how they relate to the current time ("now").

Usage:

timegrep [-l] start end filename

Example:

$ timegrep 18:47 03:22 /some/log/file

The -l (long) option produces the longest possible output. With it, the start time is interpreted as yesterday even when its hours-and-minutes value is less than both the end time and "now", and the end time is interpreted as today even when the start and end HH:MM values are both greater than "now".

Assuming that "now" is "Jan 11 19:00", this is how various example start and end times are interpreted (without -l except where noted):

start  end    range begin  range end
19:01  23:59  Jan 10       Jan 10
19:01  00:00  Jan 10       Jan 11
00:00  18:59  Jan 11       Jan 11
18:59  18:58  Jan 10       Jan 11
19:01  23:59  Jan 10       Jan 11     # -l
00:00  18:59  Jan 10       Jan 11     # -l
18:59  19:01  Jan 10       Jan 11     # -l
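
For readers who find the rule easier to follow in Python, here is a sketch of the same date-selection logic. The names resolve_range and long_range are illustrative only; the comparisons are the plain string comparisons on zero-padded HH:MM values that the script itself relies on.

```python
from datetime import datetime, timedelta


def resolve_range(start, end, now, long_range=False):
    """Pick yesterday's or today's date for HH:MM start and end times.

    Returns a (start_day, end_day) pair of date objects.
    """
    hhmm = now.strftime("%H:%M")
    today = now.date()
    yesterday = today - timedelta(days=1)

    # Zero-padded HH:MM strings sort in time order, so comparing
    # them as strings is enough.
    if start > hhmm or start > end or long_range:
        start_day = yesterday
    else:
        start_day = today

    if end > hhmm and end > start and start > hhmm and not long_range:
        end_day = yesterday
    else:
        end_day = today
    return start_day, end_day
```

For example, with now set to Jan 11 19:00, resolve_range("19:01", "00:00", now) places the start on Jan 10 and the end on Jan 11.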

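Almost all of the script is setup. The last two lines do all the work.

Warning: no argument validation or error checking is done. Edge cases have not been thoroughly tested. This was written using gawk; other versions of AWK may behave differently.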

#!/usr/bin/awk -f
BEGIN {
    arg=1
    if ( ARGV[arg] == "-l" ) {
        long = 1
        ARGV[arg++] = ""
    }
    start = ARGV[arg]
    ARGV[arg++] = ""
    end = ARGV[arg]
    ARGV[arg++] = ""

    yesterday = strftime("%b %d", mktime(strftime("%Y %m %d -24 00 00")))
    today = strftime("%b %d")
    now = strftime("%R")

    if ( start > now || start > end || long )
        startdate = yesterday
    else
        startdate = today

    if ( end > now && end > start && start > now && ! long )
        enddate = yesterday
    else
        enddate = today

    startdate = startdate " " start
    enddate = enddate " " end
}

$1 " " $2 " " $3 > enddate {exit}
$1 " " $2 " " $3 >= startdate {print}

I think AWK is very efficient at searching through files. I don't think anything else is necessarily going to be faster at searching an unindexed text file.

— Paused until further notice.

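It looks like you overlooked my third bullet point. The logs are on the order of 30 GB. If the first line of the file is at 7:00 and the last line is at 23:00, and I want the slice between 22:00 and 22:01, I don't want the script to look at every line between 7:00 and 22:00. I want it to estimate where the slice would be, seek to that point, and make a new estimate until it finds it.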

— mike
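
The "estimate, seek, re-estimate" idea in the comment above amounts to an interpolation search. A minimal sketch of one estimation step in Python 3 follows; estimate_offset is a hypothetical name, and the estimate assumes lines arrive at a roughly constant rate, which real logs only approximate, so the result still has to be refined (for instance by bisecting around it).

```python
def estimate_offset(first, last, target, size):
    """Guess the byte offset of `target` by linear interpolation.

    first and last are the datetimes of the first and last lines,
    and size is the file size in bytes.
    """
    if target <= first:
        return 0
    if target >= last:
        return size
    # Dividing one timedelta by another gives a plain float.
    fraction = (target - first) / (last - first)
    return int(size * fraction)
```

With a file that runs from 7:00 to 23:00, a target of 22:00 lands 15/16 of the way through the file, so the first seek already skips most of the data.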

I didn't overlook it. I gave my opinion in the last paragraph.

— Paused until further notice.

0

A C++ program that applies a binary search. It would need some simple modifications (i.e. calling strptime) to handle text dates.

http://gitorious.org/bs_grep/

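I had an earlier version that supported text dates, but it was still too slow for the scale of our log files. Profiling showed that over 90% of the time was spent in strptime, so we modified the log format to include a numeric Unix timestamp as well.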
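
To illustrate why a numeric timestamp helps, here is a small Python sketch. The answer does not say where the timestamp sits in the line, so this assumes, purely for illustration, that it is the first space-separated field and holds whole epoch seconds; epoch_of is a hypothetical name.

```python
def epoch_of(line):
    """Return the leading epoch-seconds field of a raw log line.

    Assumes the line looks like b'1263196126 Jan 11 07:48:46 ...';
    the field position is an assumption, not the author's format.
    """
    return int(line.split(b" ", 1)[0])
```

Comparing the result against precomputed integer bounds replaces a date parse on every probed line with a single int() call.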

0

This answer comes very late, but it may still be useful to some.

I've converted the code from @Dennis Williamson into a Python class that can be reused in other Python code.

I've added support for multiple date formats.

import os
from stat import *
from datetime import date, datetime
import re

# @TODO Support for rotated log files - currently using the current year for 'Jan 01' dates.
class LogFileTimeParser(object):
    """
    Extracts parts of a log file based on a start and enddate
    Uses binary search logic to speed up searching

    Common usage: validate log files during testing

    Faster than awk parsing for big log files
    """
    version = "0.01a"

    # Set some initial values
    BUF_SIZE = 4096  # self.handle long lines, but put a limit to them
    REWIND = 100  # arbitrary, the optimal value is highly dependent on the structure of the file
    LIMIT = 75  # arbitrary, allow for a VERY large file, but stop it if it runs away

    line_date = ''
    line = None
    opened_file = None

    @staticmethod
    def parse_date(text, validate=True):
        # Supports 'Aug 16 14:59:01', '2016-08-16 09:23:09' and 'Jun 1 2005  1:33:06PM' (with or without seconds or a fractional-seconds field)
        for fmt in ('%Y-%m-%d %H:%M:%S %f', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M',
                    '%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S',
                    '%b %d %Y %H:%M:%S %f', '%b %d %Y %H:%M', '%b %d %Y %H:%M:%S',
                    '%b %d %Y %I:%M:%S%p', '%b %d %Y %I:%M%p', '%b %d %Y %I:%M:%S%p %f'):
            try:
                if fmt in ['%b %d %H:%M:%S %f', '%b %d %H:%M', '%b %d %H:%M:%S']:

                    return datetime.strptime(text, fmt).replace(datetime.now().year)
                return datetime.strptime(text, fmt)
            except ValueError:
                pass
        if validate:
            raise ValueError("No valid date format found for '{0}'".format(text))
        else:
            # Cannot use NoneType to compare datetimes. Using minimum instead
            return datetime.min

    # Function to read lines from file and extract the date and time
    def read_lines(self):
        """
        Read a line from a file
        Return a tuple containing:
            the date/time in a format supported by parse_date, and the line itself
        """
        try:
            self.line = self.opened_file.readline(self.BUF_SIZE)
        except:
            raise IOError("File I/O Error")
        if self.line == '':
            raise EOFError("EOF reached")
        # Remove \n from read lines.
        if self.line[-1] == '\n':
            self.line = self.line.rstrip('\n')
        else:
            if len(self.line) >= self.BUF_SIZE:
                raise ValueError("Line length exceeds buffer size")
            else:
                raise ValueError("Missing newline")
        words = self.line.split(' ')
        # This results into Jan 1 01:01:01 000000 or 1970-01-01 01:01:01 000000
        if len(words) >= 3:
            self.line_date = self.parse_date(words[0] + " " + words[1] + " " + words[2],False)
        else:
            self.line_date = self.parse_date('', False)
        return self.line_date, self.line

    def get_lines_between_timestamps(self, start, end, path_to_file, debug=False):
        # Set some initial values
        count = 0
        size = os.stat(path_to_file)[ST_SIZE]
        begin_range = 0
        mid_range = size // 2  # integer division keeps the seek offset an int
        old_mid_range = mid_range
        end_range = size
        pos1 = pos2 = 0

        # If only hours are supplied
        # test for times to be properly formatted, allow hh:mm or hh:mm:ss
        p = re.compile(r'(^[2][0-3]|[0-1][0-9]):[0-5][0-9](:[0-5][0-9])?$')
        if p.match(start) and p.match(end):
            # Determine Time Range
            yesterday = date.fromordinal(date.today().toordinal() - 1).strftime("%Y-%m-%d")
            today = datetime.now().strftime("%Y-%m-%d")
            now = datetime.now().strftime("%R")
            if start > now or start > end:
                search_start = yesterday
            else:
                search_start = today
            if end > start > now:
                search_end = yesterday
            else:
                search_end = today
            search_start = self.parse_date(search_start + " " + start)
            search_end = self.parse_date(search_end + " " + end)
        else:
            # Set dates
            search_start = self.parse_date(start)
            search_end = self.parse_date(end)
        try:
            self.opened_file = open(path_to_file, 'r')
        except:
            raise IOError("File Open Error")
        if debug:
            print("File: '{0}' Size: {1} Start: '{2}' End: '{3}'"
                  .format(path_to_file, size, search_start, search_end))

        # Seek using binary search -- ONLY WORKS ON FILES THAT ARE SORTED BY DATE (should be true for log files)
        try:
            while pos1 != end_range and old_mid_range != 0 and self.line_date != search_start:
                self.opened_file.seek(mid_range)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                pos1 = self.opened_file.tell()
                # if not beginning of file, discard first read
                if mid_range > 0:
                    if debug:
                        print("...partial: (len: {0}) '{1}'".format((len(self.line)), self.line))
                    self.line_date, self.line = self.read_lines()
                pos2 = self.opened_file.tell()
                count += 1
                if debug:
                    print("#{0} Beginning: {1} Mid: {2} End: {3} P1: {4} P2: {5} Timestamp: '{6}'".
                          format(count, begin_range, mid_range, end_range, pos1, pos2, self.line_date))
                if search_start > self.line_date:
                    begin_range = mid_range
                else:
                    end_range = mid_range
                old_mid_range = mid_range
                mid_range = (begin_range + end_range) // 2
                if count > self.LIMIT:
                    raise IndexError("ERROR: ITERATION LIMIT EXCEEDED")
            if debug:
                print("...stopping: '{0}'".format(self.line))
            # Rewind a bit to make sure we didn't miss any
            seek = old_mid_range
            while self.line_date >= search_start and seek > 0:
                if seek < self.REWIND:
                    seek = 0
                else:
                    seek -= self.REWIND
                if debug:
                    print("...rewinding")
                self.opened_file.seek(seek)
                # sync to self.line ending
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...junk: '{0}'".format(self.line))
                self.line_date, self.line = self.read_lines()
                if debug:
                    print("...comparing: '{0}'".format(self.line_date))
            # Scan forward
            while self.line_date < search_start:
                if debug:
                    print("...skipping: '{0}'".format(self.line_date))
                self.line_date, self.line = self.read_lines()
            if debug:
                print("...found: '{0}'".format(self.line))
            if debug:
                print("Beginning: {0} Mid: {1} End: {2} P1: {3} P2: {4} Timestamp: '{5}'".
                      format(begin_range, mid_range, end_range, pos1, pos2, self.line_date))
            # Now that the preliminaries are out of the way, we just loop,
            # reading lines and printing them until they are beyond the end of the range we want
            while self.line_date <= search_end:
                # Exclude our 'Nonetype' values
                if not self.line_date == datetime.min:
                    print(self.line)
                self.line_date, self.line = self.read_lines()
            if debug:
                print("Start: '{0}' End: '{1}'".format(search_start, search_end))
            self.opened_file.close()
        # Do not display EOFErrors:
        except EOFError as e:
            pass

— 제프리 데 브루