Concatenating lines based on the first column with awk or sed

12

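How can awk be used in the following situation?

I want to concatenate lines that start with the same first column. After the join, the first column is kept only once (in this case aaa, www, hhh).

The file may be space-separated or tab-separated.

Example input: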

aaa bbb ccc ddd NULL NULL NULL
aaa NULL NULL NULL NULL NULL NULL
aaa bbb ccc NULL NULL NULL NULL
www yyy hhh NULL NULL NULL NULL
hhh 111 333 yyy ooo hyy uuuioooy
hhh 111 333 yyy ooo hyy NULL

Desired output:

aaa bbb ccc ddd NULL NULL NULL NULL NULL NULL NULL NULL NULL bbb ccc NULL NULL NULL NULL
www yyy hhh NULL NULL NULL NULL
hhh 111 333 yyy ooo hyy uuuioooy 111 333 yyy ooo hyy NULL

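The background is that I want to set up a very simple file-based database. The first column is always the identifier of an entity. All lines that share the same identifier column are concatenated.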

text-processing sed awk

— tiny

1

Where does the uuu line (in the output) come from?

— saeedn

Sorry, my mistake. I'll edit it.

— tiny

8

To get the first column of each line with awk, do the following:

< testfile awk '{print $1}'
aaa
aaa
aaa
www
hhh
hhh

These are the keys for the rest of each line. So you can build a hash table that uses the first column as the key and the second column as the value:

< testfile awk '{table[$1]=table[$1] $2;} END {for (key in table) print key " => " table[key];}'
www => yyy
aaa => bbbNULLbbb
hhh => 111111

To get the whole rest of the line, starting from column 2, you need to collect all the columns:

< testfile awk '{line="";for (i = 2; i <= NF; i++) line = line $i " "; table[$1]=table[$1] line;} END {for (key in table) print key " => " table[key];}'
www => yyy hhh NULL NULL NULL NULL 
aaa => bbb ccc ddd NULL NULL NULL NULL NULL NULL NULL NULL NULL bbb ccc NULL NULL NULL NULL 
hhh => 111 333 yyy ooo hyy uuuioooy 111 333 yyy ooo hyy NULL
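For comparison, the same table-building idea can be sketched in Python. This is only an illustration and not part of the awk solution: the function name `build_table` and the inline sample list are assumptions made for the example. A Python dict (3.7 and later) iterates in insertion order, so keys come out in order of first appearance, whereas the order of awk's `for (key in table)` is unspecified.

```python
def build_table(lines):
    """Map each first-column key to the remaining columns of all its lines."""
    table = {}
    for line in lines:
        fields = line.split()          # splits on runs of spaces or tabs
        if not fields:
            continue                   # ignore blank lines
        key, rest = fields[0], fields[1:]
        table.setdefault(key, []).extend(rest)
    # Join the collected columns; dict order is first-seen order.
    return {key: " ".join(cols) for key, cols in table.items()}


sample = [
    "aaa bbb ccc ddd NULL NULL NULL",
    "aaa NULL NULL NULL NULL NULL NULL",
    "aaa bbb ccc NULL NULL NULL NULL",
    "www yyy hhh NULL NULL NULL NULL",
    "hhh 111 333 yyy ooo hyy uuuioooy",
    "hhh 111 333 yyy ooo hyy NULL",
]

for key, rest in build_table(sample).items():
    print(key, "=>", rest)
```

With the sample input from the question, the keys are printed in the order aaa, www, hhh.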

— binfalse

Yes, collecting the lines into a hash table is exactly what I needed. Thanks!

— tiny

2

@tiny - I assumed you needed to preserve the ordering. Is that not the case? (This answer produces output in the order of the hashing mechanism, not your original order.)

— ire_and_curses

3

Someone else can answer with awk or sed, but a Python version is straightforward and might be helpful to you.

#!/usr/bin/env python3

input_file = 'input.dat'

input_order = []   # keys in order of first appearance
seen        = {}   # key -> concatenated remainder of its lines

with open(input_file) as in_fh:
    for line in in_fh:
        # Split on any run of whitespace (spaces or tabs)...
        fields = line.split()
        if not fields:
            continue  # skip blank lines

        # Separate the first column from the rest of the line...
        key_col, rest_of_line = fields[0], ' '.join(fields[1:])

        # If we've seen this key already, concatenate the line...
        if key_col in seen:
            seen[key_col] += ' ' + rest_of_line
        # ...otherwise, record the ordering, and store the new info
        else:
            input_order.append(key_col)
            seen[key_col] = rest_of_line

# Dump the ordered output to stdout
for unique_col in input_order:
    print(unique_col, seen[unique_col])

— ire_and_curses

Very cool. With zero Python experience I even managed to edit the script so that it takes the input file name as its first argument :)

— tiny
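The edited script mentioned in this comment is not shown in the thread. One way such a change might look is sketched below; the function names `merge_lines` and `main` and the usage message are hypothetical, chosen only for this illustration.

```python
import sys


def merge_lines(lines):
    """Concatenate lines sharing a first column, keeping first-seen order."""
    merged = {}
    for line in lines:
        fields = line.split()          # handles spaces and tabs
        if fields:
            merged.setdefault(fields[0], []).extend(fields[1:])
    return [" ".join([key] + rest) for key, rest in merged.items()]


def main(argv):
    # argv[1] is expected to be the input file name.
    if len(argv) != 2:
        print("usage: {} INPUT_FILE".format(argv[0]), file=sys.stderr)
        return
    with open(argv[1]) as in_fh:
        for out_line in merge_lines(in_fh):
            print(out_line)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choice, the script would then be run with the data file as its only argument.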

2

This is an interesting application of coreutils. I don't think it is very efficient on large input, because it invokes join for every line of the input.

# Requires bash (process substitution)
touch outfile
while IFS= read -r; do
  join -a1 -a2 outfile <(echo "$REPLY") > tmp
  mv tmp outfile
done < infile

To improve efficiency, storing outfile and tmp on a RAM disk may help.

Edit

Or without temporary files:

out=""
while read; do
  out=$(join -a1 -a2 <(echo -n "$out") <(echo -n "$REPLY"))
done < infile

echo "$out"

— Thor

2

And here is a Perl one-liner:

$ perl -e 'my %h; while(<>){chomp; @a=split(/\s+/); $k=shift(@a); $h{$k}.=join(" ", @a) . " "; } map{$h{$_}=~s/\s*$//; print "$_ $h{$_}\n"}keys(%h);' infile

— terdon