How do I delete words from a TXT file that exist in another TXT file?

8

File a.txt contains about 100k words, one word per line:

july.cpp
windows.exe
ttm.rar
document.zip

File b.txt contains 150k words, one per line. Some of the words come from a.txt, but some are new:

july.cpp
NOVEMBER.txt
windows.exe
ttm.rar
document.zip
diary.txt

How can I merge these files into one, delete all duplicate lines, and keep only the lines that are new (lines that exist in a.txt but do not exist in b.txt)?

text-processing

— Kate-Kasia

Would you be willing to use Python?

— Tim

2

@MikołajBartnicki Unix.SE would probably be a better place for this.

— Glutanimate

1

Kasia, I made a mistake in my answer, so I deleted it. I am working on a new one.

2

@Glutanimate This question is perfectly fine here.

— Seth

1

@Glutanimate Ah, sorry. I somehow missed that comment.

— Seth

13

There is a command for this: comm. As stated in man comm, it is as simple as:

   comm -3 file1 file2
          Print lines in file1 not in file2, and vice versa.

Note that comm expects the contents of the files to be sorted, so you must sort them before calling comm, like this:

sort unsorted-file.txt > sorted-file.txt

To sum up:

sort a.txt > as.txt

sort b.txt > bs.txt

comm -3 as.txt bs.txt > result.txt

After running the commands above, result.txt will contain the expected lines.
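To see why the sorting step matters, here is a rough Python sketch of the kind of single-pass walk that a comm -3 style comparison can do over two sorted lists. This is not comm's actual source; the function name comm_3 and the inline sample data (taken from the question) are only for illustration. Note that Python's sorted() orders by code point, which may differ from what sort produces in your locale.

```python
def comm_3(first, second):
    """Yield (column, line) pairs for lines found in only one of two sorted lists.

    column is 1 for lines only in `first` and 2 for lines only in `second`.
    Both inputs must already be sorted with the same ordering.
    """
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            # Common line: skipped, like column 3 being suppressed.
            i += 1
            j += 1
        elif first[i] < second[j]:
            # first[i] cannot appear later in `second`, because `second` is sorted.
            yield 1, first[i]
            i += 1
        else:
            yield 2, second[j]
            j += 1
    # Whatever remains in either list has no partner in the other.
    for line in first[i:]:
        yield 1, line
    for line in second[j:]:
        yield 2, line


# Sample data from the question, sorted the way Python sorts strings.
a_lines = sorted(["july.cpp", "windows.exe", "ttm.rar", "document.zip"])
b_lines = sorted(["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
                  "document.zip", "diary.txt"])

result = list(comm_3(a_lines, b_lines))
for column, line in result:
    # comm indents column-2 lines with a tab; mimic that here.
    print("\t" * (column - 1) + line)
```

If the inputs were not sorted, the walk would move past a line before reaching its match in the other list and report it as unique, which is why the files are sorted first. The tab in front of column-2 lines is also worth knowing about: with comm -3, lines found only in the second file come out indented.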

Thanks, it works like a charm. P.S. ;-)

— Kate-Kasia

2

Here is a short python3 script, based on Germar's answer, that should do this while preserving the unsorted order of b.txt.

#!/usr/bin/python3

with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)

with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)

— Lily Chung

1

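#!/usr/bin/env python3

# Load the words of a.txt into a set so each membership test is fast.
with open('a.txt', 'r') as f:
    a = set(f.read().split('\n'))

# Print every non-empty line of b.txt that does not appear in a.txt.
# Iterating over the file avoids stopping early at a blank line.
with open('b.txt', 'r') as f:
    for line in f:
        b = line.strip('\n ')
        if b and b not in a:
            print(b)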

— Germar

2

You are shooting at mosquitoes with a naval cannon!

:-) You are right. I missed the 'k' in 100k.

— Germar

1

Take a look at the coreutils comm command (man comm):

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... FILE1 FILE2

DESCRIPTION
       Compare sorted files FILE1 and FILE2 line by line.

       With  no  options,  produce  three-column  output.  Column one contains
       lines unique to FILE1, column two contains lines unique to  FILE2,  and
       column three contains lines common to both files.

       -1     suppress column 1 (lines unique to FILE1)

       -2     suppress column 2 (lines unique to FILE2)

       -3     suppress column 3 (lines that appear in both files)

For example, you can do

$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt

(the lines unique to b.txt)
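If you would rather do the same thing in Python, a set difference gives roughly the same result as comm -13 on sorted input. This is only a minimal sketch: the helper name unique_to_second is made up for illustration, and the sample lists are the ones from the question. Unlike comm, a set collapses repeated lines, and Python sorts by code point, so NOVEMBER.txt comes before diary.txt here rather than after it.

```python
def unique_to_second(first_lines, second_lines):
    """Return the sorted lines that occur in second_lines but not in first_lines."""
    return sorted(set(second_lines) - set(first_lines))


# Sample data from the question (a.txt and b.txt, one entry per line).
a_lines = ["july.cpp", "windows.exe", "ttm.rar", "document.zip"]
b_lines = ["july.cpp", "NOVEMBER.txt", "windows.exe", "ttm.rar",
           "document.zip", "diary.txt"]

new_lines = unique_to_second(a_lines, b_lines)
print("\n".join(new_lines))
```

For real files you could read the lines with something like open('a.txt').read().splitlines() in place of the hard-coded lists.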

— steeldriver