Extracting text from an HTML file using Python

243

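I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I had pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.

— John D. Cook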

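For quite some time now, people seem to have found my NLTK answer (quite recent) extremely useful, so you might want to consider changing the accepted answer. Thanks!

— Shatu

1

I never thought I'd run into a question from the author of my favorite blog. The Endeavour!

— Ryan G

1

@Shatu Now that your solution is no longer valid, you might want to delete your comment. Thanks! ;)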

— Sнаđошƒаӽ

136

html2text is a Python program that does a pretty good job at this.

— Rex

5

But it's GPL 3.0, which means it may be incompatible.

— frog32

138

Amazing! Its author is Aaron Swartz, RIP.

— Atul Arvind

2

Did anybody find an alternative to html2text because of GPL 3.0?

— jontsai

1

GPL isn't as bad as people want it to be. Aaron knew best.

— Steve K

2

I tried both html2text and nltk, but they didn't work for me. I ended up going with Beautiful Soup 4, which works beautifully.

— Ryan

150

The best piece of code I found for extracting text without getting JavaScript or other unwanted things:

import urllib
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You need to install BeautifulSoup first:

pip install beautifulsoup4

— PeYoTlL

2

What should we do if we want to select line 3?

— hepidad
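
One way to approach the question above, as a minimal sketch: once the cleaned text is a single string, splitting it on newlines gives a list that can be indexed. The helper name pick_line and the sample string are made up for illustration, and line numbers are treated as 1-based.

```python
def pick_line(text, number):
    """Return the 1-based line `number` of text, or None if it does not exist."""
    lines = text.splitlines()
    if 1 <= number <= len(lines):
        return lines[number - 1]
    return None


# Stand-in for the `text` variable produced by the answer's code.
sample = "Headline\nByline\nFirst paragraph"
print(pick_line(sample, 3))
```

Returning None for a missing line avoids an IndexError on short pages; raising instead would be an equally reasonable choice.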

3

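The "kill all script" bit was a lifesaver!!

— Nanda

2

After going through a lot of Stack Overflow answers, I feel this is the best option for me. One problem I ran into was that lines were joined together in some cases. I was able to overcome it by adding a separator to the get_text function: text = soup.get_text(separator=' ')

— Joswin KJ

5

Instead of soup.get_text() I used soup.body.get_text(), so that I don't get any text from the <head> element, such as the title.

— Sjoerd

10

For Python 3: from urllib.request import urlopen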

— Jacob Kalakal Joseph

99

Note: NLTK no longer supports the clean_html function.

The original answer is below, and an alternative is in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily, I came across NLTK.
It works like magic.

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

— Shatu

8

Sometimes that's enough :)

— Sharmila

8

I want to upvote this a thousand times. I was stuck in regex hell, but now I see the wisdom of NLTK.

— BenDundee

26

Apparently, clean_html is not supported anymore: github.com/nltk/nltk/commit/...

— alexanderlukanin13

5

Importing a heavy library like NLTK for such a simple task would be too much.

— Richie

54

@alexanderlukanin13 From the source: raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

— Chris Arena

54

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from HTMLParser import HTMLParser
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

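— xperroni

5

This seems to be the most straightforward way of doing this in Python (2.7) using only the default modules. Since this is such a commonly needed thing, there is no good reason for the default HTMLParser module not to include a parser for it.

— Ingmar Hupp

2

I don't think this will convert HTML characters into Unicode, right? For example, &amp; won't be converted into &, right?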

— speedplane
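
On the entity question above: in Python 3 the standard library offers html.unescape, which decodes both named entities and numeric character references. A minimal sketch follows; the wrapper name decode_entities and the sample string are made up for illustration, and this is separate from the parser in the answer.

```python
import html


def decode_entities(text):
    """Decode named entities (&amp;) and numeric references (&#39;, &#x27;)."""
    return html.unescape(text)


print(decode_entities("Tom &amp; Jerry&#39;s &lt;b&gt; tag"))
```

Decoding should happen on text that has already had its tags removed; otherwise an escaped &lt;b&gt; turns back into something that looks like markup.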

For Python 3, use from html.parser import HTMLParser

— sebhaase
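
Following the Python 3 hint above, here is a rough sketch of how the small parser from the answer might look with html.parser. It is an adaptation, not the author's code: the class is renamed, both <br> and <br/> add a single newline (a deliberate simplification), and the error handling is left out. It relies on HTMLParser's default convert_charrefs=True (Python 3.5+), so entities arrive in handle_data already decoded.

```python
from html.parser import HTMLParser
import re


class DeHTMLParser(HTMLParser):
    """Collect text and add simple breaks for <p> and <br>."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self._parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            # collapse internal whitespace runs to a single space
            self._parts.append(re.sub(r'[ \t\r\n]+', ' ', text) + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._parts.append('\n\n')
        elif tag == 'br':
            self._parts.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._parts.append('\n')

    def text(self):
        return ''.join(self._parts).strip()


def dehtml(markup):
    parser = DeHTMLParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(dehtml('<p>Tom &amp; Jerry<br>second   line</p>'))
```

Like the original, this sketch does not skip script or style content.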

14

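Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates character references (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.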

"""
HTML <-> text conversions.
"""
from HTMLParser import HTMLParser, HTMLParseError
from htmlentitydefs import name2codepoint
import re

class _HTMLToText(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = unichr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(unichr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))

def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    try:
        parser.feed(html)
        parser.close()
    except HTMLParseError:
        pass
    return parser.get_text()

def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

— bit4

Python 3 version: gist.github.com/Crazometer/af441bc7dc7353d41390a59f20f07b51

— Crazometer

In get_text, ''.join should be ' '.join. There should be a space in between; otherwise some of the text fragments will be joined together.

— Obinna Nnenanya
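
To illustrate the point about the separator, a tiny sketch: the fragment list stands in for the parser's internal buffer, and join_fragments is a made-up helper that mirrors the join-then-collapse step of get_text.

```python
import re


def join_fragments(fragments, separator):
    """Join collected text fragments, then collapse runs of spaces."""
    return re.sub(r' +', ' ', separator.join(fragments))


fragments = ['Heading', 'First paragraph.', 'Second paragraph.']
print(join_fragments(fragments, ''))    # fragments run together
print(join_fragments(fragments, ' '))   # fragments stay separated
```

With the empty separator, text from adjacent tags that had no whitespace between them in the source ends up glued together.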

1

Also, this does not catch all of the text unless you include other text container tags such as H1, H2, ..., span, and so on. I had to tweak it for better coverage.

— Obinna Nnenanya
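
One possible shape for such a tweak, as a hedged Python 3 sketch rather than the commenter's actual change: treat a wider set of block-level tags as line breaks and keep skipping script and style. The BLOCK_TAGS list is an assumption and would need adjusting for real pages.

```python
from html.parser import HTMLParser
import re

# Assumed set of tags that should start and end a line; extend as needed.
BLOCK_TAGS = {'p', 'br', 'div', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
SKIP_TAGS = {'script', 'style'}


class HTMLToText(HTMLParser):
    def __init__(self):
        super().__init__()
        self._buf = []
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and not self._skip_depth:
            self._buf.append('\n')

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(re.sub(r'\s+', ' ', data))

    def get_text(self):
        text = re.sub(r' +', ' ', ''.join(self._buf))
        lines = (line.strip() for line in text.split('\n'))
        return '\n'.join(line for line in lines if line)


def html_to_text(markup):
    parser = HTMLToText()
    parser.feed(markup)
    parser.close()
    return parser.get_text()


print(html_to_text('<h1>Title</h1><script>var x = 1;</script>'
                   '<p>Hello <span>big</span> world</p>'))
```

Inline tags such as span need no special handling here, because their text reaches handle_data like any other text.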

11

I know there are a lot of answers already, but the most elegant and pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

— Floyd

2

To avoid a warning, specify a parser for BeautifulSoup to use: text = ''.join(BeautifulSoup(some_html_string, "lxml").findAll(text=True))

— Floyd

You can use the stripped_strings generator to avoid excessive whitespace, i.e. clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)

— Fraser

8

You can also use the html2text method from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

— eek Tantra

23

According to its PyPI page, this module is deprecated: "Unless you have some historical reason for using this package, I'd advise against it!"

— intuited

7

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

— Nuncjo

6

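PyParsing does a great job. The PyParsing wiki was shut down, so here is another place with examples of PyParsing in use (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual.

That said, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with; you can convert them before you run BeautifulSoup.

Good luck

— PyNEwbie

1

The link is dead or has gone sour.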

— Yvette

4

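This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine and will convert source to text with the -dump option.

So you could do something like:

import os
import subprocess
import tempfile

with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)  # links reads the HTML from this temporary file
    fname = f.name
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.communicate()[0]
os.remove(fname)

— Andrew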

4

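Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty old, so it's not much help for getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it is hard to do this reliably for malformed HTML.) Anyway, here is something simple that prints the plain text to the console: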

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

— Mark

NB: HTMLError and HTMLParserError should both read HTMLParseError. This works, but does a bad job of maintaining line breaks.

— Dave Knight

4

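There is a Python package called Goose (goose-extractor). Goose will try to extract the following information:

The main text of an article
The main image of an article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor/

— Li Yingjun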

4

If you need more speed and can accept less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

— Anton Shelin

4

Install html2text using

pip install html2text

Then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!")
Hello, world!

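— Pravitha V

4

I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requested.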

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML files downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

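— spatel

3

Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

import copy
import re

import BeautifulSoup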
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

— speedplane

3

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.

— John Lucas

3

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

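The results are really good.

— Vim

2

Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...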

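— 야 코크

2

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

— 약

It seems to work for me too, but they don't recommend using it for this purpose: "This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page." -> bleach.readthedocs.io/en/latest/clean.html#bleach.clean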

— Loktopus

2

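I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly out of the box.

Tika can be run as a server, is trivial to run and deploy in a Docker container, and from there can be accessed via Python bindings.

— Euphoria

1

In a simple way: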

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds all parts of html_text that start with '<' and end with '>', and replaces everything found with an empty string.

— David Fraga
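
A short sketch of what that substitution does and does not handle. The function name strip_tags and the sample markup are made up for illustration; the pattern is the one from the answer.

```python
import re


def strip_tags(html_text):
    """Remove everything between '<' and the next '>' on the same line."""
    return re.sub(r'<(.*?)>', '', html_text)


sample = '<p>Tom &amp; Jerry</p><script>var x = 1;</script>'
print(strip_tags(sample))
```

The tags disappear, but the entity stays encoded and the script body remains in the output. Because "." does not match a newline by default, a tag that spans several lines would also be left untouched.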

1

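The answer by @PeYoTlL using BeautifulSoup and eliminating style and script content didn't work for me. I tried it using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. It is available at this gist, with a test document embedded.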

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

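— racitup

1

Thanks, this answer is underrated. For those of us who want a clean text representation that behaves more like a browser (ignoring newlines, and only taking paragraphs and line breaks into consideration), BeautifulSoup's get_text simply doesn't cut it.

— jrial

@jrial Glad you found it useful. Thanks to contributions from others, the linked gist has also been improved quite a bit. What the OP seems to allude to is a tool that renders HTML to text, much like a text-based browser such as lynx. That is what this solution attempts. What most people are contributing are just text extractors.

— racitup

1

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.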

status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1]) 
#email.message_from_string(data[0][1])

#If message is multi part we only want the text version of the body, this walks the message and gets the body.

if email_msg.is_multipart():
    for part in email_msg.walk():       
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True) #to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format.

— Wahi Ul Haq

This does not convert anything.

— Antti Haapala

1

This shows you how to extract a text/plain part from an email if someone else put one there. It doesn't do anything to convert HTML into plain text, and does nothing remotely useful if you are trying to convert HTML from, say, a web site.

— tripleee

1

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con,'html.parser')
texts = soup.get_text()
print(texts)

— Sai Gopi N

1

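While a lot of people mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here is a snippet I came up with. You can customize it to your specific needs, and it works like a charm.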

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

— Uri Goren

1

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and remove leading and trailing space on each, then break multi-headlines into a line each. Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as utf-8.

Notes:

Some systems this is run on will fail with https:// connections because of SSL issues; you can turn off verification to work around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.

— Mike Q
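
The last note is easy to see if the same return expression is run under Python 3 (the answer itself targets Python 2.7). A small sketch, with made-up helper names:

```python
def as_encoded_str(text):
    """Mimic the answer's `str(text.encode('utf-8'))` return expression."""
    return str(text.encode('utf-8'))


def as_plain_str(text):
    """The alternative suggested in the note: return the text as-is."""
    return str(text)


sample = 'caf\u00e9'
print(as_encoded_str(sample))  # on Python 3 this is the repr of a bytes object
print(as_plain_str(sample))
```

On Python 3, str() applied to a bytes object yields its repr, including the b'...' wrapper and escaped bytes, which is one way the "weird encoding" can show up.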

0

Here is the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.request


def processText(webpage):

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text = True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = (' ').join((item.text).split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

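I hope that helps.

— troymyname00

0

The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this is a one-off task rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.

— 1of7

0

The Perl way (sorry, I will never do this in production).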

import re

def html2text(html):
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    res = re.sub('(\n )+', '\n ', res)
    return res

— Brunql

This is bad practice for many reasons, for example

— Uri Goren

Yes! It's true! Don't do it anywhere!

— brunql