Rendered HTML to plain text using Python


Question

I'm trying to convert a chunk of HTML to text with BeautifulSoup. Here is an example:

<div>
    <p>
        Some text
        <span>more text</span>
        even more text
    </p>
    <ul>
        <li>list item</li>
        <li>yet another list item</li>
    </ul>
</div>
<p>Some other text</p>
<ul>
    <li>list item</li>
    <li>yet another list item</li>
</ul>

I tried doing something like:

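import re
import BeautifulSoup  # BeautifulSoup 3 API, as used in the question

def parse_text(contents_string):
    # Collapse each line break plus the whitespace after it into one newline.
    Newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # '\n' is inserted between every text node, inline elements included.
    txt = bs.getText('\n')
    return Newlines.sub('\n', txt)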

...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way in Python to get the text of an HTML page the way a browser would render it (no CSS rules required, just the default way div, span, li, etc. elements are rendered)?

Recommended answer

BeautifulSoup is a scraping library, so it's probably not the best choice for rendering HTML. If using BeautifulSoup isn't essential, take a look at html2text. For example:

import html2text

# Read the HTML file and convert it to Markdown-style plain text.
with open("foobar.html") as f:
    html = f.read()
print(html2text.html2text(html))

This outputs:


Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item
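
If installing html2text is not an option, the block-versus-inline behaviour the question asks about can be roughly approximated with the standard library alone. The following is only a minimal sketch, not how html2text works internally: `TextRenderer`, `render_text`, and the `BLOCK_TAGS` set are hypothetical names introduced here, and the list of block-level elements is a hand-picked assumption rather than the full set a browser uses.

```python
import re
from html.parser import HTMLParser

# Assumption: a hand-picked subset of elements treated as block-level.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br", "table", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextRenderer(HTMLParser):
    """Collects text, breaking lines only around block-level elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag == "li":
            self.parts.append("* ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs so inline text (e.g. <span>) stays on one line.
        self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = (line.strip() for line in "".join(self.parts).split("\n"))
        return "\n".join(line for line in lines if line)


def render_text(html):
    """Hypothetical helper: return a plain-text approximation of the HTML."""
    parser = TextRenderer()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `render_text` on the HTML from the question keeps the span text on the same line as its surrounding paragraph text and puts each list item on its own line. Unlike html2text, this sketch does not insert blank lines between blocks and ignores links, tables, and nested list indentation.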
