Rendered HTML to plain text using Python
I'm trying to convert a chunk of HTML text with BeautifulSoup. Here is an example:
<div>
<p>
Some text
<span>more text</span>
even more text
</p>
<ul>
<li>list item</li>
<li>yet another list item</li>
</ul>
</div>
<p>Some other text</p>
<ul>
<li>list item</li>
<li>yet another list item</li>
</ul>
I tried doing something like:
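import re
import BeautifulSoup  # BeautifulSoup 3 API, as used in the question

def parse_text(contents_string):
    # Collapse a line break followed by whitespace into a single newline
    Newlines = re.compile(r'[\r\n]\s+')
    bs = BeautifulSoup.BeautifulSoup(contents_string, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    txt = bs.getText('\n')
    return Newlines.sub('\n', txt)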
...but that way my span element always ends up on a new line. This is of course a simple example. Is there a way, in Python, to get the text of an HTML page the way it would be rendered in a browser (no CSS rules required, just the regular way div, span, li, etc. elements are rendered)?
BeautifulSoup is a scraping library, so it's probably not the best choice for doing HTML rendering. If it's not essential to use BeautifulSoup, you should take a look at html2text. For example:
import html2text

html = open("foobar.html").read()
print(html2text.html2text(html))
This outputs:
Some text more text even more text

  * list item
  * yet another list item

Some other text

  * list item
  * yet another list item
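If installing html2text is not an option, the same idea (break lines only around block-level elements, keep inline elements such as span on the current line) can be roughed out with the standard library's html.parser. This is only a minimal sketch, not how html2text or a browser works internally: the names BLOCK_TAGS, TextRenderer and render_text are made up for illustration, and the set of block-level tags is an assumed, incomplete list.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags treated as block-level.
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "br",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextRenderer(HTMLParser):
    """Collect text, starting new lines only around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")
        if tag == "li":
            self.parts.append("* ")  # simple bullet marker for list items

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs (including source line breaks) to one space.
        self.parts.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        lines = "".join(self.parts).split("\n")
        # Squeeze repeated spaces, trim each line, and drop empty lines.
        lines = [re.sub(r" +", " ", line).strip() for line in lines]
        return "\n".join(line for line in lines if line)


def render_text(html):
    renderer = TextRenderer()
    renderer.feed(html)
    renderer.close()
    return renderer.get_text()


if __name__ == "__main__":
    sample = """<div>
<p>
Some text
<span>more text</span>
even more text
</p>
<ul>
<li>list item</li>
<li>yet another list item</li>
</ul>
</div>"""
    print(render_text(sample))
```

Run on the first part of the sample from the question, this prints "Some text more text even more text" on one line, followed by "* list item" and "* yet another list item" on their own lines. Unlike html2text, it drops the blank lines between blocks and ignores links, tables, and nested list indentation.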