Python: convert HTML to text and mimic formatting
Question
I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:
<ul>
<li>One</li>
<li>Two</li>
</ul>
would become
* One
* Two
and
Some text
<blockquote>
More magnificent text here
</blockquote>
Final text
would become
Some text
More magnificent text here
Final text
I'm reading the docs, but I'm not seeing anything straightforward. Any help? I'm open to using something other than BeautifulSoup.
Recommended answer
Take a look at Aaron Swartz's html2text script (it can be installed with pip install html2text). Note that the output is valid Markdown. If for some reason that doesn't fully suit you, some fairly trivial tweaks should get you the exact output in your question:
In [1]: import html2text
In [2]: h1 = """<ul>
...: <li>One</li>
...: <li>Two</li>
...: </ul>"""
In [3]: print(html2text.html2text(h1))
* One
* Two
In [4]: h2 = """<p>Some text
...: <blockquote>
...: More magnificent text here
...: </blockquote>
...: Final text</p>"""
In [5]: print(html2text.html2text(h2))
Some text
> More magnificent text here
Final text
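One such tweak might be to strip the Markdown blockquote markers from the converted text. The following is only a minimal sketch: strip_blockquote_markers is a hypothetical helper name, not part of html2text, and it assumes that every leading ">" in the output is a quote marker rather than literal text.

```python
import re

# Hypothetical helper (not part of html2text).
# Matches one or more leading "> " markers at the start of each line,
# which also covers nested quotes such as "> > text".
_QUOTE_PREFIX = re.compile(r"^(?:>[ \t]?)+", flags=re.MULTILINE)


def strip_blockquote_markers(markdown_text):
    """Return markdown_text with leading blockquote markers removed."""
    return _QUOTE_PREFIX.sub("", markdown_text)


# A hard-coded sample resembling the Markdown shown above, so the sketch
# runs without html2text installed.
sample = "Some text\n\n> More magnificent text here\n\nFinal text\n"
print(strip_blockquote_markers(sample))
```

With html2text installed, the same helper could be applied to its result, for example strip_blockquote_markers(html2text.html2text(h2)). The exact whitespace html2text emits may vary between versions, so treat this as a starting point rather than a guaranteed match for the desired output.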