Print HTML tags as a string using lxml etree
Question
I want to print each tag as a whole, just like soup.find_all() does, but using lxml etree. With lxml I only get the tag name instead of the whole tag, which I want to use for comparison purposes. Thank you.
Code:
from bs4 import BeautifulSoup
from lxml import etree
doc = "<p><a></a><a></a>Printable Text"
soup = BeautifulSoup(doc, "lxml")
root = etree.fromstring(str(soup))
tree = etree.ElementTree(root)
for e in tree.iter():
    print(e.tag)
    print("--------------")
Output:
html
--------------
body
--------------
p
--------------
a
--------------
a
--------------
Expected output:
<html><body><p><a></a><a></a>Printable Text</p></body></html>
--------------
<body><p><a></a><a></a>Printable Text</p></body>
--------------
<p><a></a><a></a>Printable Text</p>
--------------
<a></a>
--------------
<a></a>
--------------
Recommended answer
You don't really need to parse your doc (note that in your question you failed to include the closing </p> tag) with beautifulsoup, then parse the soup with lxml, and finally wrap that with ElementTree. But if you want or need to stick to that, you can get close (but not 100%) to your expected output by changing your for loop from
for e in tree.iter():
    print(e.tag)
to (as mentioned by @mzjn in the comments):
for e in tree.iter():
    print(etree.tostring(e).decode())
If you want to (or can) skip the ElementTree step, you can get the same output by using XPath:
for e in root.xpath('//*'):
    print(etree.tostring(e).decode())
In either case, the output is
<html><body><p><a/><a/>Printable Text</p></body></html>
<body><p><a/><a/>Printable Text</p></body>
<p><a/><a/>Printable Text</p>
<a/>
<a/>Printable Text
If you can (or want to) skip the lxml part altogether, you can get your exact expected output by printing directly from the soup with CSS selectors:
for s in soup.select('*'):
    print(s)
Output:
<html><body><p><a></a><a></a>Printable Text</p></body></html>
<body><p><a></a><a></a>Printable Text</p></body>
<p><a></a><a></a>Printable Text</p>
<a></a>
<a></a>