Print HTML tags as strings using lxml etree


Question

I want to print each tag as a whole, the way soup.find_all() does, but using lxml etree. With lxml, my code prints only the tag name instead of the whole tag, which I need for comparison purposes. Thank you.

Code:

from bs4 import BeautifulSoup
from lxml import etree

doc = "<p><a></a><a></a>Printable Text"
soup = BeautifulSoup(doc, "lxml")
root = etree.fromstring(str(soup))

tree = etree.ElementTree(root)
for e in tree.iter():
    print(e.tag)
    print("--------------")

Output:

html
--------------
body
--------------
p
--------------
a
--------------
a
--------------

Expected output:

<html><body><p><a></a><a></a>Printable Text</p></body></html>
--------------
<body><p><a></a><a></a>Printable Text</p></body>
--------------
<p><a></a><a></a>Printable Text</p>
--------------
<a></a>
--------------
<a></a>
--------------

Recommended answer

You don't really need to parse your doc with BeautifulSoup, then parse the soup with lxml, and finally wrap the result in an ElementTree (note also that the doc in your question is missing the closing </p> tag). But if you want or need to stick to that pipeline, you can get close to your expected output (though not 100% of the way) by changing your for loop from

for e in tree.iter():
    print(e.tag)
    

to (as @mzjn mentioned in the comments):

for e in tree.iter():
    print(etree.tostring(e).decode())
    

If you want to (and can) skip the ElementTree step, you can get the same output with XPath:

for e in root.xpath('//*'):
    print(etree.tostring(e).decode())
    

In either case, the output is:

<html><body><p><a/><a/>Printable Text</p></body></html>
<body><p><a/><a/>Printable Text</p></body>
<p><a/><a/>Printable Text</p>
<a/>
<a/>Printable Text

If you can and want to skip the lxml part altogether, you can get your exact expected output by printing directly from the soup with CSS selectors:

for s in soup.select('*'):
    print(s)

Output:

<html><body><p><a></a><a></a>Printable Text</p></body></html>
<body><p><a></a><a></a>Printable Text</p></body>
<p><a></a><a></a>Printable Text</p>
<a></a>
<a></a>
