Equivalent to InnerHTML when using lxml.html to parse HTML


Question

I'm working on a script that uses lxml.html to parse web pages. I have done a fair bit with BeautifulSoup in my time, but I am now experimenting with lxml because of its speed.

I would like to know the most sensible way, within the library, to do the equivalent of JavaScript's innerHTML: that is, to retrieve or set the complete contents of a tag.

<body>
<h1>A title</h1>
<p>Some text</p>
</body>

The innerHTML is therefore:

<h1>A title</h1>
<p>Some text</p>

I can do it using hacks (converting to a string, regexes, etc.), but I assume there is a correct way to do this with the library that I am missing due to unfamiliarity. Thanks for any help.

Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

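from io import StringIO  # Python 3; the original post used cStringIO (Python 2)
from lxml import html

t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
# Leading text of <body>, then each direct child serialized with its tail text.
# iterchildren() is used so that nested elements are not emitted twice.
print((body.text or '') + ''.join(
    html.tostring(child, encoding='unicode') for child in body.iterchildren()))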

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
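
To make that repair visible, here is a minimal sketch (Python 3, assuming lxml is installed). The input string is a made-up, whitespace-free variant of the markup above, and the variable names are only illustrative. It parses markup with an unclosed p tag and serializes the result, so the closing tag supplied by the parser shows up in the output.

```python
from lxml import html

# Deliberately malformed input: the second <p> is never closed.
broken = ("<body><h1>A title</h1><p>Some text</p>"
          "Untagged text<p>Unclosed p tag</body>")

body = html.document_fromstring(broken).body

# Serialize the parsed tree. Every <p> element is written with a closing
# tag, including the one that was missing from the input.
repaired = html.tostring(body, encoding="unicode")
print(repaired)
```

If the exact original markup matters (for example when diffing against the source), this normalization means the serialized text will not match the input byte for byte.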

Recommended answer

You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node. Note that iterdescendants() walks every descendant, not just the direct children; it gives the same result here only because the markup is flat:

>>> from io import StringIO
>>> from lxml import etree
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
...     print(etree.tostring(child, encoding="unicode"))
...
<h1>A title</h1>

<p>Some text</p>

This can be shortened as follows:

print(''.join(etree.tostring(child, encoding="unicode") for child in root.iterdescendants()))
