Get all text inside a tag in lxml
Question
I'd like to write a code snippet that would grab all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that would miss the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Recommended answer
Try:
from lxml.etree import tostring

def stringify_children(node):
    # node.text is the text before the first child (None if absent).
    # tostring() serializes each child together with its tail text
    # (with_tail=True is the default); encoding="unicode" returns str, not bytes.
    parts = [node.text] + [tostring(c, encoding="unicode") for c in node]
    # filter removes a possible None in node.text
    return ''.join(filter(None, parts))
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'