Get all text inside a tag in lxml
Problem description
I'd like to write a code snippet that grabs all of the text inside the <content> tag in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text between the tags. I haven't had much luck searching the API for a relevant function. Could you help me out?
<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"
<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"
<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"
Solution
Try:
from lxml import etree

def stringify_children(node):
    # node.text is the text before the first child; it is None when absent
    parts = [node.text or '']
    for child in node:
        # tostring() already includes the child's own text and its tail text;
        # encoding='unicode' makes it return str instead of bytes on Python 3
        parts.append(etree.tostring(child, encoding='unicode'))
    return ''.join(parts)
Example:
from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)
Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'