python xml.etree.ElementTree: remove an empty tag in the middle of text
Question
I have an xml document from which I want to extract text based on tags.
The part that I want to extract text from looks something like this:
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
This is what I do:
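import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set(elem.tag for elem in root.iter()))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # .text only holds the text that precedes the first child element
    texte = text.text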
I'm only able to grab the part that comes before the empty tag <TIP CONTENT=""/>.
I tried to delete this tag before getting the rest of the text.
I did:
emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)
But this is not working.
Neither <BlockText> nor <TIP> is a direct child of root.
Thank you.
Recommended answer
Another solution for reference only
from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))
Result:
{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
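For comparison, here is a minimal sketch that stays within xml.etree.ElementTree. In ElementTree, the text that follows a child element is stored in that child's .tail rather than in the parent's .text, which is why text.text stops at <TIP/>. The sample below is a shortened version of the snippet from the question, wrapped in hypothetical <root> and <section> elements so that <BlockText> is not a direct child of root. attr2 is quoted because ElementTree only parses well-formed XML. The helper names full_text and remove_keeping_tail are made up for illustration, and if the real file uses namespaces the tag names would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question; <root>/<section> are hypothetical wrappers
# and attr2 is quoted (assumption: the real file is well-formed XML).
xml_data = '''<root><section>
<BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">
Simply dummy text. It has survived not only<TIP CONTENT=""/>
 five centuries, remaining essentially release.
</BlockText>
</section></root>'''

root = ET.fromstring(xml_data)


def full_text(elem):
    """Return elem.text plus the text and tail of every descendant."""
    return "".join(elem.itertext())


def remove_keeping_tail(root, tag):
    """Remove every `tag` element but keep the text that followed it."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter(tag)):
        parent = parent_of.get(elem)
        if parent is None:  # the root itself cannot be removed this way
            continue
        siblings = list(parent)
        idx = siblings.index(elem)
        tail = elem.tail or ""
        if idx == 0:
            # No previous sibling: the tail continues the parent's own text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise it continues the previous sibling's tail.
            prev = siblings[idx - 1]
            prev.tail = (prev.tail or "") + tail
        parent.remove(elem)


block = next(root.iter("BlockText"))

# Option 1: leave the tree alone and read all the text at once.
before = full_text(block)

# Option 2: actually delete <TIP/>, after which .text holds the whole string.
remove_keeping_tail(root, "TIP")
after = block.text

print(before)
```

If only the text is needed, itertext() alone is probably enough. Removing the element is worth the extra code only when the cleaned tree is going to be written back out, since calling parent.remove(elem) without re-attaching the tail would silently drop the text after the tag.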