python xml.etree.ElementTree: remove an empty tag in the middle of text


Question

I have an XML document from which I want to extract text based on tags.
The part I want to extract text from looks something like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

This is what I do:

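import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the first one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text  # holds only the text before the first child element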

I'm only able to grab the part that comes before the empty tag <TIP CONTENT="­"/>.
I tried to delete this tag before getting the rest of the text.
I did:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this does not work.
Neither <BlockText> nor <TIP> is a direct child of root.


Thank you.
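
The behaviour described in the question follows from how ElementTree stores mixed content: an element's .text holds only the text before its first child, and the text that follows a child is stored on that child's .tail. Element.remove() also only accepts a direct child of the element it is called on. A minimal sketch, assuming a shortened, well-formed variant of the sample (attr2 is quoted here, and the <doc> wrapper is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the sample (attr2 is quoted here).
SAMPLE = (
    '<doc><BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">'
    'It has survived not only<TIP CONTENT="x"/> five centuries.'
    '</BlockText></doc>'
)

root = ET.fromstring(SAMPLE)
block = root.find("BlockText")
tip = block.find("TIP")

before = block.text  # text before the first child element
after = tip.tail     # text after <TIP/> is stored on the TIP element itself

print(repr(before))  # 'It has survived not only'
print(repr(after))   # ' five centuries.'

# remove() only works on direct children, so calling it on root fails here.
try:
    root.remove(tip)
    removed_from_root = True
except ValueError:
    removed_from_root = False

print(removed_from_root)  # False
```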

Recommended answer

Another solution, for reference only:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
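
For comparison, the same text can probably be obtained with the standard library alone. Element.itertext() yields an element's text and the text and tails of its descendants in document order, so joining it gives the full text without deleting anything. If the tag really has to be removed, it must be removed from its actual parent, and its tail re-attached so the trailing text is not lost. This is a sketch under assumptions: the sample string is a shortened, well-formed variant of the one above, and the helper name remove_keep_tail is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the sample (attr2 is quoted here).
SAMPLE = (
    '<doc><BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">'
    'It has survived not only<TIP CONTENT="x"/> five centuries.'
    '</BlockText></doc>'
)


def remove_keep_tail(parent, child):
    """Remove child from parent, re-attaching the child's tail text."""
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)
    if index == 0:
        # No earlier sibling: the tail continues the parent's own text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise it continues the previous sibling's tail.
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


root = ET.fromstring(SAMPLE)

# Option 1: read all nested text without modifying the tree.
full_text = ["".join(block.itertext()) for block in root.iter("BlockText")]

# Option 2: remove every <TIP/> through its actual parent.
# list(...) snapshots the elements so the tree can be modified while looping.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "TIP":
            remove_keep_tail(parent, child)

text_after_removal = root.find("BlockText").text

print(full_text)
print(text_after_removal)
```

Option 1 is usually enough when only the text is needed; Option 2 matters when the cleaned tree is written back out or processed further.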
