Iterate through all sub-tags and strings from an XML tag in Python, without specifying sub-tag name

Question

My question is an add on from here, but I'm not meant to use the answer section for add-on questions.

If I have part of an XML file like this:

  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:

          -  women undergoing cesarean section for any indication

          -  literate in german language

        Exclusion Criteria:

          -  history of keloids

          -  previous transversal suprapubic scars

          -  known patient hypersensitivity to any of the suture materials used in the protocol

          -  a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
             corticosteroid use)
      </textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>45 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>

I want to pull out all of the strings in this eligibility section (i.e., the string in the textblock section and the gender, minimum age, maximum age and healthy volunteers sections).

Using the code from the linked question, I did this:

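import sys
from bs4 import BeautifulSoup

# Parse the XML file given as the first command-line argument
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

eligibi = []

# The sub-tag names are hard-coded here, which is the limitation in question
for eligibility in soup.find_all('eligibility'):
    d = {'other_name': eligibility.criteria.textblock.string, 'gender': eligibility.gender.string}
    eligibi.append(d)

print(eligibi)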

My problem is I have many files. Sometimes the structure of the XML file might be:

eligibility -> criteria -> textblock -> text
eligibility -> other things (e.g. gender as above) -> text
eligibility -> text

Is there a way to simply take all of the sub-headings and their texts?

So in the above example, the list/dictionary would contain: {criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

My problem is, in reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. maybe some say 'pregnant women accepted', 'drug history of XXX accepted' etc)

So what I want is this: given a tag name, get all of its sub-tags and the text of those sub-tags in a dictionary.

Extended XML for comment:

<brief_title>Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery</brief_title>
<source>Klinikum Klagenfurt am Wörthersee</source>

...and then the eligibility XML section above.

Recommended answer

Since you have lxml installed, you can try the following (this code assumes that leaf elements within a given element, i.e., eligibility, are unique):

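import sys
from lxml import etree

# Parse the XML file given as the first command-line argument
tree = etree.parse(sys.argv[1])
root = tree.getroot()

eligibi = []

for eligibility in root.xpath('//eligibility'):
    d = {}
    # Collect every leaf element (no child elements) under this eligibility
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)

print(eligibi)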

XPath explanation:

  • .//* : find all elements within current eligibility, no matter its depth (//) and tag name (*)
  • [not(*)] : filter elements found by the previous bit to those that don't have any child element aka leaf elements
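
As a rough illustration of the leaf-element XPath, here is a small self-contained sketch that runs against an inline sample instead of a file. The helper name `leaf_texts` and the `SAMPLE` document are made up for this example. It also swaps `.//*` for `descendant-or-self::*`, a variation (not part of the answer above) so that an `eligibility` element containing only text is picked up as well, and it strips surrounding whitespace from each value.

```python
from lxml import etree

# Hypothetical sample, shortened from the XML in the question
SAMPLE = """<clinical_study>
  <brief_title>Example study</brief_title>
  <eligibility>
    <criteria>
      <textblock>
        Inclusion Criteria:
          -  literate in german language
      </textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
  </eligibility>
</clinical_study>"""


def leaf_texts(root, tag_name):
    """Return one dict per <tag_name> element, mapping leaf tag -> stripped text.

    As in the answer, leaf tag names are assumed to be unique within each
    element; a repeated name would overwrite the earlier value.
    """
    results = []
    for parent in root.iter(tag_name):
        d = {}
        # descendant-or-self also covers a parent that has text but no children
        for leaf in parent.xpath('descendant-or-self::*[not(*)]'):
            d[leaf.tag] = (leaf.text or '').strip()
        results.append(d)
    return results


root = etree.fromstring(SAMPLE)
rows = leaf_texts(root, 'eligibility')
print(rows[0]['gender'])   # Female
print(sorted(rows[0]))     # ['gender', 'minimum_age', 'textblock']
```

Note that `criteria` does not appear as a key, because it has a child element; only `textblock`, its leaf, is kept.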
