Creating dictionary from XML file


Question

I have an XML file that looks like this:

<?xml version="1.0" encoding="utf8"?>
<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
    <Enzyme>M1.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
    <Enzyme>M2.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
  <Organism>
    <Name>Bacillus cellulosilyticus</Name>
    <Enzyme>M1.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
    <Enzyme>M2.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
  </Organism>

For each Organism there are multiple Enzymes and Motifs. Enzymes are unique but motifs can repeat. So I tried to create a dictionary with the enzyme as the key and the motif as the value. This is my code:

    import xml.etree.ElementTree as ET

    def lister():
        tree = ET.parse('rebase.xml')
        rebase = tree.getroot()

        data_dict = {}

        for each_organism in rebase.findall('Organism'):
            try:
                enzyme = each_organism.find('Enzyme').text
            except AttributeError:
                continue

            for motif in each_organism.findall('Motif'):
                motif = motif.text
                data_dict[enzyme] = motif
        return data_dict

However, the dictionary seems to have omitted quite a few entries. I can't seem to understand what the issue is. Any help will be appreciated.

EDIT

A user posted a solution but then deleted it; however, I was able to copy it in time:

    for each_organism in rebase.findall('Organism'):
        try:
            enzyme = each_organism.find('Enzyme').text
        except AttributeError:
            continue
        data_dict[enzyme] = []
        for motif in each_organism.findall('Motif'):
            data_dict[enzyme].append(motif.text)
    return data_dict

However, the dictionary returned in this case is wrong, and here's why:

An enzyme-motif pair is unique: one enzyme has exactly one motif. Throughout my file an enzyme occurs only once. A motif can occur multiple times, but each time it belongs to a different enzyme, so the pair is unique. What the code under EDIT does is this:

Assume an enzyme M.ApaI with motif GATC and another one, M.ApaII, with motif TCAG. Both enzymes are pretty similar (differing only in the last character, I). The code binds both motifs to the first enzyme, creating {'M.ApaI': ['GATC', 'TCAG']}

Solution

The first big problem I see is that you're only searching for the first Enzyme within any given Organism. If you wanted to find each incidence of Enzyme, you should use:

for enzyme in each_organism.findall('Enzyme'):
    # add to dictionary here

The second problem is that the format of your XML doesn't match the data relations you seem to be building with your dictionary. Within the XML, Enzyme, Motif, and Name are all children of Organism, but you're assigning motif as a value associated with the enzyme key. When iterating through occurrences of Enzyme and Motif, you have no way of knowing, necessarily, which one should be associated with the other, because they're all jammed together without any logical separation in the object.
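To make the flat structure concrete, here is a minimal sketch that parses a trimmed copy of the sample and lists the children of one Organism. The `SAMPLE` string is an assumption: it is the asker's snippet cut down to one organism, with a closing `</rebase>` added so that it parses.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample, closed with </rebase> so it is well-formed
# (an assumption about the real file).
SAMPLE = """<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
    <Enzyme>M1.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
    <Enzyme>M2.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
</rebase>"""

root = ET.fromstring(SAMPLE)
organism = root.find('Organism')

# Every element is a direct child of Organism; nothing nests a Motif
# under its Enzyme.
child_tags = [child.tag for child in organism]
print(child_tags)

# find() returns only the first match, so the second Enzyme is never seen.
first_enzyme = organism.find('Enzyme').text
print(first_enzyme)
```

The printed tag list is a single flat sequence, which is why `find('Enzyme')` combined with `findall('Motif')` ends up attaching every motif to the first enzyme.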

I could be misunderstanding your purpose here, but it seems like you'd be better served by constructing Organism and Enzyme class objects rather than forcing two (apparently) unrelated concepts into a key-value relationship.

This could look like so, and encapsulate your fields:

class Organism:
    # where enzymes is an iterable of Enzyme
    def __init__(self, name, enzymes):
        self.name = name
        self.enzymes = enzymes

and your Enzyme object:

class Enzyme:
    # where motifs is an iterable of string
    def __init__(self, motifs):
        self.motifs = motifs

All this would still require some sort of change in your XML file. Unless you just parse it by line (which is decidedly not the point of XML), I can't think of any easy ways you'd be able to figure out which Motifs belong to which Enzyme right now.
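As an illustration of what such a change might look like, the sketch below assumes a hypothetical layout in which each Motif is nested inside its Enzyme and the enzyme name moves to a `name` attribute. Neither the layout nor the attribute comes from the asker's file; they are made up here only to show how the pairing becomes unambiguous.

```python
import xml.etree.ElementTree as ET

# Hypothetical restructured layout: each Motif is nested inside its Enzyme,
# and the enzyme name lives in a "name" attribute. Not the asker's real file.
NESTED = """<rebase>
  <Organism>
    <Name>Bacillus cellulosilyticus</Name>
    <Enzyme name="M1.BceNI"><Motif>CCCNNNNNCTC</Motif></Enzyme>
    <Enzyme name="M2.BceNI"><Motif>CCCNNNNNCTC</Motif></Enzyme>
  </Organism>
</rebase>"""

root = ET.fromstring(NESTED)

nested_pairs = {}
for enzyme in root.iter('Enzyme'):
    # The motif is a child of the enzyme, so no positional guessing is needed.
    motif = enzyme.find('Motif')
    if motif is not None:
        nested_pairs[enzyme.get('name')] = motif.text

print(nested_pairs)
```

With this shape, the parent-child relation in the XML carries the enzyme-to-motif link directly, and the same loop could just as easily populate Organism and Enzyme objects.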

Edit: seeing as you're asking about ways to just iterate fairly blindly through each Enzyme node, and assuming that you always have a single Name element, that you have one Motif for each Enzyme, and that the elements after Name alternate Enzyme then Motif (e.g. E-M-E-M etc.), you should be able to do this:

enzymes = []
motifs = []

# enumerate() advances the index on every child, including the skipped one
for i, element in enumerate(each_organism):
    # skip the first child, which is assumed to be Name
    if i == 0:
        continue
    # if we're at an odd index, indicating an enzyme
    if i % 2 == 1:
        enzymes.append(element.text)
    # if we're at an even index, indicating the related motif
    else:
        motifs.append(element.text)

Then, presuming every assumption I laid out, and probably a couple more (I'm not even 100% sure etree always iterates elements top-down), hold true, any motif at any given index in motifs will belong to the enzyme at the same index in enzymes. In case I haven't already made it clear: this is incredibly brittle code.
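A slightly less index-dependent variation, offered only as a sketch, is to key off each child's tag and remember the most recent Enzyme. It still assumes that every Motif directly follows the Enzyme it belongs to; it relies on an Element yielding its children in document order when iterated. The function name `enzyme_motif_dict` and the closed-off `SAMPLE` string are assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET

# The asker's sample, closed with </rebase> so it parses (an assumption).
SAMPLE = """<rebase>
  <Organism>
    <Name>Aminomonas paucivorans</Name>
    <Enzyme>M1.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
    <Enzyme>M2.Apa12260I</Enzyme>
    <Motif>GGAGNNNNNGGC</Motif>
  </Organism>
  <Organism>
    <Name>Bacillus cellulosilyticus</Name>
    <Enzyme>M1.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
    <Enzyme>M2.BceNI</Enzyme>
    <Motif>CCCNNNNNCTC</Motif>
  </Organism>
</rebase>"""


def enzyme_motif_dict(root):
    """Pair each Enzyme with the Motif that follows it (hypothetical helper)."""
    data = {}
    for organism in root.findall('Organism'):
        current_enzyme = None
        for child in organism:  # children come back in document order
            if child.tag == 'Enzyme':
                current_enzyme = child.text
            elif child.tag == 'Motif' and current_enzyme is not None:
                data[current_enzyme] = child.text
                current_enzyme = None  # one motif per enzyme
    return data


result = enzyme_motif_dict(ET.fromstring(SAMPLE))
print(result)
```

A Motif that appears without a preceding Enzyme is silently ignored here, and an Enzyme with no Motif is left out of the result; whether that is acceptable depends on the real data.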
