Convert nested XML content into CSV using xml tree in Python

Question

I'm very new to Python, so please bear with me. When I tried to convert the XML content into a list of dictionaries, I got output, but not the output I expected, even after a lot of experimenting.

XML content:

<project>
<data>
    <row>
        <respondent>m0wxo5f6w42h3fot34m7s6xij</respondent>
        <timestamp>10-06-16 11:30</timestamp>
        <product>1</product>
        <replica>1</replica>
        <seqnr>1</seqnr>
        <session>1</session>
        <column>
            <question>Q1</question>
            <answer>a1</answer>
        </column>
        <column>
            <question>Q2</question>
            <answer>a2</answer>
        </column>
    </row>
    <row>
        <respondent>w42h3fot34m7s6x</respondent>
        <timestamp>10-06-16 11:30</timestamp>
        <product>1</product>
        <replica>1</replica>
        <seqnr>1</seqnr>
        <session>1</session>
        <column>
            <question>Q3</question>
            <answer>a3</answer>
        </column>
        <column>
            <question>Q4</question>
            <answer>a4</answer>
        </column>
        <column>
            <question>Q5</question>
            <answer>a5</answer>
        </column>
    </row>
</data>
</project>

The code I used:

import csv
import xml.etree.ElementTree as ET

tree = ET.parse('xml_file.xml')   # path to the input XML file
root = tree.getroot()
csv_file = 'output.csv'           # placeholder output path (not in the original post)
data_list = []

for item in root.find('./data'):    # iterate over the <row> elements under <data>
  data = {}              # dictionary to store the content of each row
  for child in item:
    data[child.tag] = child.text   # add item to dictionary

#-----------------for loop with subchild is not working as expected in my case
    for subchild in child:
      data[subchild.tag] = subchild.text
      data_list.append(data)
print(data_list)

headers = {k for d in data_list for k in d.keys()} # headers for csv
with open(csv_file, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=headers)    # creating a DictWriter object
    writer.writeheader()    # write headers to csv
    writer.writerows(data_list)

The dictionaries in data_list only contain the last question's info for each row. I guess the issue is in the subchild for loop, but I don't understand how to append the dictionaries to the list.

[{
  'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
  'timestamp': '10-06-16 11:30',
  'product': '1',
  'replica': '1',
  'seqnr': '1',
  'session': '1',
  'column': '\n  ',
  'question': 'Q2',
  'answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x',
  'timestamp': '10-06-16 11:30',
  'product': '1',
  'replica': '1',
  'seqnr': '1',
  'session': '1',
  'column': '\n ',
  'question': 'Q2',
  'answer': 'a2'
}.......
]

I expect the output below. I tried a lot but was unable to loop over the column tags.

[{
    'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
    'timestamp': '10-06-16 11:30',
    'product': '1',
    'replica': '1',
    'seqnr': '1',
    'session': '1',
    'question': 'Q1',
    'answer': 'a1'
  },
  {
    'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
    'timestamp': '10-06-16 11:30',
    'product': '1',
    'replica': '1',
    'seqnr': '1',
    'session': '1',
    'question': 'Q2',
    'answer': 'a2'
  },
  {
    'respondent': 'w42h3fot34m7s6x',
    'timestamp': '10-06-16 11:30',
    'product': '1',
    'replica': '1',
    'seqnr': '1',
    'session': '1',
    'question': 'Q3',
    'answer': 'a3'
  },
  {
    'respondent': 'w42h3fot34m7s6x',
    'timestamp': '10-06-16 11:30',
    'product': '1',
    'replica': '1',
    'seqnr': '1',
    'session': '1',
    'question': 'Q4',
    'answer': 'a4'
  },
  {
    'respondent': 'w42h3fot34m7s6x',
    'timestamp': '10-06-16 11:30',
    'product': '1',
    'replica': '1',
    'seqnr': '1',
    'session': '1',
    'question': 'Q5',
    'answer': 'a5'
  }
]

I have referred to many Stack Overflow questions on the XML tree, but they still didn't help me.

Any help or suggestions would be appreciated.

Recommended answer

I had trouble understanding what this code is supposed to do, because it uses abstract variable names like item, child, and subchild, which make the code hard to reason about. I'm not as clever as that, so I renamed the variables to row, tag, and column to make it easier for me to see what the code is doing. (In my book, even row and column are a bit abstract, but I suppose the opacity of the XML input is hardly your fault.)

You have 2 rows but you want 5 dictionaries, because you have 5 <column> tags and you want each <column>'s data in a separate dictionary. You also want the other tags in the <row> to be repeated alongside each <column>'s data.

That means you need to build a dictionary for every <row>, then, for each <column>, add that column's data to the dictionary and output it before going on to the next column.
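
One detail of the "output it" step is easy to miss: appending the same dictionary object several times stores several references to one dict, so a later update shows up in every entry. The sketch below illustrates the effect with made-up keys and values (the names shared, aliased, and snapshots are only for illustration), and shows how appending a copy avoids it.

```python
# Appending the same dict object repeatedly stores references, not snapshots.
shared = {"respondent": "r1"}
aliased = []
for question in ("Q1", "Q2"):
    shared["question"] = question
    aliased.append(shared)          # the same object is appended twice

# Appending a shallow copy freezes the values as they are at that moment.
snapshots = []
for question in ("Q1", "Q2"):
    shared["question"] = question
    snapshots.append(shared.copy())

print([d["question"] for d in aliased])    # ['Q2', 'Q2']
print([d["question"] for d in snapshots])  # ['Q1', 'Q2']
```

A shallow copy is enough here because all the values are plain strings.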

This code makes the simplifying assumption that all of your <column>s have the same structure, with exactly one <question> and exactly one <answer> and nothing else. If this assumption does not hold, a <column> may be reported with stale data inherited from the previous <column> in the same row. It will also produce no output at all for any <row> that does not have at least one <column>.
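
If that assumption might not hold, one option is to build a fresh dictionary for each <column> instead of reusing one per row. The sketch below is an illustration, not the code from this answer: the helper name rows_to_records, the shortened sample XML, and the choice to keep a <row> without any <column> as a single record are all assumptions.

```python
import xml.etree.ElementTree as ET


def rows_to_records(root):
    """Return one dict per <column>, repeating the row-level fields.

    Hypothetical helper: a fresh dict is built for every <column>, so a
    missing <question> or <answer> cannot inherit a value from the
    previous <column> in the same row.
    """
    records = []
    for row in root.find("data"):
        # Row-level fields: every direct child that is not a <column>.
        row_fields = {el.tag: el.text for el in row if el.tag != "column"}
        columns = row.findall("column")
        if not columns:
            # Assumption: keep rows without any <column> as one record.
            records.append(dict(row_fields))
        for column in columns:
            column_fields = {el.tag: el.text for el in column}
            records.append({**row_fields, **column_fields})
    return records


# Made-up sample: the second <column> has no <answer>, the second <row>
# has no <column> at all.
sample = """<project><data>
  <row><respondent>r1</respondent>
    <column><question>Q1</question><answer>a1</answer></column>
    <column><question>Q2</question></column>
  </row>
  <row><respondent>r2</respondent></row>
</data></project>"""

records = rows_to_records(ET.fromstring(sample))
for record in records:
    print(record)
```

With this sample, the Q2 record has no 'answer' key rather than a stale 'a1'. If such records are later written with csv.DictWriter, missing keys are filled with its restval (an empty string by default).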

The code has to loop through the tags twice, once for the non-<column> tags and once for the <column>s. Otherwise it can't be sure it has seen all the non-<column> tags before it starts outputting the <column>s.

There are other (no doubt more elegant) ways to do this, but I kept the code structure as close to your original as I could, other than making the variable names less opaque.

for row in root.find('./data'):    # iterate over the <row> elements under <data>
    data = {}              # row-level data, reused for each <column> in this row
    for tag in row:
        if tag.tag != "column":
            data[tag.tag] = tag.text   # add row-level field to dictionary
    # Now the dictionary data is built for the row level
    for tag in row:
        if tag.tag == "column":
            for column in tag:
                data[column.tag] = column.text
            # Now we have added the column level data for one column tag
            data_list.append(data.copy())

The output is below. The keys are not in their original order because I used pprint.pprint for convenience, and it sorts dictionary keys by default.

[{'answer': 'a1',
  'product': '1',
  'question': 'Q1',
  'replica': '1',
  'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
  'seqnr': '1',
  'session': '1',
  'timestamp': '10-06-16 11:30'},
 {'answer': 'a2',
  'product': '1',
  'question': 'Q2',
  'replica': '1',
  'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
  'seqnr': '1',
  'session': '1',
  'timestamp': '10-06-16 11:30'},
 {'answer': 'a3',
  'product': '1',
  'question': 'Q3',
  'replica': '1',
  'respondent': 'w42h3fot34m7s6x',
  'seqnr': '1',
  'session': '1',
  'timestamp': '10-06-16 11:30'},
 {'answer': 'a4',
  'product': '1',
  'question': 'Q4',
  'replica': '1',
  'respondent': 'w42h3fot34m7s6x',
  'seqnr': '1',
  'session': '1',
  'timestamp': '10-06-16 11:30'},
 {'answer': 'a5',
  'product': '1',
  'question': 'Q5',
  'replica': '1',
  'respondent': 'w42h3fot34m7s6x',
  'seqnr': '1',
  'session': '1',
  'timestamp': '10-06-16 11:30'}]
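
To finish the original task, the list of dictionaries still has to be written out as CSV. A minimal sketch follows, reusing the loop above. The explicit FIELDNAMES list, the shortened one-row sample, and the in-memory io.StringIO buffer standing in for a real file are illustrative choices, not part of the original answer.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Explicit header order; building the headers as a set gives an arbitrary order.
FIELDNAMES = ["respondent", "timestamp", "product", "replica",
              "seqnr", "session", "question", "answer"]

# Shortened sample with a single <row>, for illustration only.
xml_text = """<project><data>
  <row>
    <respondent>w42h3fot34m7s6x</respondent>
    <timestamp>10-06-16 11:30</timestamp>
    <product>1</product><replica>1</replica>
    <seqnr>1</seqnr><session>1</session>
    <column><question>Q3</question><answer>a3</answer></column>
    <column><question>Q4</question><answer>a4</answer></column>
  </row>
</data></project>"""

root = ET.fromstring(xml_text)
data_list = []
for row in root.find("./data"):
    data = {}
    for tag in row:                      # first pass: row-level fields
        if tag.tag != "column":
            data[tag.tag] = tag.text
    for tag in row:                      # second pass: one record per <column>
        if tag.tag == "column":
            for column in tag:
                data[column.tag] = column.text
            data_list.append(data.copy())

# For a real file, open it with newline="" as the csv module docs recommend.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(data_list)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a list for fieldnames fixes the column order in the CSV, which a set comprehension over the keys does not guarantee.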
