Convert nested XML content into CSV using xml tree in Python
Question
I'm very new to Python, so please bear with me. When I try to convert the XML content into a list of dictionaries, I get output, but not the output I expect, even after a lot of experimenting.
XML content
<project>
<data>
<row>
<respondent>m0wxo5f6w42h3fot34m7s6xij</respondent>
<timestamp>10-06-16 11:30</timestamp>
<product>1</product>
<replica>1</replica>
<seqnr>1</seqnr>
<session>1</session>
<column>
<question>Q1</question>
<answer>a1</answer>
</column>
<column>
<question>Q2</question>
<answer>a2</answer>
</column>
</row>
<row>
<respondent>w42h3fot34m7s6x</respondent>
<timestamp>10-06-16 11:30</timestamp>
<product>1</product>
<replica>1</replica>
<seqnr>1</seqnr>
<session>1</session>
<column>
<question>Q3</question>
<answer>a3</answer>
</column>
<column>
<question>Q4</question>
<answer>a4</answer>
</column>
<column>
<question>Q5</question>
<answer>a5</answer>
</column>
</row>
</data>
</project>
The code I used:
import csv
import xml.etree.ElementTree as ET

tree = ET.parse('xml_file.xml')  # parse the XML file
root = tree.getroot()
data_list = []
for item in root.find('./data'):  # iterate over each <row> under <data>
    data = {}  # dictionary to store the content of each row
    for child in item:
        data[child.tag] = child.text  # add item to dictionary
        # ----- this loop over subchild is not working as expected in my case
        for subchild in child:
            data[subchild.tag] = subchild.text
    data_list.append(data)
print(data_list)

headers = {k for d in data_list for k in d.keys()}  # headers for csv
with open(csv_file, 'w') as f:
    writer = csv.DictWriter(f, fieldnames=headers)  # creating a DictWriter object
    writer.writeheader()  # write headers to csv
    writer.writerows(data_list)
In the resulting data_list, each dictionary contains only the last question of its row. I guess the issue is in the subchild for loop, but I don't understand how to append one dictionary per column to the list.
[{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'column': '\n ',
'question': 'Q2',
'answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'column': '\n ',
'question': 'Q2',
'answer': 'a2'
}.......
]
I expect the output below. I have tried a lot, but I am unable to loop over the column tags.
[{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q1',
'answer': 'a1'
},
{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q2',
'answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q3',
'answer': 'a3'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q4',
'answer': 'a4'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q5',
'answer': 'a5'
}
]
I have referred to many Stack Overflow questions on XML trees, but none of them helped.
Any help or suggestions would be appreciated.
Recommended answer
I had a problem understanding what this code is supposed to do because it uses abstract variable names like item, child, and subchild, and this makes it hard to reason about the code. I'm not as clever as that, so I renamed the variables to row, tag, and column to make it easier for me to see what the code is doing. (In my book, even row and column are a bit abstract, but I suppose the opacity of the XML input is hardly your fault.)
You have 2 rows but you want 5 dictionaries, because you have 5 <column> tags and you want each <column>'s data in a separate dictionary. But you also want the other tags in the <row> to be repeated alongside each <column>'s data.
That means you need to build a dictionary for every <row>, then, for each <column>, add that column's data to the dictionary and output a copy of it before going on to the next column.
This code makes the simplifying assumption that all of your <column>s have the same structure, with exactly one <question> and exactly one <answer> and nothing else. If this assumption does not hold, then a <column> may get reported with stale data it inherited from the previous <column> in the same row. The code will also produce no output at all for any <row> that does not have at least one <column>.
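If those two caveats matter for your data, one possible variant (a sketch, not the code from this answer) builds a fresh dictionary for every <column> from the row-level fields, so nothing can leak from an earlier column, and still emits rows that have no <column> at all. The function name flatten_rows and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def flatten_rows(data_element):
    """Return one dict per <column>, each carrying its parent <row>'s fields."""
    records = []
    for row in data_element.findall("row"):
        # Row-level fields: every direct child that is not a <column>.
        base = {child.tag: child.text for child in row if child.tag != "column"}
        columns = row.findall("column")
        if not columns:
            # Keep rows without any <column> instead of dropping them.
            records.append(base)
            continue
        for column in columns:
            # A fresh dict per <column>, so nothing leaks from the previous one.
            record = dict(base)
            record.update({field.tag: field.text for field in column})
            records.append(record)
    return records


# Hypothetical input: the second <column> has no <answer>,
# and the second <row> has no <column> at all.
sample = """<project><data>
<row><respondent>r1</respondent>
  <column><question>Q1</question><answer>a1</answer></column>
  <column><question>Q2</question></column>
</row>
<row><respondent>r2</respondent></row>
</data></project>"""

records = flatten_rows(ET.fromstring(sample).find("data"))
for record in records:
    print(record)
```

Here the Q2 record has no answer key rather than a stale a1, and the r2 row still appears with only its row-level fields.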
The code has to loop through the tags twice, once for the non-<column> tags and once for the <column>s. Otherwise it can't be sure it has seen all the non-<column> tags before it starts outputting the <column>s.
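To see why the second pass matters, here is a small sketch comparing a single-pass and a two-pass loop. The input is hypothetical: unlike the XML in the question, it places a row-level tag (<session>) after a <column>. The function names single_pass and two_pass are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical row in which a row-level tag (<session>) follows a <column>.
row = ET.fromstring(
    "<row><respondent>r1</respondent>"
    "<column><question>Q1</question><answer>a1</answer></column>"
    "<session>1</session></row>"
)


def single_pass(row):
    data, out = {}, []
    for tag in row:
        if tag.tag == "column":
            for field in tag:
                data[field.tag] = field.text
            out.append(data.copy())  # emitted before <session> has been seen
        else:
            data[tag.tag] = tag.text
    return out


def two_pass(row):
    data, out = {}, []
    for tag in row:  # pass 1: row-level tags only
        if tag.tag != "column":
            data[tag.tag] = tag.text
    for tag in row:  # pass 2: one record per <column>
        if tag.tag == "column":
            for field in tag:
                data[field.tag] = field.text
            out.append(data.copy())
    return out


print(single_pass(row))  # the record lacks 'session'
print(two_pass(row))     # the record includes 'session'
```

With the XML from the question, where every row-level tag precedes the first <column>, both versions would give the same result; the two-pass form simply does not depend on that ordering.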
There are other (no doubt more elegant) ways to do this, but I kept the code structure as close to your original as I could, other than making the variable names less opaque.
for row in root.find('./data'):  # iterate over each <row> under <data>
    data = {}  # dictionary to store the content of each row
    for tag in row:
        if tag.tag != "column":
            data[tag.tag] = tag.text  # add row-level field to dictionary
    # Now the dictionary data is built for the row level
    for tag in row:
        if tag.tag == "column":
            for column in tag:
                data[column.tag] = column.text
            # Now we have added the column level data for one column tag
            data_list.append(data.copy())
The output is below. The keys appear in sorted order rather than in their original order because I used pprint.pprint for convenience.
[{'answer': 'a1',
'product': '1',
'question': 'Q1',
'replica': '1',
'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a2',
'product': '1',
'question': 'Q2',
'replica': '1',
'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a3',
'product': '1',
'question': 'Q3',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a4',
'product': '1',
'question': 'Q4',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a5',
'product': '1',
'question': 'Q5',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'}]
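For the final CSV step, a minimal sketch follows. It assumes data_list has the shape shown above (shortened here to two records with fewer keys) and writes to an in-memory buffer so it can be run as is. The helper name write_csv is made up for illustration. Building the header list with dict.fromkeys keeps the columns in first-seen order, whereas the set used in the question gives an arbitrary column order.

```python
import csv
import io

# Shortened stand-in for the data_list produced by the loop above.
data_list = [
    {"respondent": "m0wxo5f6w42h3fot34m7s6xij", "session": "1", "question": "Q1", "answer": "a1"},
    {"respondent": "w42h3fot34m7s6x", "session": "1", "question": "Q3", "answer": "a3"},
]

# Collect headers in first-seen order; a set would give an arbitrary order.
headers = list(dict.fromkeys(key for record in data_list for key in record))


def write_csv(records, fieldnames, stream):
    """Write the records to an open text stream as CSV with a header row."""
    # restval fills in any key that is missing from a given record.
    writer = csv.DictWriter(stream, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO(newline="")
write_csv(data_list, headers, buffer)
print(buffer.getvalue())
```

To write a real file instead, pass a file opened with open(path, "w", newline="") as the stream; the csv module documentation recommends newline="" so that no extra blank lines appear on Windows.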