Loop through XML in Python
Question
My dataset looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false"
name = "some_name">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false"
name = "some_name">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
<currentowner col_four="00001bvalue"
col_five="00001bvalue"
col_six="00001bfalse"
name = "some_name">
<addr col_seven="00001bvalue"
col_eight="00001bvalue"
col_nine="00001bfalse"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value"
name = "some_name">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false"
name = "some_name">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>
Currently I have two loops: one iterates through the child elements, the other through the grandchild elements.
import pandas
import xml.etree.ElementTree as element_tree
from xml.etree.ElementTree import parse

tree = element_tree.parse('<HERE_GOES_XML>')
root = tree.getroot()
name_space = {'ns0': 'http://SOMELINK'}

# root
date_from = root.attrib['date']
print(date_from)

# child
for pharma in root.findall('.//ns0:dept', name_space):
    for key, value in pharma.items():
        print(key + ': ' + value)

# grandchild; this must be merged into the loop above so the script walks an entire dept node before moving to the next
for owner in root.findall('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    owner_dict = {}
    for key, value in owner.items():
        print(key + ': ' + value)
The current result is:
2021-01-15
dept_id: 00001
col_two: 00001value
col_three: 00001false
dept_id: 00002
col_two: 00002value
col_three: 00002value
col_four: 00001value
col_five: 00001value
col_six: 00001false
col_four: 00002value
col_five: 00002value
col_six: 00002false
I am aiming for a nested loop that first iterates over an entire dept child together with its grandchildren, and only then moves on to the next dept. The expected result is the set below, to be transformed into a pandas DataFrame later (I will try to work on this next). Some columns have the same name in the child and grandchild elements, so either a prefix is required or the loop must go through specific children only.
dept.dept_id: 00001
dept.col_two: 00001value
dept.col_three: 00001false
dept.name: some_name
currentowner.col_four: 00001value
currentowner.col_five: 00001value
currentowner.col_six: 00001false
currentowner.name: some_name
currentowner.col_four: 00001bvalue
currentowner.col_five: 00001bvalue
currentowner.col_six: 00001bfalse
currentowner.name: some_name
addr.col_seven: 00001value
addr.col_eight: 00001value
addr.col_nine: 00001false
dept.dept_id: 00002
dept.col_two: 00002value
dept.col_three: 00002value
dept.name: some_name
currentowner.col_four: 00002value
currentowner.col_five: 00002value
currentowner.col_six: 00002false
currentowner.name: some_name
addr.col_seven: 00002value
addr.col_eight: 00002value
addr.col_nine: 00002false
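For illustration, here is a minimal sketch of the nested traversal described above. It assumes a trimmed-down inline sample in the same shape as the dataset (the values and the reduced attribute set are placeholders), and `local_name`, `prefixed_items` and `walk_depts` are hypothetical helper names. Each addr is emitted right after the owner it belongs to, which differs slightly from the ordering listed above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as the dataset (assumed values).
SAMPLE = """<depts xmlns="http://SOMELINK" date="2021-01-15">
  <dept dept_id="00001" name="some_name">
    <owners>
      <currentowner col_four="00001value" name="some_name">
        <addr col_seven="00001value"/>
      </currentowner>
      <currentowner col_four="00001bvalue" name="some_name">
        <addr col_seven="00001bvalue"/>
      </currentowner>
    </owners>
  </dept>
  <dept dept_id="00002" name="some_name">
    <owners>
      <currentowner col_four="00002value" name="some_name">
        <addr col_seven="00002value"/>
      </currentowner>
    </owners>
  </dept>
</depts>"""

NS = {"ns0": "http://SOMELINK"}


def local_name(element):
    """Return the tag without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def prefixed_items(element):
    """Yield 'tag.attribute: value' strings for one element."""
    for key, value in element.items():
        yield f"{local_name(element)}.{key}: {value}"


def walk_depts(root):
    """Collect lines dept by dept: dept attributes, then each owner and its addr."""
    lines = []
    for dept in root.findall("ns0:dept", NS):
        lines.extend(prefixed_items(dept))
        # Search relative to the current dept, not the root,
        # so only this dept's owners are visited.
        for owner in dept.findall("ns0:owners/ns0:currentowner", NS):
            lines.extend(prefixed_items(owner))
            for addr in owner.findall("ns0:addr", NS):
                lines.extend(prefixed_items(addr))
    return lines


lines = walk_depts(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

The key difference from the two separate loops is that the inner `findall` calls start from the current `dept` (or `owner`) element rather than from `root`.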
[UPDATE] I came across zip, which should do the trick.
dept_list = []
for item in root.iterfind('.//ns0:dept', name_space):
    # print(item.attrib)
    dept_list.append(item.attrib)
# print(dept_list)

owner_list = []
for item in root.iterfind('.//ns0:dept/ns0:owners/ns0:currentowner', name_space):
    # print(item.attrib)
    owner_list.append(item.attrib)
# print(owner_list)

zipped = zip(dept_list, owner_list)
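One caveat worth checking before relying on zip: it pairs items purely by position and stops at the shorter list, so a dept with two currentowner elements shifts every later pairing. The sketch below uses a reduced, assumed sample (only `dept_id` and `col_four` are kept) to compare positional pairing with pairing each owner to its own parent.

```python
import xml.etree.ElementTree as ET

# Reduced sample (assumed): dept 00001 has two owners, dept 00002 has one.
SAMPLE = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value"/>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00002">
    <owners>
      <currentowner col_four="00002value"/>
    </owners>
  </dept>
</depts>"""

NS = {"ns0": "http://SOMELINK"}
root = ET.fromstring(SAMPLE)

dept_ids = [d.get("dept_id") for d in root.iterfind("ns0:dept", NS)]
owner_vals = [o.get("col_four") for o in root.iterfind(".//ns0:currentowner", NS)]

# Positional pairing: 2 depts vs 3 owners, so the third owner is dropped
# and the second owner of dept 00001 ends up next to dept 00002.
zipped = list(zip(dept_ids, owner_vals))

# Parent-aware pairing: look up owners relative to each dept.
nested = [
    (d.get("dept_id"), o.get("col_four"))
    for d in root.iterfind("ns0:dept", NS)
    for o in d.iterfind("ns0:owners/ns0:currentowner", NS)
]

print(zipped)
print(nested)
```

With this sample, zip yields two pairs and mislabels one owner, while the nested version yields three correctly attributed pairs.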
Recommended answer
The looping can be done in a list comprehension, building a dict for each row by navigating the DOM. The following code goes straight to a data frame.
xml = """<depts xmlns="http://SOMELINK"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
date="2021-01-15">
<dept dept_id="00001"
col_two="00001value"
col_three="00001false">
<owners>
<currentowner col_four="00001value"
col_five="00001value"
col_six="00001false">
<addr col_seven="00001value"
col_eight="00001value"
col_nine="00001false"/>
</currentowner>
</owners>
</dept>
<dept dept_id="00002"
col_two="00002value"
col_three="00002value">
<owners>
<currentowner col_four="00002value"
col_five="00002value"
col_six="00002false">
<addr col_seven="00002value"
col_eight="00002value"
col_nine="00002false"/>
</currentowner>
</owners>
</dept>
</depts>"""
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
root.attrib
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([
    {**d.attrib,
     **d.find("ns0:owners/ns0:currentowner", ns).attrib,
     **d.find("ns0:owners/ns0:currentowner/ns0:addr", ns).attrib}
    for d in root.findall("ns0:dept", ns)
])
Safer version
If any dept had no currentowner or currentowner/addr, calling .attrib would fail, because find returns None when nothing matches. Walk the DOM instead, treating these elements as optional. The dict keys are now built from the element's tag as well as the attribute name. Structure the comprehensions according to your data design: you need to consider 1-to-1, 1-to-optional, and 1-to-many relationships. This really goes back to the papers that Codd wrote in 1970.
import xml.etree.ElementTree as ET
import pandas as pd
root = ET.fromstring(xml)
ns = {'ns0': 'http://SOMELINK'}
pd.DataFrame([
    {**{f"{d.tag.split('}')[1]}.{k}": v for k, v in d.items()},
     **{f"{co.tag.split('}')[1]}.{k}": v for k, v in co.items()},
     **{f"{addr.tag.split('}')[1]}.{k}": v
        for addr in co.findall("ns0:addr", ns)
        for k, v in addr.items()}}
    for d in root.findall("ns0:dept", ns)
    for co in d.findall("ns0:owners/ns0:currentowner", ns)
])
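One thing to be aware of with the comprehension above: a dept with no currentowner produces no row at all, because the inner loop has nothing to iterate over. If such depts should still appear, the same idea can be written with explicit loops. This is only a sketch under assumptions: the sample (including the owner-less dept 00003) is made up, `prefixed` and `dept_rows` are hypothetical helper names, and only the first addr per owner is used. It sticks to the standard library; the resulting list of dicts could then be passed to `pd.DataFrame`.

```python
import xml.etree.ElementTree as ET

# Assumed sample: one owner with an addr, one without, and a dept with no owners.
SAMPLE = """<depts xmlns="http://SOMELINK">
  <dept dept_id="00001">
    <owners>
      <currentowner col_four="00001value">
        <addr col_seven="00001value"/>
      </currentowner>
      <currentowner col_four="00001bvalue"/>
    </owners>
  </dept>
  <dept dept_id="00003"/>
</depts>"""

NS = {"ns0": "http://SOMELINK"}


def prefixed(element):
    """Attributes of one element keyed as 'tag.attribute'; {} if element is None."""
    if element is None:
        return {}
    tag = element.tag.rsplit("}", 1)[-1]  # strip the '{namespace}' part
    return {f"{tag}.{key}": value for key, value in element.items()}


def dept_rows(root):
    """One row per (dept, owner) pair; depts without owners still get one row."""
    rows = []
    for dept in root.findall("ns0:dept", NS):
        # 1-to-many, but optional: fall back to a single placeholder owner.
        owners = dept.findall("ns0:owners/ns0:currentowner", NS) or [None]
        for owner in owners:
            # 1-to-optional: find returns None when the owner has no addr.
            addr = owner.find("ns0:addr", NS) if owner is not None else None
            rows.append({**prefixed(dept), **prefixed(owner), **prefixed(addr)})
    return rows


rows = dept_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

Missing keys simply stay absent from a row, so a data frame built from `rows` would show them as NaN.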