将层次结构(树状)的XML读取到pandas数据框中,以保持层次结构 [英] Read hierarchical (tree-like) XML into a pandas dataframe, preserving hierarchy
问题描述
我有一个XML文档,其中包含分层的树状结构,请参见下面的示例.
I have a XML document that contains a hierarchical, tree-like structure, see the example below.
文档包含几个<Message>
标记(为方便起见,我仅复制了其中一个).
The document contains several <Message>
tags (I only copied one of them for convenience).
每个<Message>
都有自己的一些关联数据(id
,status
,priority
).
Each <Message>
has some associated data (id
, status
, priority
) on its own.
此外,每个<Message>
可以包含一个或多个<Street>
子代,这些子代又具有一些相关数据(<name>
,<length>
).
Besides, each <Message>
can contain one or more <Street>
children which again have some relevant data (<name>
, <length>
).
此外,每个<Street>
可以有一个或多个<Link>
子代,这些子代又具有自己的相关数据(<id>
,<direction>
).
Moreover, each <Street>
can have one or more <Link>
children which again have their own relevant data (<id>
, <direction>
).
示例XML文档:
<?xml version="1.0" encoding="ISO-8859-1"?>
<Root xmlns="someNamespace">
<Messages>
<Message id='12345'>
<status>Active</status>
<priority>Low</priority>
<Area>
<Streets>
<Street>
<name>King Street</name>
<length>Short</length>
<Link>
<id>75838745</id>
<direction>North</direction>
</Link>
<Link>
<id>168745</id>
<direction>South</direction>
</Link>
<Link>
<id>975416</id>
<direction>North</direction>
</Link>
</Street>
<Street>
<name>Queen Street</name>
<length>Long</length>
<Link>
<id>366248</id>
<direction>West</direction>
</Link>
<Link>
<id>745812</id>
<direction>East</direction>
</Link>
</Street>
</Streets>
</Area>
</Message>
</Messages>
</Root>
使用Python解析XML并将相关数据存储在变量中不是问题-例如,我可以使用lxml
库并读取整个文档,然后执行一些xpath
表达式以获取相关字段,或使用iterparse
方法逐行读取.
Parsing the XML with Python and storing the relevant data in variables is not the problem - I can use for example the lxml
library and either read the whole document, then perform some xpath
expressions to get the relevant fields, or read it line by line with the iterparse
method.
但是,我想将数据放入熊猫数据框中,同时保留其中的层次结构.目标是查询单个消息(例如,通过if status == Active then get the Message with all its streets and its streets' links
之类的布尔表达式),并获取属于特定消息(其街道及其街道链接)的所有数据.最好怎么做?
However, I would like to put the data into a pandas dataframe while preserving the hierarchy in it. The goal is to query for single messages (e.g. by Boolean expressions like if status == Active then get the Message with all its streets and its streets' links
) and get all the data that belongs to the specific message (its streets and its streets' links). How would this best be done?
我尝试了不同的方法,但是都遇到了问题.
I tried different approaches but ran into problems with all of them.
如果我为每个包含信息的XML行创建一个数据框行,然后在[MessageID, StreetName, LinkID]
上设置一个MultiIndex,则会得到一个包含很多NaN
的索引(通常不建议这样做),因为MessageID
不会还知道它的子级streets
和links
.此外,我不知道如何通过布尔条件选择某些子数据集,而不是只获取一些没有子项的单行.
If I create one dataframe row for each XML row that contains information and then set a MultiIndex on [MessageID, StreetName, LinkID]
, I get an Index with lots of NaN
in it (which is generally discouraged) because MessageID
does not know its children streets
and links
yet. Besides, I would not know how to select some sub-dataset by Boolean condition instead of only getting some single rows without its children.
在[MessageID, StreetName, LinkID]
上执行GroupBy时,我不知道如何从pandas GroupBy对象取回(可能是MultiIndex)数据帧,因为这里没有要聚合的内容(没有均值/标准差/总和/无论如何,这些值应该保持不变).
When doing a GroupBy on [MessageID, StreetName, LinkID]
, I do not know how to get back a (probably MultiIndex) dataframe from the pandas GroupBy object since there is nothing to aggregate here (no mean/std/sum/whatsoever, the values should stay the same).
有人建议如何有效地处理此问题吗?
Any suggestions how this could be handled efficiently?
推荐答案
我终于设法解决了上述问题,这就是解决方法.
I finally managed to solve the problem as described above and this is how.
我将上述给定的XML文档扩展为包括两条消息,而不是一条消息.这就是有效的Python字符串的样子(当然也可以从文件中加载它):
I extended the above given XML document to include two messages instead of one. This is how it looks as a valid Python string (it could also be loaded from a file of course):
xmlDocument = '''<?xml version="1.0" encoding="ISO-8859-1"?> \
<Root> \
<Messages> \
<Message id='12345'> \
<status>Active</status> \
<priority>Low</priority> \
<Area> \
<Streets> \
<Street> \
<name>King Street</name> \
<length>Short</length> \
<Link> \
<id>75838745</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>168745</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>975416</id> \
<direction>North</direction> \
</Link> \
</Street> \
<Street> \
<name>Queen Street</name> \
<length>Long</length> \
<Link> \
<id>366248</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>745812</id> \
<direction>East</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
<Message id='54321'> \
<status>Inactive</status> \
<priority>High</priority> \
<Area> \
<Streets> \
<Street> \
<name>Princess Street</name> \
<length>Mid</length> \
<Link> \
<id>744154</id> \
<direction>West</direction> \
</Link> \
<Link> \
<id>632214</id> \
<direction>South</direction> \
</Link> \
<Link> \
<id>654785</id> \
<direction>East</direction> \
</Link> \
</Street> \
<Street> \
<name>Prince Street</name> \
<length>Very Long</length> \
<Link> \
<id>1022444</id> \
<direction>North</direction> \
</Link> \
<Link> \
<id>4474558</id> \
<direction>South</direction> \
</Link> \
</Street> \
</Streets> \
</Area> \
</Message> \
</Messages> \
</Root>'''
要将层次结构XML结构解析为一个扁平的pandas数据帧,我使用了Python的ElementTree iterparse
方法,该方法提供了类似于SAX的接口来逐行遍历XML文档,并在特定XML标签开始或结束时触发事件.
To parse the hierarchical XML structure into a flat pandas dataframe, I used Python's ElementTree iterparse
method which provides a SAX-like interface to iterate through a XML document line by line and fire events if specific XML tags start or end.
对于每个解析的XML行,给定的信息都存储在字典中.使用三个字典,每个字典以某种方式属于一起(消息,街道,链接),然后将其存储在其自己的数据帧行中.收集到这样一行的所有信息后,该字典将附加到以适当顺序存储所有行的列表中.
To each parsed XML line, the given information is stored in a dictionary. Three dictionaries are used, one for each set of data that somehow belongs together (message, street, link) and that is to be stored in its own dataframe row later on. When all information to one such row is collected, the dictionary is appended to a list storing all rows in their appropriate order.
这是XML解析的样子(有关更多说明,请参见内联注释):
This is what the XML parsing looks like (see inline comments for further explanation):
# imports
import xml.etree.ElementTree as ET
import pandas as pd
# initialize parsing from Bytes buffer
from io import BytesIO
xmlDocument = BytesIO(xmlDocument.encode('utf-8'))
# initialize dictionaries storing the information to each type of row
messageRow, streetRow, linkRow = {}, {}, {}
# initialize list that stores the single dataframe rows
listOfRows = []
# read the xml file line by line and throw signal when specific tags start or end
for event, element in ET.iterparse(xmlDocument, events=('start', 'end')):
##########
# get all information on the current message and store in the appropriate dictionary
##########
# get current message's id attribute
if event == 'start' and element.tag == 'Message':
messageRow = {} # re-initialize the dictionary for the current row
messageRow['messageId'] = element.get('id')
# get current message's status
if event == 'end' and element.tag == 'status':
messageRow['status'] = element.text
# get current message's priority
if event == 'end' and element.tag == 'priority':
messageRow['priority'] = element.text
# when no more information on the current message is expected, append it to the list of rows
if event == 'end' and element.tag == 'priority':
listOfRows.append(messageRow)
##########
# get all information on the current street and store in row dictionary
##########
if event == 'end' and element.tag == 'name':
streetRow = {} # re-initialize the dictionary for the current street row
streetRow['streetName'] = element.text
if event == 'end' and element.tag == 'length':
streetRow['streetLength'] = element.text
# when no more information on the current street is expected, append it to the list of rows
if event == 'end' and element.tag == 'length':
# link the street to the message it belongs to, then append
streetRow['messageId'] = messageRow['messageId']
listOfRows.append(streetRow)
##########
# get all information on the current link and store in row dictionary
##########
if event == 'end' and element.tag == 'id':
linkRow = {} # re-initialize the dictionary for the current link row
linkRow['linkId'] = element.text
if event == 'end' and element.tag == 'direction':
linkRow['direction'] = element.text
# when no more information on the current link is expected, append it to the list of rows
if event == 'end' and element.tag == 'direction':
# link the link to the message it belongs to, then append
linkRow['messageId'] = messageRow['messageId']
listOfRows.append(linkRow)
listOfRows
现在是词典的列表,每个词典在其中存储要放入一个数据帧行中的信息.可以使用此列表作为数据源创建数据框
listOfRows
is now a list of dictionaries where each dictionary stores the information that is to be put into one dataframe row. Creating a dataframe with this list as datasource can be done with
# create dataframe from list of rows and pass column order (would be random otherwise)
df = pd.DataFrame.from_records(listOfRows, columns=['messageId', 'status', 'priority', 'streetName', 'streetLength', 'linkId', 'direction'])
print(df)
并给出原始"数据框:
messageId status priority streetName streetLength linkId \
0 12345 Active Low NaN NaN NaN
1 12345 NaN NaN King Street Short NaN
2 12345 NaN NaN NaN NaN 75838745
3 12345 NaN NaN NaN NaN 168745
4 12345 NaN NaN NaN NaN 975416
5 12345 NaN NaN Queen Street Long NaN
6 12345 NaN NaN NaN NaN 366248
7 12345 NaN NaN NaN NaN 745812
8 54321 Inactive High NaN NaN NaN
9 54321 NaN NaN Princess Street Mid NaN
10 54321 NaN NaN NaN NaN 744154
11 54321 NaN NaN NaN NaN 632214
12 54321 NaN NaN NaN NaN 654785
13 54321 NaN NaN Prince Street Very Long NaN
14 54321 NaN NaN NaN NaN 1022444
15 54321 NaN NaN NaN NaN 4474558
direction
0 NaN
1 NaN
2 North
3 South
4 North
5 NaN
6 West
7 East
8 NaN
9 NaN
10 West
11 South
12 East
13 NaN
14 North
15 South
我们现在可以将感兴趣的列(messageId,streetName,linkId)设置为该数据帧上的MultiIndex:
We can now se the columns of interest (messageId, streetName, linkId) as MultiIndex on that dataframe:
# set the columns of interest as MultiIndex
df = df.set_index(['messageId', 'streetName', 'linkId'])
print(df)
给出:
status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaN
King Street NaN NaN NaN Short NaN
NaN 75838745 NaN NaN NaN North
168745 NaN NaN NaN South
975416 NaN NaN NaN North
Queen Street NaN NaN NaN Long NaN
NaN 366248 NaN NaN NaN West
745812 NaN NaN NaN East
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
尽管通常不应该考虑在索引中包含NaN
,但在此用例中,我认为它没有任何问题.
Even though having NaN
in an index should be disregarded in general, I don't see any problem with it for this usecase.
最后,要获得通过其messageId
访问单个消息(包括其所有子级"街道和链接)所需的效果,则必须按最外部的索引级别对MultiIndexed数据帧进行分组:
Finally, to get the desired effect of accessing single messages by their messageId
, including all of its "children" streets and links, the MultiIndexed dataframe has to be grouped by the most outer index level:
# group by the most outer index
groups = df.groupby(level='messageId')
现在,例如,您可以使用
Now, you can for example loop over all messages (and do whatever with them) with
# iterate over all groups
for key, group in groups:
print('key: ' + key)
print('group:')
print(group)
print('\n')
返回
key: 12345
group:
status priority streetLength direction
messageId streetName linkId
12345 NaN NaN Active Low NaN NaN
King Street NaN NaN NaN Short NaN
NaN 75838745 NaN NaN NaN North
168745 NaN NaN NaN South
975416 NaN NaN NaN North
Queen Street NaN NaN NaN Long NaN
NaN 366248 NaN NaN NaN West
745812 NaN NaN NaN East
key: 54321
group:
status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
或者您可以通过messageId访问特定的消息,返回包含messageId及其所有专用街道和链接的行:
or you can access specific messages by the messageId, returning the row containing the messageId and also all of its dedicated streets and links:
# get groups by key
print('specific group only:')
print(groups.get_group('54321'))
给予
specific group only:
status priority streetLength direction
messageId streetName linkId
54321 NaN NaN Inactive High NaN NaN
Princess Street NaN NaN NaN Mid NaN
NaN 744154 NaN NaN NaN West
632214 NaN NaN NaN South
654785 NaN NaN NaN East
Prince Street NaN NaN NaN Very Long NaN
NaN 1022444 NaN NaN NaN North
4474558 NaN NaN NaN South
希望这对某人有帮助.
这篇关于将层次结构(树状)的XML读取到pandas数据框中,以保持层次结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!