Python/R: generate dataframe from XML when not all nodes contain all variables?
Question
Consider the following XML example:
library(xml2)
myxml <- read_xml('
<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>
')
Here I would like to get an (R or Pandas) dataframe from this XML that contains the columns name and hobby.
However, as you can see, there is an alignment problem, because hobby is missing in the second node and John has two hobbies.
In R, I know how to extract specific values one at a time, for instance using xml2 as follows:
myxml%>%
xml_find_all("//name") %>%
xml_text()
myxml%>%
xml_find_all("//hobby") %>%
xml_text()
But how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe like the following (note how John's two hobbies are joined with a |):
# A tibble: 2 × 3
  name   hobby       skill
  <chr>  <chr>       <chr>
1 John   tennis|golf python
2 Robert <NA>        R
In R, I would prefer a solution using xml2 and dplyr. In Python, I want to end up with a Pandas dataframe. Also, my XML contains many more variables that I want to parse. I would like a solution that allows the user to parse additional variables without changing the code too much.
Thanks!
Thanks to everyone for these great solutions. All of them were really nice, with plenty of detail, and it was hard to pick the best one. Thanks again!
Recommended answer
pandas
import pandas as pd
from collections import defaultdict
import xml.etree.ElementTree as ET
xml_txt = """<data>
<obs ID="a">
<name> John </name>
<hobby> tennis </hobby>
<hobby> golf </hobby>
<skill> python </skill>
</obs>
<obs ID="b">
<name> Robert </name>
<skill> R </skill>
</obs>
</data>"""
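etree = ET.fromstring(xml_txt)

def obs2series(o):
    # Collect the text of every child element, grouped by tag name
    d = defaultdict(list)
    # Element.getchildren() was removed in Python 3.9; iterate the element directly
    for c in o:
        d[c.tag].append((c.text or '').strip())
    # Join repeated tags (e.g. two hobbies) into one '|'-separated string
    return pd.Series(d).str.join('|')

pd.DataFrame([obs2series(o) for o in etree.findall('obs')])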
         hobby    name   skill
0  tennis|golf    John  python
1          NaN  Robert       R
How it works
- Build an element tree from the string. If the XML is in a file, do something like et = ET.parse('my_data.xml') instead.
- etree.findall('obs') returns a list of the elements within the XML structure that have the 'obs' tag.
- I pass each of these to obs2series, which constructs a pd.Series.
- Within obs2series, I loop through all child nodes of one 'obs' element.
- defaultdict defaults to a list, meaning I can append to a value even if the key hasn't been seen before.
- I end up with a dictionary of lists. I pass this to pd.Series to get a series of lists.
- Using pd.Series.str.join('|'), I convert this to a series of strings, as I wanted.
- The list comprehension that loops over the observations produces a list of series, ready to be passed to the pd.DataFrame constructor.
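The grouping and joining steps described above can also be sketched with the standard library alone, which makes it easier to see what each 'obs' element turns into before pandas gets involved. This is only an illustration, not the answer's original code: the helper name obs_to_record and its sep parameter are assumptions introduced here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_TXT = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""


def obs_to_record(obs, sep="|"):
    """Hypothetical helper: group child texts of one <obs> by tag, join repeats."""
    grouped = defaultdict(list)
    for child in obs:  # iterating an Element yields its direct children
        # child.text is None for an empty tag, so fall back to ""
        grouped[child.tag].append((child.text or "").strip())
    # One string per tag; tags that never occur are simply absent from the dict
    return {tag: sep.join(texts) for tag, texts in grouped.items()}


root = ET.fromstring(XML_TXT)
records = [obs_to_record(obs) for obs in root.findall("obs")]
for record in records:
    print(record)
```

Because each record is a plain dict, pd.DataFrame(records) should build a frame with the same values, filling NaN wherever a key such as hobby is absent. Any additional child tags in the XML become extra columns without changes to the helper.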