Python/R: generate dataframe from XML when not all nodes contain all variables?


Question

Consider the following XML example:

library(xml2)

myxml <- read_xml('
<data>
  <obs ID="a">
  <name> John </name>
  <hobby> tennis </hobby>
  <hobby> golf </hobby>
  <skill> python  </skill>
  </obs>
  <obs ID="b">
  <name> Robert </name>
  <skill> R </skill>
  </obs>
  </data>
')

Here I would like to get an (R or pandas) dataframe from this XML that contains the columns name and hobby.

However, as you can see, there is an alignment problem, because hobby is missing in the second node and John has two hobbies.

In R, I know how to extract specific values one at a time, for instance using xml2 as follows:

myxml%>% 
  xml_find_all("//name") %>% 
  xml_text()

myxml%>% 
  xml_find_all("//hobby") %>% 
  xml_text()
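
For illustration, here is a rough Python equivalent of the two flat queries above, using only the standard library. It is a sketch meant to show the alignment problem, not a solution: each query returns a flat list, so nothing records which obs a value came from. The variable names (xml_txt, root, names, hobbies) are chosen for this example.

```python
import xml.etree.ElementTree as ET

xml_txt = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""

root = ET.fromstring(xml_txt)

# Document-wide queries return flat lists with no link back to the parent <obs>.
names = [e.text.strip() for e in root.iter("name")]
hobbies = [e.text.strip() for e in root.iter("hobby")]

print(names)    # ['John', 'Robert']
print(hobbies)  # ['tennis', 'golf']
```

Both lists happen to have two entries, but putting them side by side would wrongly assign golf to Robert, which is why the values need to be grouped per obs.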

But how can I align this data correctly in a dataframe? That is, how can I obtain a dataframe like the following (note how John's two hobbies are joined with a |):

# A tibble: 2 × 3
    name           hobby            skill
   <chr>           <chr>            <chr>
1   John          tennis|golf       python
2 Robert            <NA>            R

In R, I would prefer a solution using xml2 and dplyr. In Python, I want to end up with a pandas dataframe. Also, my XML contains many more variables that I want to parse, so I would like a solution that lets the user parse additional variables without changing much of the code.

Thanks!

Thanks to everyone for these great solutions. All of them were really nice, with plenty of detail, and it was hard to pick the best one. Thanks again!

Recommended answer

pandas

import pandas as pd
from collections import defaultdict
import xml.etree.ElementTree as ET


xml_txt = """<data>
  <obs ID="a">
  <name> John </name>
  <hobby> tennis </hobby>
  <hobby> golf </hobby>
  <skill> python  </skill>
  </obs>
  <obs ID="b">
  <name> Robert </name>
  <skill> R </skill>
  </obs>
  </data>"""

etree = ET.fromstring(xml_txt)

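def obs2series(o):
    # Collect the text of every child element, grouped by tag name.
    d = defaultdict(list)
    for c in o:  # Element.getchildren() was removed in Python 3.9
        d[c.tag].append((c.text or '').strip())
    # Join repeated tags (e.g. two <hobby> entries) with '|'.
    return pd.Series(d).str.join('|')

pd.DataFrame([obs2series(o) for o in etree.findall('obs')])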

         hobby    name   skill
0  tennis|golf    John  python
1          NaN  Robert       R


How it works

  • Build an element tree from the string. To read from a file instead, do something like etree = ET.parse('my_data.xml').getroot()
  • etree.findall('obs') returns a list of the elements in the XML structure that are 'obs' tags.
  • I pass each of these to obs2series, which builds a pd.Series.
  • Within obs2series, I loop through all child nodes of one 'obs' element.
  • defaultdict defaults to a list, meaning I can append to a value even if the key hasn't been seen before.
  • I end up with a dictionary of lists. I pass this to pd.Series to get a series of lists.
  • Using pd.Series.str.join('|'), I convert this to a series of strings, as I wanted.
  • The list comprehension that loops over the observations therefore yields a list of series, ready to be passed to the pd.DataFrame constructor.
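
The grouping and joining steps above can also be sketched without pandas, which makes it easier to see what each observation turns into. This is a minimal variant under stated assumptions, not the answer's exact code: obs_to_record and its sep parameter are hypothetical names introduced here, and it builds plain dictionaries instead of pd.Series objects.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

xml_txt = """<data>
  <obs ID="a">
    <name> John </name>
    <hobby> tennis </hobby>
    <hobby> golf </hobby>
    <skill> python </skill>
  </obs>
  <obs ID="b">
    <name> Robert </name>
    <skill> R </skill>
  </obs>
</data>"""


def obs_to_record(obs, sep="|"):
    """Return {tag: joined text} for one <obs> element (hypothetical helper)."""
    values = defaultdict(list)
    for child in obs:
        # A missing key starts as an empty list, so repeated tags accumulate.
        values[child.tag].append((child.text or "").strip())
    return {tag: sep.join(texts) for tag, texts in values.items()}


root = ET.fromstring(xml_txt)
records = [obs_to_record(obs) for obs in root.findall("obs")]

for record in records:
    print(record)
# {'name': 'John', 'hobby': 'tennis|golf', 'skill': 'python'}
# {'name': 'Robert', 'skill': 'R'}
```

Passing records to pd.DataFrame(records) should give one row per obs, with NaN where a key such as hobby is missing. Because the helper never names a tag explicitly, additional variables in the XML become additional columns without any change to the code.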
