Efficiently transform XML to data frame

Question

I need to transform some vanilla XML into a data frame. The XML is a simple representation of rectangular data (see the example below). I can do this fairly easily in R with xml2 and a couple of for loops. However, I'm sure there is a much better and faster way (purrr?). The XML files I will ultimately be working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.

library(tidyverse)
library(xml2)

demo_xml <- 
"<DEMO>
  <EPISODE>
    <item1>A</item1>
    <item2>1</item2>
  </EPISODE>
  <EPISODE>
    <item1>B</item1>
    <item2>2</item2>
  </EPISODE>
</DEMO>"


dx <- read_xml(demo_xml)

episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))

df <- data.frame()

for(i in seq_along(episodes)) {
  for(j in seq_along(dx_names)) {
    df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
  }
}

names(df) <- dx_names
df
#>   item1 item2
#> 1     A     1
#> 2     B     2

Created on 2019-09-19 by the reprex package (v0.3.0)

Thanks in advance.

Recommended answer

This is a general solution that handles a varying number of different sub-nodes under each parent node, so each EPISODE node may have different children.
The strategy extracts the name and value of every child node, collects them into a long-format data frame keyed by episode, and then reshapes that into the wide format you want:

library(tidyr)
library(xml2)

demo_xml <- 
  "<DEMO>
  <EPISODE>
    <item1>A</item1>
    <item2>1</item2>
  </EPISODE>
  <EPISODE>
    <item1>B</item1>
    <item2>2</item2>
  </EPISODE>
</DEMO>"

dx <- read_xml(demo_xml)

#find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
#extract the node names and values from all of the episodes
nodenames<-xml_name(xml_children(episodes))
contents<-trimws(xml_text(xml_children(episodes)))

#Identify the number of subnodes under each episode for labeling
#xml_length() returns the number of child elements of each episode
IDlist<-rep(seq_along(episodes), xml_length(episodes))

#make a long dataframe
df<-data.frame(episodes=IDlist, nodenames, contents, stringsAsFactors = FALSE)

#make the dataframe wide, Remove unused blank nodes:
answer <- spread(df[df$contents!="",], nodenames, contents)

#tidyr 1.0.0 and later: pivot_wider() supersedes spread()
#answer <- pivot_wider(df[df$contents!="",], names_from = nodenames, values_from = contents)


# A tibble: 2 x 3
  episodes item1 item2
     <int> <chr> <chr>
1        1 A     1    
2        2 B     2  
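
The demo input has the same two children under every EPISODE, so it does not exercise the "varying number of sub-nodes" case. The sketch below tries the same long-then-wide strategy on a made-up ragged document. The input `ragged_xml` and the names `long` and `wide` are hypothetical, and base R's `reshape()` is swapped in for the tidyr call only to keep the example dependent on xml2 alone; treat it as an illustration under those assumptions, not as the answer's exact code.

```r
library(xml2)

# Hypothetical input: the second EPISODE has an extra child (item3)
# and the third EPISODE has no item2.
ragged_xml <- "<DEMO>
  <EPISODE><item1>A</item1><item2>1</item2></EPISODE>
  <EPISODE><item1>B</item1><item2>2</item2><item3>x</item3></EPISODE>
  <EPISODE><item1>C</item1></EPISODE>
</DEMO>"

doc      <- read_xml(ragged_xml)
episodes <- xml_find_all(doc, "//EPISODE")
children <- xml_children(episodes)

# Long format: one row per child node. xml_length() gives the number of
# child elements of each episode, so each episode index is repeated once
# per child.
long <- data.frame(
  episode = rep(seq_along(episodes), xml_length(episodes)),
  name    = xml_name(children),
  value   = trimws(xml_text(children)),
  stringsAsFactors = FALSE
)

# Wide format: one row per episode, one column per child name.
# Children that an episode lacks come out as NA.
wide <- reshape(long, idvar = "episode", timevar = "name", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

Because the per-child extraction is vectorised over the whole node set, the only per-episode work is the call to `xml_length()`, which is the part the nested for loops in the question were doing cell by cell.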
