R - How to convert XML to a data frame in R with the correct structure?


Problem description

I would like to convert an XML file into a data frame. I have found some functions that let me read the XML data, but I cannot get a data frame with the same structure as the original XML file (that is, the structure you would get by opening the XML file in Excel).

This is my original XML:

<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>

This already gives me a list of the data:

library(XML)

# Parse the XML file into an internal document
data <- xmlTreeParse(file = "c:/R/CL/filename.xml", useInternalNodes = TRUE)
# Convert the parsed document to a nested list
xl <- xmlToList(data)

Ideally I would like a data frame built from this XML that looks the same as when the XML is imported into Excel. However, the output of xl is organized into separate Object and Time entries. When I open the XML file in Excel, this information is linked: every Object row also has columns holding the time information.

This is the output of xl <- xmlToList(data):

$Frame$Object
     type         DISTANCE         VOLUME        id       code 
"Taxi"    "3037"    "1668"   "15593"       "0" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3605"   "931" "15603"     "4" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Bus"  "3563"   "488" "15604"     "9" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "2161"  "1592" "15615"    "21" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4942"    "57" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "784"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3301"  "2041" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Bus"  "2040"  "2945" "15630"    "27" 


$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.902"                "75299902"

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4941"    "51" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "789"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3300"  "2069" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Bus"  "2027"  "2947" "15630"    "27" 

$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.771"                "75299771"

This list contains two table structures: Frame$Object and Frame$Time. I would like to combine them into a single table by repeating the timestamp and timecode columns for every Object.

The desired output is shown below (the same structure you would get by importing the XML file into Excel):

type      DISTANCE  VOLUME  id     code  z  timestamp                timecode
Taxi      3037      1668    15593  0        17/09/2014 20:54:59.902  75299902
Taxi      3605      931     15603  4        17/09/2014 20:54:59.902  75299902
Bus       3563      488     15604  9        17/09/2014 20:54:59.900  75299902
Taxi      4942      57      15624  1        17/09/2014 20:54:59.900  75299902
Taxi      784       47      15625  10       17/09/2014 20:54:59.900  75299902
Taxi      3301      2041    15626  42       17/09/2014 20:54:59.900  75299902
Bus       2040      2945    15630  27       17/09/2014 20:54:59.900  75299902
Airplane  2865      2722                 0  17/09/2014 20:54:59.900  75299902
Taxi      4941      51      15624  1        17/09/2014 20:54:59.771  75299771
Taxi      789       47      15625  10       17/09/2014 20:54:59.771  75299771
Taxi      3300      2069    15626  42       17/09/2014 20:54:59.771  75299771
Bus       2027      2947    15630  27       17/09/2014 20:54:59.771  75299771
Airplane  2865      2722                 0  17/09/2014 20:54:59.771  75299771

Which functions would achieve this result? Thank you in advance for your help!

Recommended answer

You can use xml2 and dplyr for a quick conversion:

library(xml2)
library(dplyr)

dat <- "<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"

doc <- read_xml(dat)

# bind the data.frames built in the iterator together
bind_rows(lapply(xml_find_all(doc, "//Frame"), function(x) {

  # extract the attributes from the parent tag as a data.frame
  parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)

  # make a data.frame out of the attributes of the kids
  kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))

  # combine them
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)

}))

## Source: local data frame [13 x 8]
## 
##                   timestamp timecode     type DISTANCE VOLUME    id  code     Z
##                       (chr)    (chr)    (chr)    (chr)  (chr) (chr) (chr) (chr)
## 1  17/09/2014  20:55:00.902 75299902     Taxi     3037   1668 15593     0    NA
## 2  17/09/2014  20:55:00.902 75299902     Taxi     3605    931 15603     4    NA
## 3  17/09/2014  20:55:00.902 75299902      Bus     3563    488 15604     9    NA
## 4  17/09/2014  20:55:00.902 75299902     Taxi     4942     57 15624     1    NA
## 5  17/09/2014  20:55:00.902 75299902     Taxi      784     47 15625    10    NA
## 6  17/09/2014  20:55:00.902 75299902     Taxi     3301   2041 15626    42    NA
## 7  17/09/2014  20:55:00.902 75299902      Bus     2040   2945 15630    27    NA
## 8  17/09/2014  20:55:00.902 75299902 Airplane     2865   2722    NA    NA     0
## 9   17/09/2014 20:54:59.771 75299771     Taxi     4941     51 15624     1    NA
## 10  17/09/2014 20:54:59.771 75299771     Taxi      789     47 15625    10    NA
## 11  17/09/2014 20:54:59.771 75299771     Taxi     3300   2069 15626    42    NA
## 12  17/09/2014 20:54:59.771 75299771      Bus     2027   2947 15630    27    NA
## 13  17/09/2014 20:54:59.771 75299771 Airplane     2865   2722    NA    NA     0

Every column comes back as character, so you'll need to convert the types as necessary.
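As a rough illustration of that conversion step, the sketch below uses base R only. The data frame `res` is a hypothetical two-row stand-in for the result above, and the `UTC` time zone is an assumption, since the source data does not say which zone the timestamps are in.

```r
# Hypothetical stand-in for the all-character result shown above
res <- data.frame(
  timestamp = c("17/09/2014  20:55:00.902", "17/09/2014 20:54:59.771"),
  timecode  = c("75299902", "75299771"),
  type      = c("Taxi", "Airplane"),
  DISTANCE  = c("3037", "2865"),
  VOLUME    = c("1668", "2722"),
  id        = c("15593", NA),
  code      = c("0", NA),
  Z         = c(NA, "0"),
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type it fits;
# as.is = TRUE keeps non-numeric text as character rather than factor
res[] <- lapply(res, type.convert, as.is = TRUE)

# Parse the timestamp separately: %OS reads fractional seconds, and
# whitespace in the format matches any run of spaces in the input.
# tz = "UTC" is an assumption, not something the data states.
res$timestamp <- as.POSIXct(res$timestamp,
                            format = "%d/%m/%Y %H:%M:%OS", tz = "UTC")

str(res)
```

Looping over the columns with `lapply()` keeps the sketch independent of the R version; if a numeric-looking column such as `id` should stay as text, convert only the columns you need instead.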

You can do something similar if you're stuck with the XML package:

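library(XML)
library(dplyr)

# asText = TRUE states explicitly that dat holds XML content, not a file path
doc <- xmlParse(dat, asText = TRUE)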

bind_rows(xpathApply(doc, "//Frame", function(x) {
  parent <- data.frame(as.list(xmlAttrs(x)), stringsAsFactors=FALSE)
  kids <- bind_rows(lapply(xmlChildren(x), function(x) as.list(xmlAttrs(x))))
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))
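For comparison, here is one possible variant that selects the Object nodes directly and looks up each node's parent Frame, so no per-frame data frames need to be bound together. It is only a sketch, not part of the original answer: the names `small`, `objs`, `cols`, `flat` and `parent_attr` are made up for illustration, and the input is a shortened copy of the sample data.

```r
library(xml2)

# Shortened copy of the sample data so the sketch runs on its own
small <- "<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902'>
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771'>
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
</Frame>
</Data>"

small_doc <- read_xml(small)
objs <- xml_find_all(small_doc, "//Frame/Object")

# Every attribute name that occurs on any Object, in first-seen order
cols <- unique(unlist(lapply(xml_attrs(objs), names)))

# One column per attribute; xml_attr() gives NA where it is missing
flat <- as.data.frame(
  lapply(setNames(cols, cols), function(a) xml_attr(objs, a)),
  stringsAsFactors = FALSE
)

# Look up the parent Frame node by node, so that each Object row
# gets its own frame's attribute value
parent_attr <- function(name) {
  vapply(objs, function(o) xml_attr(xml_parent(o), name), character(1))
}
flat$timestamp <- parent_attr("timestamp")
flat$timecode  <- parent_attr("timecode")

flat
```

The parent lookup is done one node at a time on purpose, so the result always has exactly one value per Object row. As with the versions above, all columns are character and still need type conversion.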
