R:将XML数据转换为数据帧 [英] R: convert XML data to data frame
问题描述
对于一项家庭作业,我试图将XML文件转换为R中的数据帧。我尝试了许多不同的方法,并且在Internet上搜索了一些想法,但均未成功。到目前为止,这是我的代码:
For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet but have been unsuccessful. Here is my code so far:
library(XML)
url <- 'http://www.ggobi.org/book/data/olive.xml'
doc <- xmlParse(myUrl)
root <- xmlRoot(doc)
dataFrame <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
data.frame(t(dataFrame),row.names=NULL)
我得到的输出就像一个巨大的数字向量。我正在尝试将数据组织到一个数据框中,但是我不知道如何适当地调整代码来获取它。
The output I get is like a giant vector of numbers. I am attempting to organize the data into a data frame, but I do not know how to properly adjust my code to obtain that.
推荐答案
它可能不像 XML
包那样 verbose ,但是 xml2
却没有内存泄漏,并且专注于数据提取。我用的是 trimws
,这是R核心最近真正添加的东西。
It may not be as verbose as the XML
package but xml2
doesn't have the memory leaks and is laser-focused on data extraction. I use trimws
which is a really recent addition to R core.
library(xml2)
pg <- read_xml("http://www.ggobi.org/book/data/olive.xml")
# get all the <record>s
recs <- xml_find_all(pg, "//record")
# extract and clean all the columns
vals <- trimws(xml_text(recs))
# extract and clean (if needed) the area names
labs <- trimws(xml_attr(recs, "label"))
# mine the column names from the two variable descriptions
# this XPath construct lets us grab either the <categ…> or <real…> tags
# and then grabs the 'name' attribute of them
cols <- xml_attr(xml_find_all(pg, "//data/variables/*[self::categoricalvariable or
self::realvariable]"), "name")
# this converts each set of <record> columns to a data frame
# after first converting each row to numeric and assigning
# names to each column (making it easier to do the matrix to data frame conv)
dat <- do.call(rbind, lapply(strsplit(vals, "\ +"),
function(x) {
data.frame(rbind(setNames(as.numeric(x),cols)))
}))
# then assign the area name column to the data frame
dat$area_name <- labs
head(dat)
## region area palmitic palmitoleic stearic oleic linoleic linolenic
## 1 1 1 1075 75 226 7823 672 NA
## 2 1 1 1088 73 224 7709 781 31
## 3 1 1 911 54 246 8113 549 31
## 4 1 1 966 57 240 7952 619 50
## 5 1 1 1051 67 259 7771 672 50
## 6 1 1 911 49 268 7924 678 51
## arachidic eicosenoic area_name
## 1 60 29 North-Apulia
## 2 61 29 North-Apulia
## 3 63 29 North-Apulia
## 4 78 35 North-Apulia
## 5 80 46 North-Apulia
## 6 70 44 North-Apulia
更新
我会尽力做到这一点
library(tidyverse)
strsplit(vals, "[[:space:]]+") %>%
map_df(~as_data_frame(as.list(setNames(., cols)))) %>%
mutate(area_name=labs)
这篇关于R:将XML数据转换为数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!