More direct way to create list of dataframes from XML file?
Question
SDMX (Statistical Data and Metadata Exchange) is an XML grammar that defines a standard for exchanging statistical data. It uses files called Data Structure Definitions (DSD) to convey the structure of a dataset. Among other things, a DSD contains a Codelists node made up of Codelist items, each of which is the parent of Code items; every Code carries an id attribute and a Name child. I am currently trying to parse the Codelists of a DSD file requested from Eurostat's REST interface into a list of dataframes in R, using the following code:
library(XML)
library(RCurl)

# REST resource for the DSD of nama_gdp_c
file <- "http://ec.europa.eu/eurostat/SDMX/diss-web/rest/datastructure/ESTAT/DSD_nama_gdp_c"

# Download the document, parse the XML and get the root node
content <- getURL(file, httpheader = list('User-Agent' = 'R-Agent'))
root <- xmlRoot(xmlInternalTreeParse(content, useInternalNodes = TRUE))

# Get the node set of Codelists and its length
nodes <- getNodeSet(root, "//str:Codelist")
nn <- length(nodes)

# Create nested lists of all codes and names, one entry per Codelist
codelistAll <- lapply(seq(nn), function(i) {
  xpathSApply(root, paste0("//str:Codelist[", i, "]/str:Code"), xmlGetAttr, "id")
})
namelistAll <- lapply(seq(nn), function(i) {
  xpathSApply(root, paste0("//str:Codelist[", i, "]/str:Code/com:Name"), xmlValue)
})

# Create a list of dataframes from the nested lists
alldfList <- lapply(seq(nn), function(i) {
  data.frame(codes = codelistAll[[i]], names = namelistAll[[i]])
})

# Name the list items after the id attribute of each Codelist node
names(alldfList) <- sapply(nodes, xmlGetAttr, "id")
This yields alldfList, the list of dataframes I was looking for.
> str(alldfList)
List of 6
$ CL_FREQ :'data.frame': 6 obs. of 2 variables:
..$ codes: Factor w/ 6 levels "A","D","H","M",..: 2 6 5 1 4 3
..$ names: Factor w/ 6 levels "Annual","Daily",..: 2 6 4 1 3 5
$ CL_GEO :'data.frame': 49 obs. of 2 variables:
..$ codes: Factor w/ 49 levels "AT","BA","BE",..: 22 21 20 10 16 15 14 13 12 11 ...
..$ names: Factor w/ 49 levels "Austria","Belgium",..: 19 18 17 16 15 14 13 12 11 10 ...
While this does the job, I have the feeling that there must be a more straightforward syntax to achieve it. In particular, the use of paste0 and the final assignment of names seem awkward. I have been reading through the documentation of the XML package and I suspect it must be some operation on xmlChildren, but I cannot wrap my head around how to actually do it. Does anyone have a suggestion for a canonical way of doing this operation? Any suggestion would be greatly appreciated.
Recommended answer
You can build the data.frames directly from the nodes by running a relative XPath query on each Codelist node, but you need to pass the namespace explicitly:
ns <- c(str = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure")
alldfList <- lapply(nodes, function(x) {
  data.frame(
    codes = xpathSApply(x, ".//str:Code", xmlGetAttr, "id", namespaces = ns),
    names = xpathSApply(x, ".//str:Code", xmlValue, namespaces = ns)
  )
})
names(alldfList) <- sapply(nodes, xmlGetAttr, "id")
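To try the per-node approach without calling the Eurostat service, here is a minimal sketch that runs the same relative XPath queries against a small inline document. The XML string is a hypothetical, heavily trimmed stand-in for a real DSD response, and the helper name codelist_to_df is made up for illustration. Only the str namespace URI is taken from the answer above; the other URIs in the sample are assumptions.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for a DSD response (not real Eurostat output).
xml_text <- paste0(
  '<mes:Structure',
  ' xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message"',
  ' xmlns:str="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure"',
  ' xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common">',
  '<str:Codelists>',
  '<str:Codelist id="CL_FREQ">',
  '<str:Code id="A"><com:Name>Annual</com:Name></str:Code>',
  '<str:Code id="M"><com:Name>Monthly</com:Name></str:Code>',
  '</str:Codelist>',
  '<str:Codelist id="CL_GEO">',
  '<str:Code id="AT"><com:Name>Austria</com:Name></str:Code>',
  '</str:Codelist>',
  '</str:Codelists>',
  '</mes:Structure>'
)

doc <- xmlParse(xml_text, asText = TRUE)

# Prefix-to-URI mapping used by every XPath query below.
ns <- c(str = "http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure")

# One node per Codelist element.
nodes <- getNodeSet(doc, "//str:Codelist", namespaces = ns)

# Hypothetical helper: turn a single Codelist node into a data.frame.
# The leading "." makes each query relative to the node passed in.
codelist_to_df <- function(node) {
  data.frame(
    codes = xpathSApply(node, ".//str:Code", xmlGetAttr, "id", namespaces = ns),
    names = xpathSApply(node, ".//str:Code", xmlValue, namespaces = ns),
    stringsAsFactors = FALSE
  )
}

alldfList <- lapply(nodes, codelist_to_df)
names(alldfList) <- sapply(nodes, xmlGetAttr, "id")

str(alldfList)
```

Because each query starts from the Codelist node itself, there is no need to build positional XPath strings with paste0. Setting stringsAsFactors = FALSE is a choice made for this sketch, not part of the original answer: it keeps the columns as character vectors rather than the factors shown in the question's output. Note that xmlValue on a Code node returns the text of all of its descendants, so if a Code carries several Name elements (for example one per language), their text would be concatenated.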