Parsing XML in R: Incorrect namespaces
Problem description
I have a bunch of XML files and an R script that reads their content into a data frame. However, I have now received files that I wanted to parse as usual, but something in their namespace definition prevents me from selecting their values with my normal XPath expressions.
The XML files look like this:
xml_nons.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML>
  <Node>
    <Name>Name 1</Name>
    <Title>Title 1</Title>
    <Date>2015</Date>
  </Node>
</XML>
And the other one:
xml_ns.xml
<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>
The URL that xmlns points to doesn't exist.
The R code I use is like this:
library(XML)

# Collect every .xml file below the working directory
# (pattern is a regular expression, not a glob)
xmlfiles <- list.files(path = ".",
                       pattern = "\\.xml$",
                       full.names = TRUE,
                       recursive = TRUE)
n <- length(xmlfiles)
dat <- vector("list", n)

for (i in seq_len(n)) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  # Build one data frame per matched root node
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name     = xpathSApply(x, ".//Node/Name",  xmlValue),
      Title    = xpathSApply(x, ".//Node/Title", xmlValue),
      Date     = xpathSApply(x, ".//Node/Date",  xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}
xml <- do.call("rbind", dat)
xml
However, what I get as a result is:
Filename Name Title Date
./xml_nons.xml Name 1 Title 1 2015
If I remove the namespace declaration from the second file, I get the correct result:
Filename Name Title Date
./xml_nons_1.xml Name 1 Title 1 2015
./xml_ns_1.xml Name 2 Title 2 2014
Of course I could use an XSL transformation to strip those namespaces from the original XML files, but I would prefer a solution that works within R. Is there some way to tell R to simply ignore everything in the XML declaration?
Recommended answer
I don't think there is an easy way to ignore the namespaces; the best approach is to learn to live with them. This answer uses the newer xml2 package, but the same applies to a solution based on the XML package.
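For completeness, if dropping the namespace really is what you want, more recent xml2 releases (1.2.0 and later, as far as I know) provide xml_ns_strip(), which removes default namespace declarations from a parsed document in place. The sketch below is only an illustration: it inlines the body of xml_ns.xml from the question as a string so it is self-contained, and the variable names are made up for the example.

```r
library(xml2)

# Inline copy of the question's xml_ns.xml body (assumption: the real file
# would be read with read_xml("xml_ns.xml") instead)
doc <- read_xml('<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>')

# Unprefixed XPath finds nothing while the default namespace is present
before_strip <- length(xml_find_all(doc, "//Node/Name"))

# Remove the default namespace declaration; this modifies doc in place
xml_ns_strip(doc)

# The same unprefixed XPath now matches
names_found <- xml_text(xml_find_all(doc, "//Node/Name"))
```

This changes the in-memory document only, not the file on disk. It is convenient for one-off extraction, but if different namespaces carry meaning in your data, keeping them and using prefixes as described in the rest of this answer is the safer route.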
Reading the file and inspecting its namespaces:
library(xml2)
fname <- 'myfile.xml'
doc <- read_xml(fname)
# peek at the namespaces
xml_ns(doc)
xml_ns() assigns the prefix d1 to the first (default) namespace. If your XPath does not find what you expect, a namespace mismatch is the most likely cause: in XPath 1.0 an unprefixed name only matches elements that are in no namespace.
xpath <- "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns
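Applied to the file from the question, the same idea might look like the sketch below. It assumes the body of xml_ns.xml is inlined as a string so the example runs on its own; node_values() is a hypothetical helper introduced only for illustration.

```r
library(xml2)

# Inline copy of the question's xml_ns.xml body (assumption: normally this
# would be read_xml("xml_ns.xml"))
doc <- read_xml('<XML xmlns="http://www.nonexistingsite.com">
  <Node>
    <Name>Name 2</Name>
    <Title>Title 2</Title>
    <Date>2014</Date>
  </Node>
</XML>')

# The default namespace is exposed under the prefix d1
ns <- xml_ns(doc)

# Hypothetical helper: text of every <tag> child of a <Node>,
# with the d1 prefix on each step of the path
node_values <- function(doc, tag, ns) {
  xml_text(xml_find_all(doc, paste0("//d1:Node/d1:", tag), ns))
}

result <- data.frame(
  Name  = node_values(doc, "Name", ns),
  Title = node_values(doc, "Title", ns),
  Date  = node_values(doc, "Date", ns)
)
result
```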
Also, you have to add the prefix to every element in the path. To save typing, you can do:
> library(stringr)
> xpath <- "/ODM/Study"
> (xpath <- str_replace_all(xpath, '/', '/d1:'))
[1] "/d1:ODM/d1:Study"
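Note that replacing every slash only works for simple absolute paths: a path such as //Node/Name would end up with a prefix between the two slashes. A slightly more careful variant, using only base R, is sketched below. add_prefix() is a hypothetical helper, not part of any package, and it still does not touch names inside predicates such as [Name='x'].

```r
# Hypothetical helper: put a namespace prefix in front of each element name
# that directly follows one or more slashes. Names that already carry a
# prefix (followed by ':') and node tests such as text() (followed by '(')
# are left alone thanks to the negative lookahead.
add_prefix <- function(xpath, prefix = "d1") {
  gsub("(/+)([A-Za-z_][A-Za-z0-9_.-]*)(?![A-Za-z0-9_.-]|:|\\()",
       paste0("\\1", prefix, ":\\2"),
       xpath,
       perl = TRUE)
}

p1 <- add_prefix("/ODM/Study")
p2 <- add_prefix("//Node/Name")
p3 <- add_prefix("//Node/text()")
p4 <- add_prefix("/d1:ODM/Study")
```

Here p1 is the same result as the stringr one-liner, while p2 keeps the leading // intact, p3 leaves text() unprefixed, and p4 does not double the existing prefix.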