Parsing XML in R: Incorrect namespaces


Question

I have a bunch of XML files and an R script that reads their content into a data frame. However, I now have some files that I wanted to parse as usual, but something in their namespace definition prevents me from selecting their values with XPath expressions.

The XML files look like this:

xml_nons.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML>
   <Node>
      <Name>Name 1</Name>
      <Title>Title 1</Title>
      <Date>2015</Date>
   </Node>
</XML>

And the other one:

xml_ns.xml

<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>

The URL that xmlns points to does not exist.

The R code I use is like this:

library(XML)

xmlfiles <- list.files(path = ".",
                       pattern = "*.xml$",
                       full.names = TRUE,
                       recursive = TRUE)

n <- length(xmlfiles)
dat <- vector("list", n)

for (i in 1:n) {
  doc <- xmlTreeParse(xmlfiles[i], useInternalNodes = TRUE)
  nodes <- getNodeSet(doc, "//XML")
  x <- lapply(nodes, function(x) {
    data.frame(
      Filename = xmlfiles[i],
      Name = xpathSApply(x, ".//Node/Name", xmlValue),
      Title = xpathSApply(x, ".//Node/Title", xmlValue),
      Date = xpathSApply(x, ".//Node/Date", xmlValue)
    )
  })
  dat[[i]] <- do.call("rbind", x)
}

xml <- do.call("rbind", dat)
xml

However, what I get as a result is:

Filename            Name    Title    Date
./xml_nons.xml      Name 1  Title 1  2015

If I remove the namespace declaration from the second file, I get the correct result:

Filename            Name    Title    Date
./xml_nons_1.xml    Name 1  Title 1  2015
./xml_ns_1.xml      Name 2  Title 2  2014

Of course, I could use an XSL transformation to remove those namespaces from the original XML files, but I would like a solution that works within R. Is there some way to tell R to simply ignore the namespace declaration in the XML?

Recommended answer

I think there is no easy way to ignore the namespaces. The best way is to learn to live with them. This answer uses the newer xml2 package, but the same applies to a solution based on the XML package.
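For the XML package used in the question, a minimal sketch of the same idea might look like the following. It assumes the content of xml_ns.xml is inlined as a string so the example is self-contained, and the prefix `d` is an arbitrary choice made here for illustration; only the namespace URI has to match the document.

```r
library(XML)

# Content of xml_ns.xml from the question, inlined so the sketch is self-contained.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>'

doc <- xmlParse(xml_string, asText = TRUE)

# Bind a prefix of our own choosing ("d") to the document's default namespace URI.
ns <- c(d = "http://www.nonexistingsite.com")

# Every element step in the XPath expression carries the prefix.
name  <- xpathSApply(doc, "//d:Node/d:Name",  xmlValue, namespaces = ns)
title <- xpathSApply(doc, "//d:Node/d:Title", xmlValue, namespaces = ns)
date  <- xpathSApply(doc, "//d:Node/d:Date",  xmlValue, namespaces = ns)
```

The unprefixed expression `//Node/Name` matches nothing in this document, because in XPath 1.0 an unprefixed name only matches elements that are in no namespace.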

With xml2, start by reading the file and inspecting its namespaces:

library(xml2)
fname <- 'myfile.xml'
doc <- read_xml(fname)
# peek at the namespaces
xml_ns(doc)

The first default namespace is assigned the prefix d1. If your XPath does not find what you want, the most likely cause is a namespace issue.

xpath <-  "//d1:FormDef"
ns <- xml_find_all(doc,xpath, xml_ns(doc))
ns
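Applied to the file from the question, the same pattern could be sketched as follows. The XML is inlined as a string here so that the example runs on its own, and the variable names are illustrative rather than taken from the original answer.

```r
library(xml2)

# Content of xml_ns.xml from the question, inlined so the sketch is self-contained.
xml_string <- '<?xml version="1.0" encoding="UTF-8"?>
<XML xmlns="http://www.nonexistingsite.com">
   <Node>
      <Name>Name 2</Name>
      <Title>Title 2</Title>
      <Date>2014</Date>
   </Node>
</XML>'

doc <- read_xml(xml_string)
ns <- xml_ns(doc)  # the default namespace is exposed under the prefix d1

# One row per <Node>; each child lookup also needs the d1 prefix.
nodes <- xml_find_all(doc, "//d1:Node", ns)
result <- data.frame(
  Name  = xml_text(xml_find_first(nodes, "./d1:Name",  ns)),
  Title = xml_text(xml_find_first(nodes, "./d1:Title", ns)),
  Date  = xml_text(xml_find_first(nodes, "./d1:Date",  ns)),
  stringsAsFactors = FALSE
)
```

Using `xml_find_first()` per node keeps the columns aligned even if one `Node` is missing a child, since it returns one entry per input node.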

Also, you have to do this for every element in the path. So, to save typing, you can do:

library(stringr)
> xpath <-  "/ODM/Study"
> (xpath<-str_replace_all(xpath,'/','/d1:'))
[1] "/d1:ODM/d1:Study"
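Note that a plain replacement of every `/` works for simple absolute paths like the one above, but it would also rewrite `//` and attribute steps. If the goal is to run the same unprefixed XPath against files with and without a default namespace, one alternative is to strip the default namespace after parsing. The sketch below assumes xml2 1.2.0 or later, which provides `xml_ns_strip()`; the helper `read_names()` is a hypothetical name introduced only for this example.

```r
library(xml2)

# Hypothetical helper: parse XML text and return the text of every Node/Name,
# whether or not the document declares a default namespace.
read_names <- function(xml_string) {
  doc <- read_xml(xml_string)
  # xml_ns() lists a default namespace under the prefix d1; strip it only if present.
  if ("d1" %in% names(xml_ns(doc))) {
    xml_ns_strip(doc)  # modifies doc in place
  }
  xml_text(xml_find_all(doc, "//Node/Name"))
}

# The two documents from the question, inlined for a self-contained example.
with_ns <- '<XML xmlns="http://www.nonexistingsite.com">
   <Node><Name>Name 2</Name><Title>Title 2</Title><Date>2014</Date></Node>
</XML>'
without_ns <- '<XML>
   <Node><Name>Name 1</Name><Title>Title 1</Title><Date>2015</Date></Node>
</XML>'

names_ns   <- read_names(with_ns)
names_nons <- read_names(without_ns)
```

Stripping is convenient for mixed batches of files, but it discards information, so it is only appropriate when element names are unambiguous without their namespace.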
