How can I use XPath querying using R's XML library?


Question

The XML file has this snippet:

<?xml version="1.0"?>
<PC-AssayContainer
    xmlns="http://www.ncbi.nlm.nih.gov"
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
    xs:schemaLocation="http://www.ncbi.nlm.nih.gov ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem.xsd"
>
....
    <PC-AnnotatedXRef>
      <PC-AnnotatedXRef_xref>
        <PC-XRefData>
          <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
        </PC-XRefData>
      </PC-AnnotatedXRef_xref>
    </PC-AnnotatedXRef>

I tried to query it using XPath's global search (//) and also tried some namespacing:

> library('XML')
> doc = xmlInternalTreeParse('http://s3.amazonaws.com/tommy_chheng/pubmed/485270.descr.xml')
> xpathApply(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> getNodeSet(doc, "//PC-XRefData_pmid")
list()
attr(,"class")
[1] "XMLNodeSet"
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns="xs")
list()
> xpathApply(doc, "//xs:PC-XRefData_pmid", ns= c(xs = "http://www.w3.org/2001/XMLSchema-instance"))
list()

Shouldn't the XPath match:

<PC-XRefData_pmid>17959251</PC-XRefData_pmid>

Recommended answer

Since the default namespace is the NIH one (whose URI is "http://www.ncbi.nlm.nih.gov"), <PC-XRefData_pmid> (and every other element in your XML document that has no namespace prefix) is in that NIH namespace.

So to match them with an XPath, you need to tell your XPath processor which prefix you're going to use for the NIH namespace, and you need to use that prefix in your XPath.

So, without knowing R, I would try

xpathApply(doc, "//nih:PC-XRefData_pmid",
   ns= c(nih = "http://www.ncbi.nlm.nih.gov"))

or else

getNodeSet(doc, "//*[local-name() = 'PC-XRefData_pmid']")

as the latter bypasses namespaces.

Just because the XML document declares the NIH namespace as the default one doesn't mean that the XPath processor will know that. In the XML information model, namespace prefixes are not significant. So when an XML document is parsed, it's not supposed to matter whether the NIH namespace is bound to the "nih:" prefix, the "snizzlefritz:" prefix, or the "" (default) prefix. The XML parser or XPath processor is not supposed to have to know which prefix was bound to which namespace in the XML document, especially since several different prefixes could be bound to the same namespace at different places in the same document, and vice versa. So if you want your XPath expression to match an element that's in a namespace, you have to declare that namespace to the XPath processor.
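To make the point about prefixes concrete, here is a minimal sketch in R. It assumes the XML package is installed and uses a small inline document as a stand-in for the real PubChem file; the variable names (xml_text, nih_uri, and so on) are made up for illustration. It binds two different prefixes to the same namespace URI and compares both queries with an unprefixed one.

```r
library(XML)

# Hypothetical stand-in for the PubChem file: one default namespace, no prefixes.
xml_text <- '<?xml version="1.0"?>
<PC-AssayContainer xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-XRefData>
    <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
  </PC-XRefData>
</PC-AssayContainer>'

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)
nih_uri <- "http://www.ncbi.nlm.nih.gov"

# Unprefixed name test: XPath 1.0 looks for an element in no namespace.
no_prefix <- getNodeSet(root, "//PC-XRefData_pmid")

# Two arbitrary prefixes bound to the same URI; only the URI has to match.
with_nih   <- getNodeSet(root, "//nih:PC-XRefData_pmid",
                         namespaces = c(nih = nih_uri))
with_other <- getNodeSet(root, "//snizzlefritz:PC-XRefData_pmid",
                         namespaces = c(snizzlefritz = nih_uri))

# Number of nodes matched by each query.
print(c(no_prefix  = length(no_prefix),
        with_nih   = length(with_nih),
        with_other = length(with_other)))
```

The prefix used in the query never has to appear in the document itself; it is only a local alias for the URI.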

There are a few caveats, contributed by @Jim Pivarski:

  • "doc" must be an XML node, not a document (class "XMLNode" or "XMLInternalElementNode", not "XMLDocument" or "XMLInternalDocument").
  • At least in Jim's version (XML_3.93-0), the named parameter is "namespaces", not "ns".

So if "doc" is an instance of a document class, the correct solution is:

xpathApply(xmlRoot(doc), "//nih:PC-XRefData_pmid",
   namespaces = c(nih = "http://www.ncbi.nlm.nih.gov"))
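If the goal is the PMID text rather than the node objects, one possible follow-up is to pass xmlValue as the function argument. This is only a sketch: it assumes the XML package is installed and uses a small inline document in place of the remote file, and the names xml_text and pmids are hypothetical.

```r
library(XML)

# Hypothetical inline sample mirroring the structure of the snippet in the question.
xml_text <- '<?xml version="1.0"?>
<PC-AssayContainer xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-AnnotatedXRef>
    <PC-AnnotatedXRef_xref>
      <PC-XRefData>
        <PC-XRefData_pmid>17959251</PC-XRefData_pmid>
      </PC-XRefData>
    </PC-AnnotatedXRef_xref>
  </PC-AnnotatedXRef>
</PC-AssayContainer>'

doc <- xmlParse(xml_text, asText = TRUE)

# xpathSApply applies xmlValue to every matched node and simplifies the result
# to a character vector; the "nih" prefix is bound to the document's default
# namespace URI via the "namespaces" argument.
pmids <- xpathSApply(xmlRoot(doc), "//nih:PC-XRefData_pmid", xmlValue,
                     namespaces = c(nih = "http://www.ncbi.nlm.nih.gov"))

print(pmids)
```

The values come back as character strings, so as.integer(pmids) may be needed if numeric IDs are wanted.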
