Select default namespace in XPath with HtmlUnit


Question



I want to parse a Feedburner feed with HtmlUnit. The feed is this one: http://feeds.feedburner.com/alcoanewsreleases

From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

Groovy code snippet:

def page = webClient.getPage("http://feeds.feedburner.com/alcoanewsreleases")
def elements = page.getByXPath("//item")

Sample of the XML feed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss1full.xsl"?>
<?xml-stylesheet type="text/css" media="screen" href="http://feeds.feedburner.com/~d/styles/itemcontent.css"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">

[...SNIP...]

<item rdf:about="http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2011&amp;pageID=20110518006002en">
    <title>Chris L. Ayers Named President, Alcoa Global Primary Products</title>
    <dc:date>2011-05-18</dc:date>
    <link>http://feedproxy.google.com/~r/alcoanewsreleases/~3/PawvdhpJrkc/news_detail.asp</link>
    <description>NEW YORK--(BUSINESS WIRE)--Alcoa (NYSE:AA) announced today that Chris L. Ayers has been named President of Alcoa’s Global Primary Products (GPP) business, effective May 18, 2011. Ayers, previously Chief Operating Officer of GPP, succeeds John Thuestad, who will be handling special projects for the Company. Ayers joined Alcoa in February 2010 as Chief Operating Officer of Alcoa Cast, Forged and Extruded Products, a new position. He was elected a Vice President of Alcoa in April 2010 and Executive</description>
    <feedburner:origLink xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">http://www.alcoa.com/global/en/news/news_detail.asp?newsYear=2010&amp;pageID=20100104006194en</feedburner:origLink>
</item>

[...SNIP...]

</rdf:RDF>

I suspect this to be an issue with the namespaces because this document has 4 namespaces. The namespaces are

  • (this is the default) xmlns="http://purl.org/rss/1.0/"
  • xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  • xmlns:dc="http://purl.org/dc/elements/1.1/"
  • xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0"

I have tried to use Nokogiri with this (another XML parser that I use for Ruby scripts). With Nokogiri I could just use the XPath //xmlns:item, which works and returns all item nodes from the feed.

I have tried the same XPath with HtmlUnit but it does not work.

So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?

Any ideas?

Solution

> From this feed I want to read all item nodes, so normally a //item XPath should do the trick. Unfortunately that does not work in this case.

In XPath, //item means "select all elements whose local name is item and that are in no namespace". In RSS 1.0, which this feed uses, the item elements must be in the http://purl.org/rss/1.0/ namespace. So //item should never match them with a conforming XML parser and XPath engine.

What's confusing is that in XML, <item> means "an element named item that is in the default namespace, i.e. whatever default namespace is in scope at this place in the document;" whereas in XPath, "item" means an element in no namespace. (Or, you could say, it means an element in the default namespace, but unless you have a way to tell XPath what the default namespace is, the default namespace is no namespace. Usually (always?) in XPath 1.0 there is no way to declare the default namespace for XPath expressions.)
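The mismatch can be reproduced outside HtmlUnit. The following is a minimal sketch using the JDK's built-in javax.xml.xpath API (XPath 1.0), not HtmlUnit's own API; the class name, the helper methods and the tiny <feed> documents are made up for illustration. With no prefixes registered, //item finds nothing in a document whose default namespace is the RSS 1.0 namespace, and finds both items once the namespace declaration is dropped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class UnprefixedNameDemo {

    // Parse with namespace awareness switched on (JAXP leaves it off by default).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Count the nodes selected by an XPath 1.0 expression; no prefixes are registered.
    static int count(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, parse(xml), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical miniature documents, not the real feed.
        String inDefaultNamespace =
                "<feed xmlns='http://purl.org/rss/1.0/'><item/><item/></feed>";
        String inNoNamespace = "<feed><item/><item/></feed>";

        System.out.println(count(inDefaultNamespace, "//item")); // prints 0
        System.out.println(count(inNoNamespace, "//item"));      // prints 2
    }
}
```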

The other confusing thing to beginners is that the namespace prefix mappings in the source XML document are not considered significant by the XPath processor. When the XML document is parsed, a data structure is built that remembers the name and namespace of every element (and other nodes). The namespace prefixes used, including the empty prefix of the default namespace, are considered mere syntactic convenience. More on this below...
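To make that concrete, here is a small sketch with the JDK's DOM parser (again not HtmlUnit; the class, method and constant names are invented for the example). It parses two hypothetical miniature feeds, one that puts item in the default namespace and one that uses an rss: prefix, and prints what the parser recorded for each element. Only the prefix differs; the local name and namespace URI, which are what XPath compares, are the same.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class PrefixDemo {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Same vocabulary, two spellings: default namespace versus an explicit prefix.
    static final String DEFAULT_NS_FEED =
            "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns='" + RSS_NS + "'><item/></rdf:RDF>";
    static final String PREFIXED_FEED =
            "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:rss='" + RSS_NS + "'><rss:item/></rdf:RDF>";

    // Return the first element named "item" in the RSS 1.0 namespace, whatever its prefix.
    static Element firstRssItem(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagNameNS(RSS_NS, "item").item(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static String describe(Element e) {
        return "prefix=" + e.getPrefix()
                + " localName=" + e.getLocalName()
                + " namespaceURI=" + e.getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(describe(firstRssItem(DEFAULT_NS_FEED)));
        System.out.println(describe(firstRssItem(PREFIXED_FEED)));
    }
}
```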

> With Nokogiri I could just use the XPath //xmlns:item which works and returns all nodes from the feed.

Whatever that is, it's not XPath. Maybe it's a Nokogiri extension to it (a very convenient one, but its syntax is really counter-intuitive).

> So I think I can phrase my question as: How can I select a node from the default namespace with HtmlUnit?

Let's phrase it as: How can I select the RSS item elements with HtmlUnit? I phrase it that way because the RSS spec (actually in general any conforming XML vocabulary spec) does not require that its elements will be in the default namespace. That happens to be true in the sample you received, but the service provider could change that tomorrow and still be perfectly conformant to RSS. Tomorrow, the service provider could use the "rss" namespace prefix for that namespace; or any other arbitrary prefix. What RSS does specify is what namespace its elements will be in: the namespace whose URI is http://purl.org/rss/1.0/.

It's kind of like asking, "How do I write a function (in JavaScript, C, Java, etc.) that can tell me the value of the variable a?" Usually a function has no idea what variable name was used for what in the caller. All it knows are the values of its arguments. If you call sqrt(4), you'll get the same answer as with a = 4; sqrt(a) or rumpelstiltzkin = 4; sqrt(rumpelstiltzkin). Clearly, the name of the variable passed as the argument has no direct effect on the result of the function call. It just needs to be the name of a variable that holds the right value. If a compiler complained because you wrote b = 4; return sqrt(b) instead of using a, you'd think that compiler was nuts. It's not supposed to care about variable names as long as you use valid identifiers.

In the same way, when processing RSS, we're not supposed to care about what namespace prefix is used, as long as it's a prefix that identifies the right namespace. It could be no prefix (which identifies the default namespace).

In XPath 2.0, you can wildcard the namespace. This is very handy if you know you're not going to need namespaces for disambiguation. In that case you can select //*:item. However, I don't think HtmlUnit supports XPath 2.0. Also, in XPath 2.0 environments like XSLT 2.0, you can specify a default namespace for XPath expressions, but that won't help you in HtmlUnit.

So you have a couple of choices:

  • Use an XPath expression that ignores namespaces, such as //*[local-name() = 'item'].

or

  • The robust way: Register a namespace prefix for http://purl.org/rss/1.0/ and use it in your XPath expression: //rss:item. The question then becomes, how do you register a namespace prefix in HtmlUnit and pass it to the XPath processor? I took a quick look in the docs and didn't find any facility for doing that.
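Both options can be tried with the JDK's standard javax.xml.xpath API. This is only a sketch of the two techniques: it does not use HtmlUnit, and it makes no claim about whether HtmlUnit exposes an equivalent hook. The class name, the miniature feeds and the extra x:item element in an unrelated namespace (urn:example:other) are assumptions added for illustration. The local-name() expression matches every element called item, including the unrelated one, while //rss:item with a registered prefix matches only the RSS 1.0 items, whether the feed uses the default namespace or a prefix of its own.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SelectRssItemsDemo {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RSS_NS = "http://purl.org/rss/1.0/";

    // Hypothetical feeds: one RSS item plus one "item" from an unrelated namespace.
    static final String DEFAULT_NS_FEED = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns='" + RSS_NS
            + "'><item/><x:item xmlns:x='urn:example:other'/></rdf:RDF>";
    static final String PREFIXED_FEED = "<rdf:RDF xmlns:rdf='" + RDF_NS + "' xmlns:r='" + RSS_NS
            + "'><r:item/><x:item xmlns:x='urn:example:other'/></rdf:RDF>";

    // Binds the prefix "rss" (our choice, independent of the feed) to the RSS 1.0 namespace.
    static final NamespaceContext RSS_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "rss".equals(prefix) ? RSS_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String namespaceURI) {
            return RSS_NS.equals(namespaceURI) ? "rss" : null;
        }
        @Override public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return (prefix == null
                    ? Collections.<String>emptyList()
                    : Collections.singletonList(prefix)).iterator();
        }
    };

    // Count the nodes an expression selects, with the "rss" prefix available to it.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(RSS_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String feed : new String[] {DEFAULT_NS_FEED, PREFIXED_FEED}) {
            System.out.println(count(feed, "//*[local-name() = 'item']")); // prints 2
            System.out.println(count(feed, "//rss:item"));                 // prints 1
        }
    }
}
```

If the namespace-ignoring form is used but the stray match matters, the predicate can be tightened to //*[local-name() = 'item' and namespace-uri() = 'http://purl.org/rss/1.0/'], which is plain XPath 1.0 and needs no prefix registration.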

Caveat: I should add that the above is in regard to conforming XPath processors. I have no idea what XPath processor HtmlUnit uses. There are some XPath processors out there that ignore the specs and make the world more confusing for everybody.

I have seen someone use the following syntax for elements in the default namespace in HtmlUnit:

//:item

But I wouldn't recommend that, for three reasons:

  1. It's not valid XPath, so you can't expect it to work with other programs.

  2. It will only work on RSS feeds that declare the RSS namespace to be the default namespace. RSS feeds that use a namespace prefix will cause the above to fail.

  3. It will hold you back from learning how XML namespaces really work, and it will help preserve the status quo of tools that don't adequately support namespaces.

HtmlUnit is primarily designed for HTML, so incomplete handling of XML is understandable. But claiming to support XPath and then not providing ways to declare namespace prefixes is a bug. HtmlUnit uses an XPath package that seems to be part of Xalan-J. That package has ways to provide namespace mappings to XPath, but I don't know if HtmlUnit exposes that functionality.
