Using XPath to extract XOM elements from documents with unnecessary namespaces
Problem description
I'm trying to parse some HTML returned by an external system with XOM. The HTML looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
<div>
Help I am trapped in a fortune cookie factory
</div>
</body>
</html>
(Actually it's significantly messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the HTML above exhibits the same problem as the real HTML.)
What I want to do is extract the content of the <div>, but the namespace declaration seems to be confusing XPath. If I strip out the namespace declaration (by hand, from the file), the following code finds the <div>, no problem:
Document document = ...
Nodes divs = document.query("//div");
But with the namespace, the returned Nodes has a size of 0.
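The effect can be reproduced without XOM. The following is a minimal sketch that uses the JDK's built-in `javax.xml.xpath` engine as a stand-in (an assumption made so the example runs with no extra libraries; it is not XOM's API). The class name `UnprefixedQueryDemo` and the shortened sample documents are hypothetical. It parses the same markup with and without the default namespace and counts the matches for `//div`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnprefixedQueryDemo {
    // Shortened versions of the document from the question (no DOCTYPE, so no DTD is fetched).
    static final String WITH_NS =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><div>fortune</div></body></html>";
    static final String WITHOUT_NS =
        "<html><body><div>fortune</div></body></html>";

    /** Parses the XML namespace-aware and returns how many nodes the XPath selects. */
    static int count(String xml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XOM always treats documents as namespace-aware; mirror that here.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(WITHOUT_NS, "//div")); // prints 1
        System.out.println(count(WITH_NS, "//div"));    // prints 0
    }
}
```

In XPath 1.0 an unprefixed name such as `div` only matches elements in no namespace, so the `<div>` that inherits the XHTML default namespace is skipped.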
All right, how about if I strip the namespace programmatically?
Element rootElement = document.getRootElement();
rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());
...looks like it should work, but does nothing. From the javadoc:
This method only removes additional namespaces added with addNamespaceDeclaration.
Okay, I thought, I'll provide the namespace to the query:
XPathContext context =
XPathContext.makeNamespaceContext(document.getRootElement());
Nodes divs = document.query("//div", context);
The size is still zero.
How about constructing the namespace context by hand?
XPathContext context = new XPathContext(
rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
Nodes divs = document.query("//div", context);
The XPathContext constructor blows up:
nu.xom.NamespaceConflictException:
XPath expressions do not use the default namespace
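So I'm looking for any of:
- a way to make this query work as is, or
- a way to strip the namespace declaration programmatically, or
- an explanation of the right way to do it, assuming both of these are wrong.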
Update: Based on Lev Levitsky's answer and the Jaxen FAQ I came up with the following hack:
XPathContext context = new XPathContext(
"foo",
document.getRootElement().getNamespaceURI());
Nodes divs = document.query("//foo:div", context);
This still seems a bit demented to me, but I guess it's the way Jaxen wants you to do things.
Update #2: As noted below and all over the Internet, this isn't Jaxen's fault; it's just XPath being XPath.
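To illustrate that the prefix requirement comes from XPath rather than from Jaxen, here is a rough equivalent of the hack using the JDK's own XPath engine instead of XOM. This is a sketch under assumptions: the class name `PrefixMappingDemo` and the helper `countDivs` are made up for illustration, and `javax.xml.namespace.NamespaceContext` plays the role that `XPathContext` plays in XOM. The prefix `foo` is arbitrary and does not have to appear anywhere in the document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixMappingDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Maps the single, arbitrary prefix "foo" to the XHTML namespace. */
    static final NamespaceContext FOO_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "foo".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "foo" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                    ? Collections.singletonList("foo").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    /** Counts XHTML div elements by querying with the mapped prefix. */
    static int countDivs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(FOO_CONTEXT); // the query's prefixes are resolved here
            NodeList hits = (NodeList) xpath.evaluate("//foo:div", doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns=\"" + XHTML_NS + "\"><body><div>fortune</div></body></html>";
        System.out.println(countDivs(xml)); // prints 1
    }
}
```

The document declares the namespace as a default (no prefix), yet the query still has to pick some prefix and bind it to the same URI. Only the URI has to match.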
So, while this hack works, I would still like a way to strip the namespace declaration. Preferably without going as far as XSLT.
Recommended answer
You should either specify the namespace directly with something like
Nodes divs = document.query("//{http://www.w3.org/1999/xhtml}div");
or use prefixes that are mapped to their respective namespaces (I guess that is what NamespaceContext is for, but there are no prefixes in your query).
Unfortunately, I don't know how it's done in Java, but I can provide a Python example if that helps.
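For a Java angle on specifying the namespace directly: as far as I can tell, the `{uri}name` form is a convention of some Python libraries rather than part of the XPath 1.0 grammar, so an XPath 1.0 engine may reject it. A prefix-free alternative that is plain XPath 1.0 is a predicate on `local-name()` and `namespace-uri()`. The sketch below uses the JDK's `javax.xml.xpath` engine rather than XOM (an assumption made to keep it dependency-free); the class name `NamespaceUriPredicateDemo` and the helper `firstDivText` are hypothetical. Since the expression is ordinary XPath 1.0, the same query string should also be usable with `document.query(...)` in XOM, though that is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceUriPredicateDemo {
    // Selects div elements in the XHTML namespace without needing any prefix mapping.
    static final String QUERY =
        "//*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml']";

    /** Returns the whitespace-normalized text of the first matching div, or "" if none. */
    static String firstDivText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // normalize-space() takes the string value of the first node in the set.
            return XPathFactory.newInstance().newXPath()
                    .evaluate("normalize-space(" + QUERY + ")", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
            + "<body><div>\n  Help I am trapped in a fortune cookie factory\n</div></body></html>";
        System.out.println(firstDivText(xml));
        // prints: Help I am trapped in a fortune cookie factory
    }
}
```

The predicate is more verbose than a prefixed name, but it needs no context object at all, which can be convenient for one-off queries.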