Using XPath to extract XOM elements from documents with unnecessary namespaces

Views: 137 Posted: 2020/7/28 6:05:02 Tags: xpath xml-namespaces xom


Problem description

I'm trying to parse some HTML returned by an external system with XOM. The HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body>
  <div>
    Help I am trapped in a fortune cookie factory
  </div>
</body>
</html>

(Actually it's significantly messier, but it has this DOCTYPE declaration and these namespace and language declarations, and the HTML above exhibits the same problem as the real HTML.)

What I want to do is extract the content of the <div>, but the namespace declaration seems to be confusing XPath. If I strip out the namespace declaration (by hand, from the file), the following code finds the <div>, no problem:

Document document = ...
Nodes divs = document.query("//div");

But with the namespace declaration in place, the returned Nodes has a size of 0.

All right, how about if I strip the namespace programmatically?

Element rootElement = document.getRootElement();
rootElement.removeNamespaceDeclaration(rootElement.getNamespacePrefix());

...looks like it should work, but does nothing. From the javadoc:

This method only removes additional namespaces added with addNamespaceDeclaration.

Okay, I thought, I'll provide the namespace to the query:

XPathContext context = 
    XPathContext.makeNamespaceContext(document.getRootElement());
Nodes divs = document.query("//div", context);

Still a size of zero.

How about constructing the namespace context by hand?

XPathContext context = new XPathContext(
    rootElement.getNamespacePrefix(), rootElement.getNamespaceURI());
Nodes divs = document.query("//div", context);

The XPathContext constructor blows up:

nu.xom.NamespaceConflictException: 
    XPath expressions do not use the default namespace

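So I'm looking for:

a way to make that query work, or
a way to strip the namespace declaration programmatically, or
an explanation of the right approach, assuming both of these are wrong.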

Update: Based on Lev Levitsky's answer and the Jaxen FAQ I came up with the following hack:

XPathContext context = new XPathContext(
    "foo", 
    document.getRootElement().getNamespaceURI());
Nodes divs = document.query("//foo:div", context);

This still seems a bit demented to me, but I guess it's the way Jaxen wants you to do things.

Update #2: As noted below and all over the Internet, this isn't Jaxen's fault; it's just XPath being XPath.
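To illustrate that this behavior comes from XPath 1.0 itself rather than from Jaxen or XOM, here is a minimal sketch that runs the same two queries through the JDK's built-in javax.xml.xpath API instead of XOM. The class name DefaultNamespaceDemo, the helper count, and the trimmed-down XML string (DOCTYPE omitted so the parser does not try to fetch the DTD) are assumptions made for the example, not part of the original code.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the document in the question, minus the DOCTYPE.
    static final String XML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Counts the nodes selected by the given XPath 1.0 expression.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the arbitrary prefix "foo" to the XHTML namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "foo".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "foo" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.<String>emptyIterator();
            }
        });

        NodeList nodes =
            (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed name test: means "div in no namespace", so nothing matches.
        System.out.println(count("//div"));
        // Prefixed name test: "foo" resolves to the XHTML namespace, so the div matches.
        System.out.println(count("//foo:div"));
    }
}
```

The prefix in the expression does not have to match anything in the document; only the namespace URI it is bound to matters, which is why the made-up "foo" prefix works.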

So, while this hack works, I would still like a way to strip the namespace declaration. Preferably without going as far as XSLT.
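One option that avoids both the prefix hack and any stripping is to match on the local name only, using the standard XPath 1.0 function local-name(). This does not remove the namespace declaration; it just ignores namespaces in the query. The sketch below again uses the JDK's javax.xml.xpath API so that it runs without extra dependencies; the class name LocalNameDemo and its helper methods are assumptions for the example. Since the expression is plain XPath 1.0, the same string should also be usable with XOM's document.query, though that is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LocalNameDemo {
    // Same shape as the document in the question, minus the DOCTYPE.
    static final String XML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"
        + "<body><div>Help I am trapped in a fortune cookie factory</div></body></html>";

    // Matches any element whose local name is "div", whatever its namespace.
    static final String DIV_BY_LOCAL_NAME = "//*[local-name()='div']";

    static Document parse() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
    }

    // Number of div elements found without declaring any prefix.
    static int countDivs() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            DIV_BY_LOCAL_NAME, parse(), XPathConstants.NODESET);
        return nodes.getLength();
    }

    // Whitespace-normalized text of the first matching div.
    static String firstDivText() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(
            "normalize-space(" + DIV_BY_LOCAL_NAME + ")", parse());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countDivs());
        System.out.println(firstDivText());
    }
}
```

The trade-off is that local-name() matches a div in any namespace, so it is looser than the prefixed query; for a document with a single namespace, as here, the two select the same elements.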
