Why is XPath unclean constructed? Why is text() not needed in predicate?


Question

Assume I have:

<A>
  <B>C</B>
  <D>E</D>
</A>

Then I can output the B-element (including tags) with:

//B

Which will return

<B>C</B>

But why is text() not needed in a predicate? The following 2 lines give the same output:

/A[B = 'C']/D
/A[B/text() = 'C']/D

If XPATH was cleanly constructed I would expect it would be (or in some kind of other element structure):

/A[B = <B>C</B>]/D

and:

/A[B/text()='C']/D

Can someone give me a rationale why text() is needed for output, but it is not needed for predicates?

Answer

I think it's a reasonable and natural question. I would rather see people asking conceptual questions like this, to understand how XPath works, than settle for a shallow understanding of XPath and end up asking shallow questions about why their XPath expression didn't do what they expected in scraping data from a certain web page.

Let's clear up some terms first. By "output", I assume you mean the same as "return": the value that an XPath expression selects. (XPath per se has no direct output capability.) By "cleanly constructed" I'm going to assume you mean "simply and consistently designed."

The short answer is that XPath is consistent, but like most flexible and powerful tools, it's not simple.

Next, we might need to ask which version of XPath you're thinking of. There are large differences between versions 1, 2, and 3. I will focus on XPath 1.0 because it's the most well-known and widely implemented, and I don't know 2.0 or 3.0 as well.

The B means the same thing whether it's in a predicate or not. Both in //B and in /A[B = 'C'], it's a node test. It matches (selects) element nodes named B. XPath knows nothing about tags. It operates on an abstract tree document model. An XPath expression can select elements and other nodes, but never tags.

So I think your question then reduces to, why does /A[B = 'C']/D succeed in selecting the D element in the XML sample you provided, when B selects an element rather than just the text 'C'? To reduce it further, why does B = 'C' evaluate as true for element A, when B is an element and not merely a text node containing 'C'?

The answer is, when performing comparisons such as =,

"If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true" [emphasis added].

In other words, the sub-expression B could select multiple element nodes here, if /A had multiple child elements named B. (In this case, there is only one such child element.) To evaluate the expression B = 'C', XPath looks at the string value of each node selected by B. According to the docs,

"The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order."

In this case, the only text node descendant of the B element node is the text node whose string-value is 'C'. Therefore the string-value of B is 'C', and so the predicate [B = 'C'] is true for element /A.
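To make those two rules concrete, here is a minimal Python sketch that emulates them with the standard library's xml.etree.ElementTree. It is not an XPath engine; the helper names string_value and nodeset_equals_string are made up for illustration, and the sketch assumes the sample document from the question.

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """Emulate the XPath 1.0 string-value of an element node:
    the concatenation of all descendant text, in document order."""
    return "".join(elem.itertext())


def nodeset_equals_string(nodes, literal):
    """Emulate `node-set = string`: true if at least one node in the
    set has a string-value equal to the literal."""
    return any(string_value(node) == literal for node in nodes)


doc = ET.fromstring("<A><B>C</B><D>E</D></A>")  # doc is the A element

b_nodes = doc.findall("B")  # the node-set that B selects relative to A
predicate = nodeset_equals_string(b_nodes, "C")  # emulates [B = 'C']

# /A[B = 'C']/D: the D children are selected only if the predicate holds for A.
selected = doc.findall("D") if predicate else []
print(predicate, [d.tag for d in selected])
```

The comparison never looks at B as a tag or as markup; it only looks at the string that B's subtree flattens to.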

Why does XPath define the string value of an element node in this way? I'm guessing it's partly because of the convenience in the case of single text nodes, but when it comes to free-form marked-up text, like

<p>HTML that <em>could</em> have <b>arbitrary <tt>nesting</tt></b></p>

whose markup you sometimes want to ignore for certain purposes, it can be very handy to quickly retrieve the concatenation of all descendant text nodes.
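As a rough illustration of that convenience, the following Python sketch flattens the paragraph above using ElementTree's itertext(), which walks the descendant text in document order. It only approximates what an XPath 1.0 processor would return from string(.) on the p element.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>HTML that <em>could</em> have <b>arbitrary <tt>nesting</tt></b></p>"
)

# Concatenate every descendant text node, ignoring the markup around it.
flat = "".join(p.itertext())
print(flat)  # HTML that could have arbitrary nesting
```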

The other part of your question was, why wouldn't you write

/A[B = <B>C</B>]/D

or

/A[B/text()='C']/D

The second one has the shortest answer: you can. It's just a little less convenient, and less powerful, but it is more explicit and precise. It won't always give you the same results, because this version doesn't ask about the string-value of B. It asks whether any B has a text node child whose value is 'C', instead of asking whether any B has descendant text nodes whose concatenation yields 'C'.
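A small Python sketch can show where the two forms diverge. It emulates both comparisons with ElementTree rather than running real XPath, the helpers string_value, child_text_nodes and compare are hypothetical names, and the two extra documents (B wrapping an i element) are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def string_value(elem):
    """All descendant text of elem, concatenated in document order."""
    return "".join(elem.itertext())


def child_text_nodes(elem):
    """Text node children of elem only (no deeper descendants),
    roughly what the step text() selects relative to elem."""
    parts = [elem.text] + [child.tail for child in elem]
    return [text for text in parts if text]


def compare(xml):
    """Return (B = 'C', B/text() = 'C') evaluated against the root A."""
    a = ET.fromstring(xml)
    b_nodes = a.findall("B")
    by_string_value = any(string_value(b) == "C" for b in b_nodes)
    by_text_child = any(
        text == "C" for b in b_nodes for text in child_text_nodes(b)
    )
    return by_string_value, by_text_child


plain = compare("<A><B>C</B><D>E</D></A>")           # (True, True)
wrapped = compare("<A><B><i>C</i></B><D>E</D></A>")  # (True, False)
mixed = compare("<A><B>C<i>x</i></B><D>E</D></A>")   # (False, True)
print(plain, wrapped, mixed)
```

In the wrapped case the text 'C' is a grandchild of B, so only the string-value comparison sees it. In the mixed case B does have a text node child 'C', but its string-value is 'Cx'.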

As for /A[B = <B>C</B>]/D, XPath (1.0 at least) wasn't designed with a syntax for creating new nodes, such as <B>C</B>. But even if it were, what would B = <B>C</B> mean? You obviously aren't asking for an identity comparison but a sort of structural equivalence. The XPath definers would have to create a semantics of comparison where a comparison between two node-sets, or between a node-set and a newly defined type such as "structural template", is true if and only if (for example) there is a node in the (first) node-set that recursively matches the structure of the structural template, or of a node in the second node-set. But instead they defined it as follows,

"If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true."
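The rule quoted above can be sketched in Python as an existential check over pairs of nodes. This is only an emulation built on ElementTree, nodesets_equal is a hypothetical helper name, and the F and G elements are added to the sample purely to show that structure plays no part in the comparison.

```python
import xml.etree.ElementTree as ET
from itertools import product


def string_value(elem):
    """All descendant text of elem, concatenated in document order."""
    return "".join(elem.itertext())


def nodesets_equal(left, right):
    """Emulate `node-set = node-set`: true if some pair of nodes,
    one from each set, has equal string-values."""
    return any(
        string_value(l) == string_value(r) for l, r in product(left, right)
    )


# F and G are hypothetical additions to the sample document.
doc = ET.fromstring("<A><B>C</B><D>E</D><F><G>C</G></F></A>")

b_equals_f = nodesets_equal(doc.findall("B"), doc.findall("F"))  # emulates B = F
b_equals_d = nodesets_equal(doc.findall("B"), doc.findall("D"))  # emulates B = D
print(b_equals_f, b_equals_d)
```

B and F compare as equal even though one holds a text node and the other holds a G element, because both flatten to the string 'C'.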

Given that they can only choose one of the two definitions for comparison of node-sets, why did they choose the latter instead of the definition you expected? I'm not privy to the proceedings of the XPath committee, but I suspect it came down to the latter definition being more in line with the most common use cases they had analyzed, with consideration also given to performance and simplicity of implementation.

I agree that this definition is not the most obvious way to define = comparison. But I think the designers were right, that comparing whole node tree structures is not a very common use case, whereas the common use cases (such as the one you gave) are well-covered by the tools that XPath does provide. For example, it's very simple in XPath to ask whether there is an A element that is a child of the root node, that has a child B element, whose text value (ignoring all sub-markup for the moment) is 'C'.
