Direct text contents via XPath?


Question

//*/text()[string-length() > 100]

...almost works, except it also selects script and style tags in the html document, and it stops text selection as it encounters a <br> or other tag.

I want to find elements that contain text directly, and the text is greater than 140 chars and text for that entire element should be selected (sometimes the text is further inside span).

Solution

You need to understand the difference between text() nodes and string values in XPath.

  • text() selects text nodes in XPath. The br elements shown in your selection form mixed content in the parent element: text() nodes and elements mixed together.
  • string() is an XPath function that returns the string value of an XPath expression. To get a string that ignores the br elements, select the parent div and either directly take its string value via string() or implicitly get its string value by using the expression in a context where a conversion to string is implied.
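The distinction can be seen outside XPath as well. The following is a minimal sketch using Python's standard xml.etree.ElementTree, which is an assumption of this example and not part of the original answer. ElementTree has no text node objects, so the direct text children of an element are approximated by its .text plus the .tail of each child, and the string value by joining itertext().

```python
import xml.etree.ElementTree as ET

# Mixed content: text and an element interleaved inside <a>.
a = ET.fromstring("<a>This is a <b>test</b> of mixed content.</a>")

# Rough equivalent of a/text(): the leading text of <a> plus the tail
# that follows each child element.
text_nodes = [a.text] + [child.tail for child in a]
text_nodes = [t for t in text_nodes if t]

# Rough equivalent of string(a): all descendant text in document order.
string_value = "".join(a.itertext())

print(text_nodes)
print(string_value)
```

The text nodes stop at the b element, while the string value runs straight through it, which is the behavior the question was missing.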

With that background, your statement,

I want to find elements that contain text directly, and the text is greater than 140 chars and text for that entire element should be selected (sometimes the text is further inside span).

can be rephrased as

I want to find elements with text() node children and whose string value has a length greater than 140.

Let's look at some sample XML,

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

and let's reduce the 140 to 8 to make it more manageable, then

//*[text()][string-length() > 7]

captures the rephrased requirement and selects four elements:

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

<a>This is a <b>test</b> of mixed content.</a>

<c>asdf asdf asdf asdf</c>

<d>asdf asdf</d>

Notice that it did not select b because its string value, test, is only 4 characters long, which is not greater than 7.

Notice also that r is selected due to whitespace-only text() between the elements. To eliminate such elements, add an additional predicate to text():

//*[text()[normalize-space()]][string-length() > 7]

Then, only a, c, and d will be selected.
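To check the two predicates without an XPath engine, here is a rough emulation in Python's standard xml.etree.ElementTree. The helper names text_children and select are hypothetical, and str.strip() only approximates normalize-space(); treat it as a sketch of the selection logic, not a replacement for the XPath.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

def text_children(elem):
    """Direct text children of elem: its text plus the tail of each child."""
    parts = [elem.text] + [child.tail for child in elem]
    return [p for p in parts if p]

def select(root, min_len, ignore_blank):
    """Emulate //*[text()][string-length() > min_len].

    With ignore_blank=True, whitespace-only text children are discarded
    first, as in text()[normalize-space()].
    """
    hits = []
    for elem in root.iter():  # document order, root element included
        texts = text_children(elem)
        if ignore_blank:
            texts = [t for t in texts if t.strip()]
        if texts and len("".join(elem.itertext())) > min_len:
            hits.append(elem.tag)
    return hits

root = ET.fromstring(XML)
with_blank = select(root, 7, ignore_blank=False)
without_blank = select(root, 7, ignore_blank=True)
print(with_blank)
print(without_blank)
```

The first call includes r because of the indentation between its children; the second drops it.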

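If you want text only, in XPath 1.0 you can wrap the expression in string(). Be aware that string() applied to a node-set returns the string value of only the first node in document order (here, a), not of all selected elements: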

string(//*[text()[normalize-space()]][string-length() > 7])

If you want a collection of strings, in XPath 1.0, you'll need to iterate over the elements via the language calling XPath, but in XPath 2.0, you can add a string() step at the end:

//*[text()[normalize-space()]][string-length() > 7]/string()

to get a sequence of three separate strings:

This is a test of mixed content.
asdf asdf asdf asdf
asdf asdf
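For the XPath 1.0 case, iterating in the calling language might look like the sketch below. It assumes Python with the standard xml.etree.ElementTree, whose XPath subset lacks string-length() and normalize-space(), so the predicates are written in Python instead; has_nonblank_text_child is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

def has_nonblank_text_child(elem):
    """True if elem has a direct text child that is not only whitespace."""
    parts = [elem.text] + [child.tail for child in elem]
    return any(p and p.strip() for p in parts)

root = ET.fromstring(XML)

# One string value per matching element, in document order.
strings = [
    "".join(elem.itertext())
    for elem in root.iter()
    if has_nonblank_text_child(elem) and len("".join(elem.itertext())) > 7
]

for s in strings:
    print(s)
```

With a full XPath 1.0 engine, the same loop would run over the node-set returned by the expression and call string() on each node.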
