XPath text with children


Problem description

Given this html:

<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>

How can I use XPath to get the following result:

[
    'This is a link',
    'This is another link.'
]

What I've tried:

//ul/li/text()

But this gives me ['This is ', 'This is .'] (without the text in the <a> tags)

Also:

string(//ul/li)

But this gives me ['This is a link'] (so only the first element)

Also

//ul/li/descendant-or-self::text()

But this gives me ['This is ', 'a link', 'This is ', 'another link', '.']

Any further ideas?

Solution

XPath generally cannot select what is not there. These things do not exist in your HTML:

[
    'This is a link',
    'This is another link.'
]

They might exist conceptually on the higher abstraction level that is the browser's rendering of the source code, but strictly speaking even there they are separate, for example in color and functionality.

On the DOM level there are only separate text nodes and that's all XPath can pick up for you.
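
To make the separate text nodes visible, here is a minimal sketch that uses Python's standard-library xml.dom.minidom instead of Scrapy. It assumes the snippet is well-formed XML, and describe_children is a hypothetical helper name introduced only for this illustration.

```python
from xml.dom import minidom

HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""


def describe_children(element):
    """Return a (kind, value) pair for each direct child node of `element`."""
    described = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            described.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            described.append(("element", node.tagName))
    return described


document = minidom.parseString(HTML)
first_li, second_li = document.getElementsByTagName("li")

print(describe_children(first_li))
# [('text', 'This is '), ('element', 'a')]
print(describe_children(second_li))
# [('text', 'This is '), ('element', 'a'), ('text', '.')]
```

The link text lives inside the <a> element, not in the <li> itself, which is why no single node holds the whole sentence.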

Therefore you have three options.

  1. Select the text() nodes and join their individual values in Python code.
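  2. Select the <li> elements and, for each of them, evaluate string(.) or normalize-space(.) with Scrapy. normalize-space() also trims and collapses whitespace, which is usually what you want here.
  3. Select the <li> elements and read their text through your library's API, which internally finds all descendant text nodes and joins them for you. The exact name depends on the library: for example .text_content() in lxml.html or .text in BeautifulSoup, whereas .text on a plain lxml element returns only the text before the first child.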

Personally I would go for the last option, with //ul/li as my basic XPath expression, as this would result in a cleaner solution.
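
Options 1 and 3 come down to the same operation: collect every descendant text node of each <li> and join the pieces in Python. A minimal sketch of that idea follows. It uses the standard-library xml.etree.ElementTree rather than Scrapy, assumes the snippet is well-formed XML, and li_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""


def li_texts(markup):
    """Return the joined, whitespace-collapsed text of every <li> in `markup`."""
    root = ET.fromstring(markup)
    results = []
    for li in root.iter("li"):
        # itertext() yields the text of the element and all its descendants
        # in document order, including the text that follows child elements.
        joined = "".join(li.itertext())
        # Collapse runs of whitespace, roughly what normalize-space() does.
        results.append(" ".join(joined.split()))
    return results


print(li_texts(HTML))
# ['This is a link', 'This is another link.']
```

Note that str.split() treats a few more characters as whitespace than XPath's normalize-space() does, so the two are close but not identical.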


As @paul points out in the comments, Scrapy offers a nice fluent interface to do multiple processing steps in one line of code. The following code implements variant #2:

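import scrapy

# Build a selector directly from the sample markup.
selector = scrapy.Selector(text='''<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>''')

# Evaluate normalize-space() once per <li>, so each item yields one string.
selector.css('ul > li').xpath('normalize-space()').extract()
# --> ['This is a link', 'This is another link.']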
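
The same per-element evaluation can be tried without Scrapy. The sketch below assumes lxml is installed and that the snippet is well-formed XML, so that lxml.etree can parse it. It also shows why string(//ul/li) returned a single value: applied to a node-set, string() converts only the first node in document order.

```python
from lxml import etree

HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""

root = etree.fromstring(HTML)

# string() on a node-set uses only the first matching <li>.
first_only = root.xpath("string(//ul/li)")

# Evaluating normalize-space() with each <li> as the context node
# produces one string per element.
per_item = [li.xpath("normalize-space()") for li in root.xpath("//ul/li")]

print(first_only)  # This is a link
print(per_item)    # ['This is a link', 'This is another link.']
```

The key point is that the function is evaluated once per selected element rather than once for the whole document.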
