XPath text with children

Question

Given this HTML:

<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>

How can I use XPath to get the following result:

[
    'This is a link',
    'This is another link.'
]

What I have tried:

//ul/li/text()

But this gives me ['This is ', 'This is .'] (without the text in the a tags).

Also:

string(//ul/li)

But this gives me ['This is a link'] (so only the first element).

//ul/li/descendant-or-self::text()

But this gives me ['This is ', 'a link', 'This is ', 'another link', '.'].

Any other ideas?

Recommended answer

XPath generally cannot select what is not there. These strings do not exist in your HTML:

[
    'This is a link',
    'This is another link.'
]

They might exist conceptually at a higher level of abstraction, namely the browser's rendering of the source code. Strictly speaking, though, the pieces are separate even there: the link text differs from the surrounding text in color and in function, for example.

On the DOM level there are only separate text nodes, and that is all XPath can pick up for you.
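To make the "separate text nodes" point concrete, here is a small sketch. It uses Python's standard-library xml.etree.ElementTree instead of Scrapy, which is an assumption of this example and works only because the snippet happens to be well-formed XML. ElementTree stores the pieces as .text and .tail attributes and not as text nodes, but they are the same pieces that text() selects. The helper name text_pieces is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it parses as XML because it is well formed.
HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""

root = ET.fromstring(HTML)


def text_pieces(element):
    """Return the separate text pieces inside `element`, in document order."""
    return list(element.itertext())


# One list of pieces per <li>; nothing has been joined yet.
pieces_per_li = [text_pieces(li) for li in root.findall("li")]
for pieces in pieces_per_li:
    print(pieces)
```

Each <li> yields two or three separate strings, and none of them is the complete sentence the question asks for.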

Therefore you have three options.

  1. Select the text() nodes and join their individual values in Python code.
  2. Select the <li> elements and, for each of them, evaluate string(.) or normalize-space(.) with Scrapy. normalize-space() handles whitespace the way you would expect.
  3. Select the <li> elements and access their .text property, which internally finds all descendant text nodes and joins them for you (where the library provides one that works this way, as BeautifulSoup does; in lxml the equivalent is the text_content() method).
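The first option, joining the text pieces per <li> in Python, might look roughly like the sketch below. It uses the standard-library xml.etree.ElementTree as a stand-in for a full XPath engine, which works only because this snippet is well-formed XML. The function name li_texts is made up for illustration. The whitespace cleanup with split() and join() only approximates XPath's normalize-space(), since str.split() treats a somewhat broader set of characters as whitespace.

```python
import xml.etree.ElementTree as ET

HTML = """<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>"""


def li_texts(markup):
    """Return one whitespace-normalized string per <li> element."""
    root = ET.fromstring(markup)
    results = []
    for li in root.iter("li"):
        # Join every descendant text piece of this <li>, in document order.
        joined = "".join(li.itertext())
        # Collapse runs of whitespace and trim both ends.
        results.append(" ".join(joined.split()))
    return results


result = li_texts(HTML)
print(result)  # ['This is a link', 'This is another link.']
```

The key design point is that the join happens per <li> element, not over one flat list of all text nodes, so the element boundaries are preserved.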

Personally, I would go for the last option, with //ul/li as my basic XPath expression, as this results in a cleaner solution.

As @paul points out in the comments, Scrapy offers a nice fluent interface that lets you chain several processing steps in one line of code. The following code implements variant #2:

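import scrapy

# Build a selector directly from the HTML snippet in the question.
selector = scrapy.Selector(text='''<ul>
    <li>This is <a href="#">a link</a></li>
    <li>This is <a href="#">another link</a>.</li>
</ul>''')

# Select each <li>, then evaluate normalize-space() relative to each one.
selector.css('ul > li').xpath('normalize-space()').extract()
# --> ['This is a link', 'This is another link.']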
