Difference between Scrapy selectors "a::text" and "a ::text"

Question

I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors a::text and a ::text (don't overlook the space between a and ::text in the latter). When I run my script, I get the same exact result no matter which selector I choose.

import requests
from scrapy import Selector

res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
# Wrap the downloaded page in a Scrapy selector
sel = Selector(res)
for item in sel.css(".product-list-product-wrapper"):
    # Same selector twice; the second one has a space before ::text
    title = item.css(".product-name a::text").extract_first().strip()
    title_ano = item.css(".product-name a ::text").extract_first().strip()
    print("Name: {}\nName_ano: {}\n".format(title, title_ano))

As you can see, both title and title_ano contain the same selector, bar the space in the latter. Nevertheless, the results are always the same.

My question: is there any substantial difference between the two, and when should I use the former and when the latter?

Recommended answer

Interesting observation! I spent the past couple of hours investigating this, and it turns out there's a lot more to it than meets the eye.

If you're coming from CSS, you'd probably expect to write a::text in much the same way you'd write a::first-line, a::first-letter, a::before or a::after. No surprises there.

On the other hand, standard selector syntax would suggest that a ::text matches the ::text pseudo-element of a descendant of the a element, making it equivalent to a *::text. However, .product-list-product-wrapper .product-name a doesn't have any child elements, so by rights, a ::text is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.

Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where ::text comes from. With that in mind, let's examine how Parsel implements ::text:

>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'

So, as in cssselect, anything that follows a descendant combinator is translated into a descendant-or-self axis. But because text nodes are proper children of element nodes in the DOM, ::text is treated as a standalone node and converted directly to text(). Combined with the descendant-or-self axis, that matches any text node that is a descendant of an a element, just as a/text() matches any text node that is a child of an a element (a child is also a descendant). That is why both selectors return the same result when the a element contains only text.

Egregiously, this happens even when you add an explicit * to the selector:

>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'

However, the use of the descendant-or-self axis means that a ::text can match all text nodes in the a element, including those in other elements nested within the a. In the following example, a ::text will match two text nodes: 'Link ' followed by 'text':

<a href="https://example.com">Link <span>text</span></a>

So while Scrapy's implementation of ::text is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.

In fact, Scrapy's other pseudo-element, ::attr() (see footnote 1), behaves similarly. The following selectors all match the id attribute node belonging to the div element when it does not have any descendant elements:

>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'

... but div ::attr(id) and div *::attr(id) will match all id attribute nodes within the div's descendants along with its own id attribute, such as in the following example:

<div id="parent"><p id="child"></p></div>

This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of ::text.

Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:

>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'

This correctly translates the descendant combinator to descendant-or-self::*/* with an additional implicit child axis, ensuring that the [@href] predicate is never tested on the a element.

If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:

  • Use a::text if your a element contains only text, or if you're only interested in the top-level text nodes of this a element and not its nested elements.

  • Use a ::text if your a element contains nested elements and you want to extract all the text nodes within this a element.

  • While you can use a ::text if your a element contains only text, its syntax is confusing, so for the sake of consistency, use a::text instead.

1 On an interesting note, ::attr() appears in the Non-element Selectors spec, where, as you'd expect, it behaves consistently with the Selectors grammar; that makes its behavior in Scrapy inconsistent with the spec. ::text, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.
