Difference between Scrapy selectors "a::text" and "a ::text"
Question
I've created a scraper to grab some product names from a webpage. It is working smoothly. I've used CSS selectors to do the job. However, the only thing I can't understand is the difference between the selectors `a::text` and `a ::text` (don't overlook the space between `a` and `::text` in the latter). When I run my script, I get exactly the same result no matter which selector I choose.
```python
import requests
from scrapy import Selector

res = requests.get("https://www.kipling.com/uk-en/sale/type/all-sale/?limit=all#")
# Build the selector from the response body text.
sel = Selector(text=res.text)

for item in sel.css(".product-list-product-wrapper"):
    # The two selectors differ only by the space before ::text.
    title = item.css(".product-name a::text").extract_first().strip()
    title_ano = item.css(".product-name a ::text").extract_first().strip()
    print("Name: {}\nName_ano: {}\n".format(title, title_ano))
```
As you can see, both `title` and `title_ano` use the same selector, bar the space in the latter. Nevertheless, the results are always the same.
My question: is there any substantial difference between the two, and when should I use the former and when the latter?
Recommended answer
Interesting observation! I spent the past couple of hours investigating this, and it turns out there's a lot more to it than meets the eye.
If you're coming from CSS, you'd probably expect to write `a::text` in much the same way you'd write `a::first-line`, `a::first-letter`, `a::before` or `a::after`. No surprises there.
On the other hand, standard selector syntax would suggest that `a ::text` matches the `::text` pseudo-element of a descendant of the `a` element, making it equivalent to `a *::text`. However, `.product-list-product-wrapper .product-name a` doesn't have any child elements, so by rights, `a ::text` is supposed to match nothing. The fact that it does match suggests that Scrapy is not following the grammar.
Scrapy uses Parsel (itself based on cssselect) to translate selectors into XPath, which is where `::text` comes from. With that in mind, let's examine how Parsel implements `::text`:
```python
>>> from parsel import css2xpath
>>> css2xpath('a::text')
'descendant-or-self::a/text()'
>>> css2xpath('a ::text')
'descendant-or-self::a/descendant-or-self::text()'
```
So, as in cssselect, anything that follows a descendant combinator is translated into a `descendant-or-self` axis. But because text nodes are proper children of element nodes in the DOM, `::text` is treated as a standalone node and converted directly to `text()`. Combined with the `descendant-or-self` axis, it matches any text node that is a descendant of an `a` element, just as `a/text()` matches any text node child of an `a` element (a child is also a descendant).
Egregiously, this happens even when you add an explicit `*` to the selector:
```python
>>> css2xpath('a *::text')
'descendant-or-self::a/descendant-or-self::text()'
```
However, the use of the `descendant-or-self` axis means that `a ::text` can match all text nodes in the `a` element, including those in other elements nested within the `a`. In the following example, `a ::text` will match two text nodes: `'Link '` followed by `'text'`:
```html
<a href="https://example.com">Link <span>text</span></a>
```
So while Scrapy's implementation of `::text` is an egregious violation of the Selectors grammar, it seems to have been done this way very much intentionally.
In fact, Scrapy's other pseudo-element, `::attr()` [1], behaves similarly. The following selectors all match the `id` attribute node belonging to the `div` element when it does not have any descendant elements:
```python
>>> css2xpath('div::attr(id)')
'descendant-or-self::div/@id'
>>> css2xpath('div ::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
>>> css2xpath('div *::attr(id)')
'descendant-or-self::div/descendant-or-self::*/@id'
```
... but `div ::attr(id)` and `div *::attr(id)` will match all `id` attribute nodes within the `div`'s descendants along with its own `id` attribute, such as in the following example:
```html
<div id="parent"><p id="child"></p></div>
```
This, of course, is a much less plausible use case, so one has to wonder if this was an unintentional side effect of the implementation of `::text`.
Compare the pseudo-element selectors to one that substitutes any simple selector for the pseudo-element:
```python
>>> css2xpath('a [href]')
'descendant-or-self::a/descendant-or-self::*/*[@href]'
```
This correctly translates the descendant combinator to `descendant-or-self::*/*` with an additional implicit `child` axis, ensuring that the `[@href]` predicate is never tested on the `a` element.
If you're new to XPath, Selectors, or even Scrapy, this may all seem very confusing and overwhelming. So here's a summary of when to use one selector over the other:
- Use `a::text` if your `a` element contains only text, or if you're only interested in the top-level text nodes of this `a` element and not its nested elements.
- Use `a ::text` if your `a` element contains nested elements and you want to extract all the text nodes within this `a` element.
- While you can use `a ::text` if your `a` element contains only text, its syntax is confusing, so for the sake of consistency, use `a::text` instead.
[1] On an interesting note, `::attr()` appears in the Non-element Selectors spec, where, as you'd expect, it behaves consistently with the Selectors grammar, making its behavior in Scrapy inconsistent with the spec. `::text`, on the other hand, is conspicuously missing from the spec; based on this answer, I think you can make a reasonable guess as to why.