Behavior of the scrapy xpath selector on h1-h6 tags
Problem description
Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an h tag in the second. Is this because the h1 tag has a special "meaning" in HTML? I tried h1 through h6 and all of them give [] as output, while starting with h7 the output becomes [u'xxx'].
from scrapy import Selector # scrapy version: 1.2.2
text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []
text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']
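Including p tags inside h# (h1 through h6) is invalid according to the W3C. You can see more about this here. The HTML parser behind Selector apparently repairs the invalid markup instead of keeping the p nested inside the heading, which is why //h1/p matches nothing, whereas tags it does not recognize, such as h or h7, are left as written.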
Anyway, to bypass this and work with any XML structure, you can change the type like this:
sel = Selector(text="anyxml", type="xml")
This will respect any XML structure, because the input is no longer treated as HTML.