Behavior of the scrapy xpath selector on h1-h6 tags
Question
Why do the following two code snippets give different outputs? The only difference between them is that the h1 tag in the first case is replaced with an h tag in the second. Is this because the h1 tag has a special "meaning" in HTML? I tried h1 through h6, and all of them give [] as output, while h7 gives [u'xxx'].
from scrapy import Selector # scrapy version: 1.2.2
text = '<h1><p>xxx</p></h1>'
print Selector(text=text).xpath('//h1/p/text()').extract()
Output[1]: []
text = '<h><p>xxx</p></h>'
print Selector(text=text).xpath('//h/p/text()').extract()
Output[2]: [u'xxx']
Recommended answer
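Putting p tags inside h1-h6 tags is invalid HTML according to the W3C specification: heading elements may contain only phrasing (inline) content, and p is not phrasing content. In its default HTML mode, Selector parses the text with a lenient HTML parser that repairs such markup, which is most likely why p no longer ends up as a child of h1 and the query returns []. h and h7 are not HTML elements, so no nesting rule applies to them and the structure is kept as written.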
Anyway, to bypass this and work with any XML structure, you can change the type like this:
sel = Selector(text="anyxml", type="xml")
This will respect any xml structure.