Do scrapy LinkExtractors end up with unique links?
Question
So, I have a page with a lot of articles and page numbers. Now, if I want to extract an article I use:
Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article')
For pages I use this Rule: Rule(LinkExtractor(allow=r'page=\d+'))
So I end up with these rules:
rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),
    Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
]
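For reference, a minimal CrawlSpider these rules could sit in might look like this (the spider name, start URL, and the body of parse_article are placeholders):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ArticleSpider(CrawlSpider):
    name = 'articles'                      # placeholder name
    start_urls = ['http://example.com/']   # placeholder start page

    rules = [
        # Pagination links: no callback, they are only followed.
        Rule(LinkExtractor(allow=r'page=\d+')),
        # Article links: handed to parse_article.
        Rule(LinkExtractor(allow=[r'article/.+\.html']), callback='parse_article'),
    ]

    def parse_article(self, response):
        # Placeholder extraction logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}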
My question is: will I get repeated pages? That is, will it extract page 3 from pages 1, 2, 4, 5, 6 (until page 3 is no longer visible) and add it to the extracted link list, or does it only keep unique URLs in the end?
Answer
By default, LinkExtractor should only return unique links. There is an optional parameter, unique, which is True by default.
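A quick way to see what unique does (the HTML and URL below are made up): extract_links() collapses duplicate links found within a single response.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A made-up listing page that links to page=3 twice.
body = b"""
<html><body>
  <a href="?page=3">3</a>
  <a href="?page=3">3 (again)</a>
  <a href="?page=4">4</a>
</body></html>
"""
response = HtmlResponse(url='http://example.com/articles?page=2',
                        body=body, encoding='utf-8')

extractor = LinkExtractor(allow=r'page=\d+', unique=True)  # unique=True is the default
# The duplicate page=3 anchor is dropped, so this prints page=3 and page=4 once each.
print([link.url for link in extractor.extract_links(response)])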
But that only ensures the links extracted from each page are unique. If the same link occurs on a later page, it will be extracted again.
By default, though, your spider should automatically ensure it doesn't visit the same URLs again, according to the DUPEFILTER_CLASS setting. The only caveat is that if you stop and restart your spider, the record of visited URLs is reset. See "Jobs: pausing and resuming crawls" in the documentation for how to persist that information when you pause and resume a spider.