Do scrapy LinkExtractors end up with unique links?

Question

So, I have a page with a lot of articles and page numbers. Now, if I want to extract an article, I use:

Rule(LinkExtractor(allow=[r'article\/.+\.html']), callback='parse_article')

For pages I use this Rule: Rule(LinkExtractor(allow='page=\d+'))

So I end up with these rules:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    Rule(LinkExtractor(allow=r'page=\d+')),  # follow pagination links
    Rule(LinkExtractor(allow=[r'article\/.+\.html']), callback='parse_article'),  # parse article pages
]

My question is, will I get repeated pages? As in, will it extract page 3 from pages 1, 2, 4, 5 and 6 (until page 3 is no longer visible) and add it to the extracted link list? Or does it only keep unique URLs at the end?

Answer

By default, LinkExtractor should only return unique links. There is an optional parameter, unique, which is True by default.
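
To illustrate, here is a small self-contained sketch of that per-response deduplication (the HTML body and example.com URLs are made up for the demo, not taken from the question): the same article link appears twice in one response, but extract_links returns it only once.

# Standalone sketch (hypothetical HTML body and URLs) of per-response deduplication.
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

body = b"""
<html><body>
  <a href="/article/foo.html">Foo</a>
  <a href="/article/foo.html">Foo again</a>
  <a href="/article/bar.html">Bar</a>
</body></html>
"""
response = HtmlResponse(url="http://example.com/?page=1", body=body, encoding="utf-8")

extractor = LinkExtractor(allow=[r'article\/.+\.html'], unique=True)  # unique=True is the default
print([link.url for link in extractor.extract_links(response)])
# ['http://example.com/article/foo.html', 'http://example.com/article/bar.html']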

But that only ensures the links extracted from each page are unique. If the same link occurs on a later page, it will be extracted again.

By default, your spider should automatically ensure it doesn't visit the same URLs again, according to the DUPEFILTER_CLASS setting. The only caveat to this is if you stop and start your spider again, the record of visited URLs is reset. Look at "Jobs: pausing and resuming crawls" in the documentation for how to persist information when you pause and resume a spider.
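
As a rough settings sketch (the setting names come from the Scrapy docs; the JOBDIR value is just an example path), the two pieces involved look like this:

# settings.py (sketch; the JOBDIR value is an arbitrary example path)

# Scrapy's default duplicate filter: requests whose fingerprints have already
# been seen during the crawl are dropped before they are downloaded.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# Without JOBDIR the dupefilter's record of seen requests lives only in memory
# and is lost when the spider stops. Setting JOBDIR persists it (together with
# the scheduler queue) to disk, so a paused crawl can resume without
# revisiting URLs.
JOBDIR = 'crawls/articles-1'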
