Scrapy - how to identify already scraped urls
Question
I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples on SgmlLinkExtractor?
Answer
You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
To use it, copy the code from the link into a file in your Scrapy project, then reference it by adding a line to your settings.py:
SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
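In case the snippet link no longer loads, here is a rough sketch of what such a spider middleware looks like. The class name IgnoreVisitedItems follows the snippet; the dict-based item check, the in-memory set, and the URL-hash id are simplifications of my own (the real snippet persists its state between runs, and real Scrapy items would be Item instances rather than plain dicts):

```python
import hashlib

class IgnoreVisitedItems:
    """Spider middleware sketch: drop items whose visit_id was seen before."""

    def __init__(self):
        # In-memory only for this sketch; persist this set (e.g. to disk)
        # if you want deduplication to survive across daily crawl runs.
        self.visited_ids = set()

    def _visit_id(self, item):
        # Derive a stable id for the item, here from its 'url' field.
        return hashlib.sha1(item['url'].encode('utf-8')).hexdigest()

    def process_spider_output(self, response, result, spider):
        for x in result:
            # Treat dicts with a 'url' key as items; Requests pass through.
            if isinstance(x, dict) and 'url' in x:
                vid = self._visit_id(x)
                if vid in self.visited_ids:
                    continue  # already scraped - drop the item
                self.visited_ids.add(vid)
                x['visit_id'] = vid
                x['visit_status'] = 'new'
            yield x
```

The key design point is that deduplication happens on the item's identity (its visit_id), not on the request URL, so re-crawling an index page still works while previously emitted items are filtered out.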
The specifics of why you pick that particular number can be read here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
Finally, you'll need to modify your items.py so that each item class has the following fields:
visit_id = Field()
visit_status = Field()
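For example, a complete item class with those two fields added might look like this (NewsItem and its url/title fields are placeholders for your own item definition; only visit_id and visit_status come from the answer):

```python
from scrapy.item import Item, Field

class NewsItem(Item):
    url = Field()
    title = Field()
    # Fields required by the ignore-visited middleware:
    visit_id = Field()      # unique id the middleware assigns to the item
    visit_status = Field()  # e.g. 'new' when the item is seen the first time
```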
And I think that's it. The next time you run your spider, it should automatically start avoiding sites it has already scraped.
Good luck!