Scrapy - how to identify already scraped URLs


Problem description

I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping URLs that have already been scraped? Also, is there any clear documentation or examples on SgmlLinkExtractor?

Recommended answer

You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
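
In case that link ever goes away: the snippet implements a spider middleware that remembers which pages have already yielded items and filters them out on later runs. What follows is a minimal sketch of that idea, not the exact snippet; the pickle-based persistence and the visited.pkl file name are assumptions for illustration:

import os
import pickle

from scrapy.http import Request
from scrapy.item import BaseItem  # on newer Scrapy versions, use scrapy.Item
from scrapy.utils.request import request_fingerprint

class IgnoreVisitedItems(object):
    """Spider middleware that drops requests for pages already
    scraped in an earlier run, and tags freshly scraped items."""

    STATE_FILE = 'visited.pkl'  # assumed location for persisted fingerprints

    def __init__(self):
        # Load fingerprints of pages scraped by previous runs, if any.
        if os.path.exists(self.STATE_FILE):
            with open(self.STATE_FILE, 'rb') as f:
                self.visited = pickle.load(f)
        else:
            self.visited = set()

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                # Drop follow-up requests we have already scraped.
                if request_fingerprint(x) in self.visited:
                    continue
            elif isinstance(x, BaseItem):
                # Remember this page and tag the item; this is why the
                # visit_id/visit_status fields are added to items.py below.
                fp = request_fingerprint(response.request)
                self.visited.add(fp)
                x['visit_id'] = fp
                x['visit_status'] = 'new'
            yield x
        # Persist the updated set for the next run.
        with open(self.STATE_FILE, 'wb') as f:
            pickle.dump(self.visited, f)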

To use it, copy the code from the link into a file in your Scrapy project (for example project/middlewares/ignore.py, matching the import path below), then enable it by adding this line to your settings.py:

SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }

The specifics of why you pick the number you do (the middleware's ordering value) can be read up on here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html. The same ordering rules apply to the spider middlewares used here: lower numbers run closer to the engine, higher numbers closer to the spider.

Finally, you'll need to modify your items.py so that each item class has the following fields:

visit_id = Field()
visit_status = Field()
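
For context, a complete item definition with those fields might look like this; NewsItem and its title/url fields are placeholders, and only visit_id and visit_status are actually required by the middleware:

from scrapy.item import Item, Field

class NewsItem(Item):
    # your existing fields, for example:
    title = Field()
    url = Field()
    # fields required by the visited-items middleware:
    visit_id = Field()
    visit_status = Field()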

And I think that's it. The next time you run your spider, it should automatically skip the pages it has already visited.
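
As for the second half of the question: SgmlLinkExtractor is typically used through CrawlSpider rules. A minimal sketch, assuming a hypothetical domain and URL patterns:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NewsSpider(CrawlSpider):
    name = 'news'
    allowed_domains = ['example.com']     # placeholder domain
    start_urls = ['http://example.com/']

    rules = (
        # Follow section/pagination links without parsing them as articles.
        Rule(SgmlLinkExtractor(allow=r'/page/\d+')),
        # Parse pages that look like articles with parse_item.
        Rule(SgmlLinkExtractor(allow=r'/news/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        # Extract and return your item fields here.
        pass

Note that on Scrapy 1.0 and later, SgmlLinkExtractor is deprecated in favor of scrapy.linkextractors.LinkExtractor, which takes the same allow/deny style arguments.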

Good luck!
