Scrapy - how to identify already scraped urls
Question
I'm using Scrapy to crawl a news website on a daily basis. How do I restrict Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples on SgmlLinkExtractor?
Answer
You can actually do this quite easily with the Scrapy snippet located here: http://snipplr.com/view/67018/middleware-to-avoid-revisiting-already-visited-items/
To use it, copy the code from the link into a file in your Scrapy project, then reference it by adding a line to your settings.py:
SPIDER_MIDDLEWARES = { 'project.middlewares.ignore.IgnoreVisitedItems': 560 }
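In case the snippet link no longer loads, here is a rough sketch of what such a spider middleware looks like. The class name IgnoreVisitedItems follows the snippet; the dict-based item check, the in-memory set, and the URL-hash id are simplifications of my own (the real snippet persists its state between runs, and real Scrapy items would be Item instances rather than plain dicts):

```python
import hashlib

class IgnoreVisitedItems:
    """Spider middleware sketch: drop items whose visit_id was seen before."""

    def __init__(self):
        # In-memory only for this sketch; persist this set (e.g. to disk)
        # if you want deduplication to survive across daily crawl runs.
        self.visited_ids = set()

    def _visit_id(self, item):
        # Derive a stable id for the item, here from its 'url' field.
        return hashlib.sha1(item['url'].encode('utf-8')).hexdigest()

    def process_spider_output(self, response, result, spider):
        for x in result:
            # Treat dicts with a 'url' key as items; Requests pass through.
            if isinstance(x, dict) and 'url' in x:
                vid = self._visit_id(x)
                if vid in self.visited_ids:
                    continue  # already scraped - drop the item
                self.visited_ids.add(vid)
                x['visit_id'] = vid
                x['visit_status'] = 'new'
            yield x
```

The key design point is that deduplication happens on the item's identity (its visit_id), not on the request URL, so re-crawling an index page still works while previously emitted items are filtered out.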
The specifics of why you pick that particular number can be read here: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
Finally, you'll need to modify your items.py so that each item class has the following fields:
visit_id = Field()
visit_status = Field()
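For example, a complete item class with those two fields added might look like this (NewsItem and its url/title fields are placeholders for your own item definition; only visit_id and visit_status come from the answer):

```python
from scrapy.item import Item, Field

class NewsItem(Item):
    url = Field()
    title = Field()
    # Fields required by the ignore-visited middleware:
    visit_id = Field()      # unique id the middleware assigns to the item
    visit_status = Field()  # e.g. 'new' when the item is seen the first time
```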
And I think that's it. The next time you run your spider, it should automatically start avoiding sites it has already scraped.
Good luck!