Passing a request to a different spider


Question


I'm working on a web crawler (using scrapy) that uses 2 different spiders:

  1. A very generic spider that can crawl (almost) any website, using a set of heuristics to extract data.
  2. A specialized spider for one particular website, A, which cannot be crawled with the generic spider because of its peculiar structure (that website has to be crawled).


Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is there a Scrapy way to pass the request to spider 1?

Solutions I've thought of:

  1. Move all the functionality into spider 1. But its code is already very long and complicated, and I'd like to keep this functionality separate if possible.
  2. Save the links in a database, as suggested in Pass URL from one spider to another.
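The second option above (saving links to a database and handing them to the other spider) could look roughly like the following sketch. The table and function names here are made up for illustration, and SQLite stands in for whatever database you actually use: spider A enqueues every "ordinary" link it finds, and the generic spider drains the queue to build its start URLs.

```python
import sqlite3

# Hypothetical SQLite-backed URL queue shared between the two spiders.
def init_queue(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_urls ("
        "url TEXT PRIMARY KEY, claimed INTEGER DEFAULT 0)"
    )

def enqueue(conn, url):
    # Spider A calls this for every "ordinary" link it discovers;
    # the PRIMARY KEY silently drops duplicates.
    conn.execute("INSERT OR IGNORE INTO pending_urls (url) VALUES (?)", (url,))

def drain(conn):
    # The generic spider fetches all unclaimed URLs (in insertion
    # order) and marks them claimed so they are handed out only once.
    rows = conn.execute(
        "SELECT url FROM pending_urls WHERE claimed = 0 ORDER BY rowid"
    ).fetchall()
    conn.execute("UPDATE pending_urls SET claimed = 1")
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
init_queue(conn)
enqueue(conn, "http://example.com/page1")
enqueue(conn, "http://example.com/page1")  # duplicate is ignored
enqueue(conn, "http://example.com/page2")
print(drain(conn))  # → ['http://example.com/page1', 'http://example.com/page2']
print(drain(conn))  # → []
```

You would then run the generic spider a second time (or on a schedule) with `drain()`'s result as its start URLs; the claimed flag keeps the two runs from overlapping.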

Is there a better way?

Answer


I have met such a case: one spider retrieving URL addresses on a first page, and a second one being called from there to operate on them.
I don't know what your control flow is, but depending on it, I would simply call the first spider just in time when scraping a new URL, or after scraping all possible URLs.
Do you have the case where spider 2 can retrieve URLs for the very same website? In that case, I would store all URLs, sort them into per-spider lists in a dict, and roll through this again until there are no new elements left in the lists to explore. In my opinion that makes it better, because it is more flexible.
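The "store all URLs per spider and roll until nothing new" idea above can be sketched with plain functions standing in for the real spiders. Everything here (`crawl_until_done`, the spider names, the fake link graph) is illustrative, not Scrapy API:

```python
from collections import defaultdict

def crawl_until_done(seeds, extract_links):
    """Keep a dict of per-spider URL sets and loop until no spider has
    anything new to explore. extract_links(spider, url) stands in for a
    real spider's parse step and returns (next_spider, next_url) pairs."""
    pending = defaultdict(set)          # frontier, one set per spider
    seen = set()                        # URLs already handled
    for spider, url in seeds:
        pending[spider].add(url)
    order = []                          # (spider, url) in crawl order
    while any(pending.values()):
        for spider in list(pending):    # snapshot: callbacks may add keys
            batch, pending[spider] = pending[spider], set()
            for url in sorted(batch):
                if url in seen:
                    continue
                seen.add(url)
                order.append((spider, url))
                for nxt_spider, nxt_url in extract_links(spider, url):
                    if nxt_url not in seen:
                        pending[nxt_spider].add(nxt_url)
    return order

# Tiny fake link graph: site A links out to two "ordinary" pages.
links = {
    ("site_a", "http://a.example/start"):
        [("generic", "http://x.example/"), ("site_a", "http://a.example/2")],
    ("site_a", "http://a.example/2"): [("generic", "http://y.example/")],
}
print(crawl_until_done([("site_a", "http://a.example/start")],
                       lambda s, u: links.get((s, u), [])))
```

The `seen` set is what stops the loop: once neither spider's list gains a new element, the crawl is done.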


Calling just in time might be fine, but depending on your flow it could hurt performance, since multiple calls to the same functions will probably waste a lot of time initializing things.


You might also want to make the analytical functions independent of the spiders, so that they are available to both as you see fit. If your code is very long and complicated, this might help make it lighter and clearer. I know it is not always avoidable, but it might be worth a try, and you may end up more efficient at the code level.
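As a rough illustration of keeping analysis helpers outside the spider classes: put the heuristics in a module of plain functions that either spider's parse callback can call on the response text. The function names and regex patterns below are made up, not part of Scrapy:

```python
import re

# Hypothetical shared helpers module, importable by both spiders.
def extract_emails(text):
    """Heuristic email extractor usable from either spider's callback."""
    return sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)))

def extract_prices(text):
    """Naive dollar-price matcher, again shared by both spiders."""
    return [float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", text)]

print(extract_emails("write to ann@example.com or ann@example.com"))
# → ['ann@example.com']
print(extract_prices("sale: $9.99, was $12"))  # → [9.99, 12.0]
```

Because these take plain strings rather than a spider instance, testing them no longer requires spinning up a crawl at all.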

