Returning Items in scrapy's start_requests()

Question

I am writing a Scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.
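
For context, a minimal sketch of that setup, where the url_list argument and the fixed category string are hypothetical stand-ins:

```python
import scrapy

class CategorySpider(scrapy.Spider):
    # Sketch of the spider described above; url_list is a
    # hypothetical constructor argument holding the input URLs.
    name = "category"

    def __init__(self, url_list=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url_list = url_list or []

    def start_requests(self):
        # Feed every input URL to the downloader.
        for url in self.url_list:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Classify the downloaded page; a real spider would inspect
        # response.body here instead of using a fixed label.
        yield {"url": response.url, "category": "needs-download"}
```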

Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which Scrapy forbids. How can I circumvent this?

I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
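
For reference, a sketch of that workaround: Scrapy's downloader middleware contract lets process_request() return a Response object, which skips the download and hands that response straight to the request's callback. The skip_download meta key here is a made-up convention:

```python
from scrapy.http import HtmlResponse

class SpuriousResponseMiddleware:
    # Downloader middleware sketch: for requests flagged with the
    # hypothetical "skip_download" meta key, return a stub Response
    # so the URL is never actually fetched.
    def process_request(self, request, spider):
        if request.meta.get("skip_download"):
            # Returning a Response here short-circuits the download;
            # the request's callback receives this stub instead.
            return HtmlResponse(url=request.url, body=b"",
                                encoding="utf-8", request=request)
        return None  # anything else is downloaded as usual
```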

Answer

I think using a spider middleware and overriding start_requests() would be a good start.
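
A minimal skeleton of that idea, assuming a hypothetical needs_download() predicate standing in for your own classification test; the middleware would be enabled through the SPIDER_MIDDLEWARES setting:

```python
class StartRequestsMiddleware:
    # Spider middleware skeleton built around the
    # process_start_requests() hook, which sees every start request
    # before it is scheduled.
    def process_start_requests(self, start_requests, spider):
        for request in start_requests:
            if self.needs_download(request.url):
                # Normal crawl path: let the request be downloaded.
                yield request
            else:
                # Handle the URL without downloading it; see the
                # two options discussed below.
                pass

    def needs_download(self, url):
        # Hypothetical placeholder condition; replace with your
        # real classification test.
        return not url.endswith(".jpg")
```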

In your middleware, loop over all the URLs in start_urls and use conditional statements to handle the different types of URLs:

  • For the special URLs that need no request, you can
    • directly call your pipeline's process_item() (don't forget to import your pipeline and build a scrapy.Item from the URL for it), or
    • as you mentioned, pass the URL as meta in a Request, and have a separate parse function that would only return the item (both options are sketched after this list)

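Hedged sketches of both options, where MyItem, MyPipeline, and the "precategorized" label are hypothetical stand-ins for your own code:

```python
import scrapy

class MyItem(scrapy.Item):
    # Hypothetical item with the fields used below.
    url = scrapy.Field()
    category = scrapy.Field()

class MyPipeline:
    # Stand-in for your real pipeline class.
    def process_item(self, item, spider):
        return item

# Option 1: build the item straight from the URL and feed it to an
# imported pipeline instance, bypassing the crawl entirely.
_pipeline = MyPipeline()

def classify_without_download(url, spider):
    item = MyItem(url=url, category="precategorized")
    return _pipeline.process_item(item, spider)

# Option 2: still yield a Request, but carry the URL in its meta and
# point it at a trivial callback that only builds the item.
def parse_precategorized(response):
    yield MyItem(url=response.meta["original_url"],
                 category="precategorized")

def request_with_meta(url):
    return scrapy.Request(url, meta={"original_url": url},
                          callback=parse_precategorized)
```

Note that option 1 instantiates the pipeline by hand, so the item bypasses any other pipelines configured in ITEM_PIPELINES.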