Returning Items in scrapy's start_requests()
Question
I am writing a scrapy spider that takes many URLs as input and classifies them into categories (returned as items). These URLs are fed to the spider via my crawler's start_requests() method.
Some URLs can be classified without downloading them, so I would like to yield an Item for them directly in start_requests(), which is forbidden by scrapy. How can I circumvent this?
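To illustrate what "classified without downloading" might look like, here is a minimal sketch of a URL-only classifier; the category rules and the function name are made up for illustration:

```python
from urllib.parse import urlparse

def classify_without_download(url):
    """Return a category for URLs that can be classified from the URL
    alone, or None if the page would have to be downloaded first.
    (The rules below are hypothetical examples.)"""
    parsed = urlparse(url)
    if parsed.path.lower().endswith(".pdf"):
        return "document"
    if parsed.netloc.endswith("youtube.com"):
        return "video"
    return None  # cannot decide from the URL alone

print(classify_without_download("https://example.com/paper.pdf"))  # document
print(classify_without_download("https://example.com/page.html"))  # None
```

URLs for which this returns a category are exactly the ones the question would like to turn into items without issuing a request.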
I have thought about catching these requests in a custom middleware that would turn them into spurious Response objects, which I could then convert into Item objects in the request callback, but any cleaner solution would be welcome.
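The interception idea can be sketched with plain stand-in classes in place of scrapy's Request/Response (in real scrapy, a downloader middleware whose process_request returns a Response short-circuits the download; everything below is a simplified mock, not scrapy's actual API):

```python
class FakeRequest:
    """Stand-in for scrapy.Request."""
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}

class FakeResponse:
    """Stand-in for a spurious scrapy Response built without downloading."""
    def __init__(self, url, meta):
        self.url, self.meta = url, meta

class ShortCircuitMiddleware:
    """Returning a response from process_request skips the download."""
    def process_request(self, request):
        if request.meta.get("precomputed_category"):
            # Build a spurious response so the normal callback still runs.
            return FakeResponse(request.url, request.meta)
        return None  # let the download proceed as usual

def parse(response):
    # The callback turns the spurious response into an "item" (a dict here).
    return {"url": response.url,
            "category": response.meta["precomputed_category"]}

mw = ShortCircuitMiddleware()
req = FakeRequest("https://example.com/paper.pdf", parse,
                  meta={"precomputed_category": "document"})
resp = mw.process_request(req)
if resp is not None:
    item = req.callback(resp)
    print(item)  # {'url': 'https://example.com/paper.pdf', 'category': 'document'}
```

This keeps the item-producing logic in an ordinary callback, at the cost of an extra middleware and fake responses, which is exactly the clunkiness the question hopes to avoid.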
Recommended answer
I think using a spider middleware and overriding start_requests() would be a good start.
In your middleware, you should loop over all URLs in start_urls, and you could use conditional statements to deal with the different types of URLs.
- For special URLs that do not require a request, you can call your pipeline's process_item() directly; don't forget to import your pipeline and create a scrapy.Item from the URL for this.
- As you mentioned, pass the URL as meta in a Request, and have a separate parse function that only returns the URL.
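Putting the two suggestions together, here is a runnable sketch using plain-Python stand-ins for scrapy's Item, Request, and pipeline (names like UrlItem and CategoryPipeline, and the .pdf rule, are made up for illustration; in a real spider these would be your scrapy classes):

```python
class UrlItem(dict):
    """Stand-in for a scrapy.Item holding a url and its category."""

class CategoryPipeline:
    """Stand-in pipeline collecting processed items."""
    def __init__(self):
        self.items = []
    def process_item(self, item, spider=None):
        self.items.append(item)
        return item

class FakeRequest:
    """Stand-in for scrapy.Request carrying the url in its meta."""
    def __init__(self, url, meta):
        self.url, self.meta = url, meta

pipeline = CategoryPipeline()

def classify(url):
    # Hypothetical rule: .pdf URLs can be classified without downloading.
    return "document" if url.endswith(".pdf") else None

def start_requests(start_urls):
    for url in start_urls:
        category = classify(url)
        if category is not None:
            # Special URL: build an item and feed the pipeline directly,
            # bypassing the request/response cycle entirely.
            pipeline.process_item(UrlItem(url=url, category=category))
        else:
            # Normal URL: pass it along in meta; a separate parse
            # callback would then return it as usual.
            yield FakeRequest(url, meta={"url": url})

requests = list(start_requests(["https://a.com/x.pdf", "https://b.com/page"]))
print(len(pipeline.items), len(requests))  # 1 1
```

The key point of the answer is that the conditional lives in start_requests(): classifiable URLs never become requests at all, while the rest flow through scrapy's normal download-and-parse path.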