Scrapy/Python getting items from yield requests

Problem description

I am trying to request multiple pages and store a returned variable from the callback into a list that will be used later in a future request.

def parse1(self, response):
    items.append(1)   # intended: collect a result here for later use

def parse2(self, response):
    items = []
    urls = ['https://www.example1.com', 'https://www.example2.com']
    for url in urls:
        yield Request(
            url,
            callback=self.parse1,
            dont_filter=True
        )
    print(items)  # intended: the collected values should be available here

How can this be achieved?

Meta doesn't help: it passes values into a request, not out of it, and I want to collect values from a loop of requests.

Recommended answer

This is quite possibly the most often encountered issue for newcomers to Scrapy or async programming in general. (So I'll try for a more comprehensive answer.)

What you're trying to do looks like this:

Response -> Response -> Response
   | <-----------------------'
   |                \-> Response
   | <-----------------------'
   |                \-> Response
   | <-----------------------'
aggregating         \-> Response
   V 
  Data out 

When what you really have to do in async programming is this chaining of your responses / callbacks:

Response -> Response -> Response -> Response ::> Data out to ItemPipeline (Exporters)
        \-> Response -> Response -> Response ::> Data out to ItemPipeline
                    \-> Response -> Response ::> Data out to ItemPipeline
                     \> Response ::> Error

So what's needed is a paradigm shift in thinking on how to aggregate your data.

Think of the code flow as a timeline; you can't go back in time - or return a result back in time - only forward. You can only get the promise of some future work to be done, at the time you schedule it.
So the clever way is to forward yourself the data that you'll be needing at that future point in time.

The major problem I think is that this feels and looks awkward in Python, whereas it looks much more natural in languages like JavaScript, while it's essentially the same.

And that may be even more so the case in Scrapy, because it tries to hide this complexity of Twisted's deferreds from users.

But you should see some similarities in the following representations:

  • Random JS example:

new Promise(function(resolve, reject) { // code flow
  setTimeout(() => resolve(1), 1000);   //  |
}).then(function(result) {              //  v
  alert(result);                        //  |
  return result * 2;                    //  |
}).then(function(result) {              //  |
  alert(result);                        //  |
  return result * 2;                    //  v
});

  • Style of Twisted Deferreds:


    [Image: visual explanation of Twisted Deferred chaining. Source: https://twistedmatrix.com/documents/16.2.0/core/howto/defer.html#visual-explanation]

    Style in Scrapy Spider callbacks:

    scrapy.Request(url,
                   callback=self.parse, # > go to next response callback
                   errback=self.erred)  # > go to custom error callback
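
    For completeness, here is a minimal sketch of how such a callback/errback pair might sit inside a Spider (the spider name, URL, and the parse_next / erred method names are illustrative, not Scrapy-provided):

    import scrapy

    class ChainSpider(scrapy.Spider):
        name = 'chain'
        start_urls = ['https://example.com']

        def parse(self, response):
            # schedule the next step of the chain
            yield scrapy.Request(response.urljoin('/next'),
                                 callback=self.parse_next,  # > go to next response callback
                                 errback=self.erred)        # > go to custom error callback

        def parse_next(self, response):
            # the chained callback receives the next Response
            yield {'url': response.url, 'status': response.status}

        def erred(self, failure):
            # the errback receives a twisted Failure describing what went wrong
            self.logger.error('Request failed: %s', repr(failure))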
    

  • So, what does that leave us with in Scrapy?

    Pass your data along as you go, don't hoard it ;)
    This should be sufficient in almost every case, except where you have no choice but to merge Item information from multiple pages, but where those Requests can't be serialized into the following schema (more on that later).

    ->- flow of data ---->---------------------->
    Response -> Response
               `-> Data -> Req/Response 
                   Data    `-> MoreData -> Yield Item to ItemPipeline (Exporters)
                   Data -> Req/Response
                           `-> MoreData -> Yield Item to ItemPipeline
     1. Gen      2. Gen        3. Gen
    

    How you implement this model in code will depend on your use-case.

    Scrapy provides the meta field in Requests/Responses for slugging along data. Despite the name it's not really 'meta', but rather quite essential. Don't avoid it, get used to it.

    Doing that might seem counterintuitive, heaping along and duplicating all that data into potentially thousands of newly spawned requests; but because of the way Scrapy handles references, it's not actually bad, and old objects get cleaned up early by Scrapy. In the above ASCII art, by the time your 2nd-generation requests are all queued up, the 1st-generation responses will already have been freed from memory by Scrapy, and so on. So this isn't really the memory bloat one might think it is, if used correctly (and you're not handling lots of big files).

    元"的另一种可能性是实例变量(全局数据),用于在某些 self.data 中存储内容对象或其他对象,并将来从您的下一个响应回调中访问它.(从来没有在旧的,因为那个时候它还不存在.)执行此操作时,请始终记住它是全局共享数据;可能有并行"回调查看它.

    Another possibility to 'meta' are instance variables (global data), to store stuff in some self.data object or other, and access it in the future from your next response callback. (Never in the old one, since at that time it did not exist yet.) When doing this, remember always that it's global shared data, of course; which might have "parallel" callbacks looking at it.

    And finally, sometimes one might even use external sources, like Redis queues or sockets, to communicate data between the Spider and a datastore (for example to pre-fill the start_urls).

    How does this look in code?

    You can write "recursive" parse methods (actually just funnel all responses through the same callback method):

    def parse(self, response):
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url)) # will "recurse" back to parse()

        if 'some_data' in response.text:
            yield { # the simplest item is a dict
                'statuscode': response.status,
                'data': response.body,
            }
    

    or you can split between multiple parse methods, each handling a specific type of page/Response:

    def parse(self, response):
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            request = scrapy.Request(response.urljoin(next_page_url))
            request.callback = self.parse2 # will go to parse2()
            request.meta['data'] = {'whatever': 'we already know'} # pass a dict so parse2() can extend it
            yield request

    def parse2(self, response):
        data = response.meta.get('data')
        # add some more data
        data['more_data'] = response.xpath('//whatever/we/@found').extract()
        # yield some more requests
        for url in data['found_links']: # assumes 'found_links' ended up in data at some earlier step
            request = scrapy.Request(url, callback=self.parse3)
            request.meta['data'] = data # and keep on passing it along
            yield request

    def parse3(self, response):
        data = response.meta.get('data')
        # ...workworkwork...
        # finally, drop stuff to the item-pipelines
        yield data
    

    Or you can even combine it like this:

    def parse(self, response):
        data = response.meta.get('data', None)
        if not data: # we are on our first request
            next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page_url:
                request = scrapy.Request(response.urljoin(next_page_url))
                request.callback = self.parse # will "recurse" back to parse()
                request.meta['data'] = {'whatever': 'we already know'} # again, a dict we can extend later
                yield request
            return # stop here
        # else: we already got data, continue with something else
        for url in data['found_links']: # assumes 'found_links' ended up in data at some earlier step
            request = scrapy.Request(url, callback=self.parse3)
            request.meta['data'] = data # and keep on passing it along
            yield request
    

    But that's still not good enough for me!

    Finally, one can consider these more complex approaches, to handle flow control, so those pesky async calls become predictable:

    Force serialization of interdependent Requests, by altering the request flow:

    def start_requests(self):
        url = 'https://example.com/final'
        request = scrapy.Request(url, callback=self.parse1)
        request.meta['urls'] = [ 
            'https://example.com/page1',
            'https://example.com/page2',
            'https://example.com/page3',
        ]   
        yield request
    
    def parse1(self, response):
        urls = response.meta.get('urls')
        data = response.meta.get('data')
        if not data:
            data = {}
        # process page response somehow
        page = response.xpath('//body').extract()
        # and remember it
        data[response.url] = page
    
        # keep unrolling urls
        try:
            url = urls.pop()
            request = scrapy.Request(url, callback=self.parse1) # recurse
            request.meta['urls'] = urls # pass along
            request.meta['data'] = data # to next stage
            return request
        except IndexError: # list is empty
            # aggregate data somehow
            item = {}
            for url, stuff in data.items():
                item[url] = stuff
            return item
    

    Another option for this is scrapy-inline-requests, but be aware of its downsides as well (read the project README).

    # requires the scrapy-inline-requests package:
    from inline_requests import inline_requests

    @inline_requests
    def parse(self, response):
        urls = [response.url]
        for i in range(10):
            next_url = response.urljoin('?page=%d' % i)
            try:
                next_resp = yield Request(next_url, meta={'handle_httpstatus_all': True})
                urls.append(next_resp.url)
            except Exception:
                self.logger.info("Failed request %s", i, exc_info=True)
    
        yield {'urls': urls}
    

    Aggregate data in instance storage ("global data") and handle flow control through either or both of:

    • Scheduler request priorities, to force an order of responses, so that we can hope that by the time the last request is processed, all lower-priority stuff has finished.
    • Custom pydispatch signals for "out-of-band" notifications. While these are not really lightweight, they are a whole different layer for handling events and notifications.

    Here is an example flow using custom request priorities:

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }   
    data = {}
    
    def parse1(self, response):
        # prioritize these next requests over everything else
        urls = response.xpath('//a/@href').extract()
        for url in urls:
            yield scrapy.Request(url,
                                 priority=900,
                                 callback=self.parse2,
                                 meta={})
        final_url = 'https://final'
        yield scrapy.Request(final_url, callback=self.parse3)
    
    def parse2(self, response):
        # handle prioritized requests
        data = response.xpath('//what/we/need/text()').extract()
        self.data.update({response.url: data})
    
    def parse3(self, response):
        # collect data, other requests will have finished by now
        # IF THE CONCURRENCY IS LIMITED, otherwise no guarantee
        return self.data
    

    And a basic example using signals.
    This listens to the internal idle event, when the Spider has crawled all requests and is sitting pretty, to use it for doing last-second cleanup (in this case, aggregating our data). We can be absolutely certain that we won't be missing out on any data at this point.

    import scrapy
    from scrapy import Spider, signals
    from scrapy.exceptions import DontCloseSpider

    class SignalsSpider(Spider):

        name = 'signals'
        data = {}
        ima_done_now = False

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(SignalsSpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.idle, signal=signals.spider_idle)
            return spider

        def idle(self, spider):
            if self.ima_done_now:
                return
            self.crawler.engine.schedule(self.finalize_crawl(), spider)
            raise DontCloseSpider

        def finalize_crawl(self):
            self.ima_done_now = True
            # aggregate data and finish
            item = self.data
            return item

        def parse(self, response):
            next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
            if next_page_url:
                yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse2)

        def parse2(self, response):
            # handle requests
            data = response.xpath('//what/we/need/text()').extract()
            self.data.update({response.url: data})
    

    A final possibility is using external sources like message-queues or redis, as already mentioned, to control the spider flow from outside. And that covers all the ways I can think of.
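
    As a rough illustration of that last option, here is a minimal sketch of pre-filling requests from a Redis list. It assumes the redis-py client and a reachable local Redis instance; the key name 'start_urls' and the spider name are made up for this example:

    import redis
    import scrapy

    class RedisSeededSpider(scrapy.Spider):
        name = 'redis_seeded'  # illustrative name

        def start_requests(self):
            # assumes URLs were pushed into a Redis list beforehand, e.g. LPUSH start_urls <url>
            r = redis.Redis()  # localhost:6379 by default
            for url in r.lrange('start_urls', 0, -1):  # the 'start_urls' key is hypothetical
                yield scrapy.Request(url.decode('utf-8'), callback=self.parse)

        def parse(self, response):
            yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}

    For anything beyond a sketch, a dedicated project such as scrapy-redis is probably a better fit than hand-rolling this.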

    Once an Item is yielded/returned to the Engine, it will be passed to the ItemPipelines (which can make use of Exporters - not to be confused with FeedExporters), where you can continue to massage the data outside the Spider. A custom ItemPipeline implementation might store the items in a database, or do any number of exotic processing things on them.
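
    To round this off, here is a minimal sketch of such a custom ItemPipeline, writing each item to a JSON-lines file. The class name, output file name, and the priority value 300 are made up; it would be enabled via ITEM_PIPELINES in settings.py:

    # pipelines.py
    import json

    class StoreItemsPipeline:
        def open_spider(self, spider):
            self.file = open('items.jl', 'w')  # hypothetical output file

        def process_item(self, item, spider):
            # massage/store the item here (a database insert would also go in this method)
            self.file.write(json.dumps(dict(item)) + '\n')
            return item

        def close_spider(self, spider):
            self.file.close()

    # settings.py
    # ITEM_PIPELINES = {'myproject.pipelines.StoreItemsPipeline': 300}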

    Hope this helps.

    (And feel free to edit this with better text or examples, or fix any errors there may be.)
