Scrapy and celery `update_state`


Problem description

I have the following setup (Docker):

  • Celery, linked to a Flask setup, which runs the Scrapy spider
  • A Flask setup (obviously)
  • The Flask setup receives a request for Scrapy -> fires off a worker to do some work

Now I wish to update the original flask setup on the progress of the celery worker. BUT there is no way right now to use celery.update_state() inside of the scraper as it has no access to the original task (though it is being run inside of the celery task).
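To make the wiring concrete, here is a minimal sketch of what the Flask side of such a setup could look like, assuming a Celery app with a configured result backend and a task called scrapping; the route names and import paths are placeholders, not code from the question. The point is that Flask only ever sees the progress through whatever the task reports via update_state():

from flask import Flask, jsonify
from celery.result import AsyncResult

from your_celery import app as celery_app   # placeholder import path
from your_tasks import scrapping            # the Celery task that runs the spider

flask_app = Flask(__name__)


@flask_app.route('/scrape', methods=['POST'])
def run_scrape():
    # fire off the worker and hand the task id back to the caller
    result = scrapping.delay()
    return jsonify({'task_id': result.id}), 202


@flask_app.route('/scrape/<task_id>')
def scrape_status(task_id):
    # .state and .info reflect whatever the worker last reported via update_state();
    # for a PROGRESS state, .info is the meta dict (on failure it is the exception)
    result = AsyncResult(task_id, app=celery_app)
    return jsonify({'state': result.state, 'meta': result.info})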

As an aside: am I missing something about the structure of scrapy? It would seem reasonable that I can assign arguments inside of __init__ to use further on, but scrapy seems to use the method as lambda functions.
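On the aside: Scrapy does hand extra keyword arguments through to the spider's __init__, both from the command line (-a name=value) and from CrawlerRunner/CrawlerProcess.crawl(). A minimal sketch, where task_id is purely an illustrative argument:

import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, task_id=None, **kwargs):
        # arguments passed to crawl(..., task_id=...) or via `-a task_id=...`
        # end up here and can be stored for later use in callbacks
        super().__init__(**kwargs)
        self.task_id = task_id

    def parse(self, response):
        self.logger.info('crawling %s for task %s', response.url, self.task_id)

# e.g. runner.crawl(MySpider, task_id=some_task_id)
#  or: scrapy crawl my_spider -a task_id=some_task_id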

To answer some of the questions:

  • How are you using celery with scrapy? Scrapy is running inside of a celery task, not run from the command line. I also have never heard of scrapyd; is this a subproject of scrapy? I use a remote worker to fire off scrapy from inside of a celery/flask instance, so it is not the same as the thread instanced by the original request; they are separate docker instances.

task.update_state works great inside of the celery task! But as soon as we are 'in' the spider, we no longer have access to celery. Any ideas?

From the item_scraped signal, issue Task.update_state(taskid, meta={}). You can also run it without the taskid if scrapy happens to be running in the Celery task itself (as it defaults to self).
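For reference, a minimal sketch of what that suggestion could look like when the task id is handed to the spider explicitly (the spider, import path, and meta contents here are all hypothetical):

import scrapy
from scrapy import signals

from your_tasks import scrapping   # the Celery task that launched the crawl (placeholder)


class ProgressSpider(scrapy.Spider):
    name = 'progress_spider'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # listen for this spider's own item_scraped events
        crawler.signals.connect(spider.item_scraped, signal=signals.item_scraped)
        return spider

    def item_scraped(self, item, response, spider):
        # task_id would have been passed in as a spider argument, e.g. crawl(..., task_id=...)
        scrapping.update_state(task_id=self.task_id,
                               state='PROGRESS',
                               meta={'last_url': response.url})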

Is this sort of like a static way of accessing the current celery task? I would love that...
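Celery does expose the currently executing task as celery.current_task, a proxy that only resolves to something inside a worker process, so a handler running in the same process as the task could reach it without an explicit reference. A minimal sketch (handler name and meta contents are just illustrative):

from celery import current_task


def my_item_scrapped_handler(item, response, spider):
    # current_task is a proxy to whatever task this worker is currently executing;
    # it is only meaningful when this code runs inside the Celery task's process
    if current_task:
        current_task.update_state(state='PROGRESS',
                                  meta={'last_url': response.url})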

Recommended answer

I'm not sure how you are firing your spiders, but I've faced the same issue you describe.

My setup is flask as a rest api, which upon requests fires celery tasks to start spiders. I haven't gotten around to coding it yet, but I'll tell you what I was thinking of doing:

from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals
from .your_celery import app

# MySpider is the spider class from your own Scrapy project;
# the import path below is just a placeholder
from your_project.spiders import MySpider


@app.task(bind=True)
def scrapping(self):

    def my_item_scrapped_handler(item, response, spider):
        meta = {
            # fill your state meta as required based on scrapped item,
            # spider, or response object passed as parameters
        }

        # here self refers to the task, so you can call update_state when using bind
        self.update_state(state='PROGRESS', meta=meta)

    settings = get_project_settings()
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    runner = CrawlerRunner(settings)
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl finishes

    # connect the handler before the reactor starts, so every scraped item
    # triggers a state update on the bound task
    for crawler in runner.crawlers:
        crawler.signals.connect(my_item_scrapped_handler, signal=signals.item_scraped)

    reactor.run()
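The idea here is that my_item_scrapped_handler closes over self (the bound Celery task), so every item_scraped signal becomes a PROGRESS update on that task, and that the signal is connected before reactor.run() starts the crawl; the flask side can then read the reported state back from the result backend, as sketched earlier.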

I'm sorry for not being able to confirm if it works, but as soon as I get around to testing it I'll report back here! I currently can't dedicate as much time as I'd like to this project :(

Do not hesitate to contact me if you think I can help you any further!

Cheers, Ramiro
