Building a RESTful Flask API for Scrapy


Problem Description



The API should allow arbitrary HTTP GET requests containing the URLs the user wants scraped, and then Flask should return the results of the scrape.

The following code works for the first HTTP request, but after the Twisted reactor stops, it won't restart. I may not even be going about this the right way, but I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.

Is there a better way to architect this solution? Or how can I allow scrape_it to return without stopping the Twisted reactor (which can't be started again)?

from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)


def scrape_it(url):
    items = []
    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop()) # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0) # the script will block here until the crawling is finished


    return items

@app.route('/scrape/<path:url>')
def scrape(url):

    ret = scrape_it(url)

    return json.dumps(ret, ensure_ascii=False, encoding='utf8')


if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080

    app.run(debug=True, host='0.0.0.0', port=int(PORT))

Solution

I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started and stopped more than once in a single thread.

Let's assume there is no problem with the Twisted reactor and that you could start and stop it. It still wouldn't make things much better, because your scrape_it function may block for an extended period of time, so you would need many threads/processes.

I think the way to go is to create the API using an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.

Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is not hard either, because you can make it use the Twisted event loop.
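For illustration, here is a minimal sketch of what a Klein-based endpoint could look like; it is untested, and the host/port are placeholders. It assumes the same NGramsSpider from the question, and it collects items through the crawler's own signal manager instead of the deprecated scrapy.xlib.pydispatch dispatcher. Because a Klein route may return a Deferred, the crawl runs on the already-running reactor and nothing ever has to stop it:

import json

from klein import Klein
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from n_grams.spiders.n_gram_spider import NGramsSpider  # spider from the question

app = Klein()

@app.route('/scrape/<path:url>')
def scrape(request, url):
    items = []

    def collect_item(item, response, spider):
        items.append(item)

    runner = CrawlerRunner()
    crawler = runner.create_crawler(NGramsSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)

    # runner.crawl returns a Deferred; returning it lets Klein write the
    # response when the crawl finishes, without touching the reactor.
    d = runner.crawl(crawler, [url])
    d.addCallback(lambda _: json.dumps(items, ensure_ascii=False))
    return d

app.run('0.0.0.0', 8080)  # Klein starts the Twisted reactor and keeps it running

Note that CrawlerRunner reuses the reactor Klein is already running, which is exactly what the Flask version could not do.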

There is a project called ScrapyRT which does something very similar to what you want to implement - it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
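As a rough, untested sketch of how a client could talk to ScrapyRT (assuming it is running inside the Scrapy project on its default port 9080, and that the spider is registered under the hypothetical name n_gram_spider):

import requests

# The spider name and target URL here are assumptions; adjust them to
# your project. ScrapyRT exposes a /crawl.json endpoint by default.
resp = requests.get('http://localhost:9080/crawl.json',
                    params={'spider_name': 'n_gram_spider',
                            'url': 'http://example.com'})
print(resp.json()['items'])  # scraped items come back in the JSON response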

As an example of Scrapy-Tornado integration, check out Arachnado: it is an example of how to integrate Scrapy's CrawlerProcess with Tornado's Application.

If you really want a Flask-based API, then it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this route, you might as well use requests + BeautifulSoup.
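For completeness, a minimal untested sketch of the separate-process variant: every request spawns a fresh process, so the Twisted reactor only ever runs once per process. It assumes the NGramsSpider from the question and that the scraped items are picklable:

import json
from multiprocessing import Process, Queue

from flask import Flask
from scrapy import signals
from scrapy.crawler import CrawlerProcess

from n_grams.spiders.n_gram_spider import NGramsSpider  # spider from the question

app = Flask(__name__)

def run_spider(url, queue):
    items = []

    def collect_item(item, response, spider):
        items.append(item)

    process = CrawlerProcess()
    crawler = process.create_crawler(NGramsSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler, [url])
    process.start()  # blocks this child process until the crawl is done
    queue.put(items)  # items must be picklable to cross the process boundary

@app.route('/scrape/<path:url>')
def scrape(url):
    queue = Queue()
    p = Process(target=run_spider, args=(url, queue))
    p.start()
    items = queue.get()  # ties up a Flask worker for the whole crawl
    p.join()
    return json.dumps(items, ensure_ascii=False)

The blocking get() call ties up a whole Flask worker per crawl, which illustrates the efficiency loss described above.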
