How to integrate Flask & Scrapy?

Problem Description

I'm using Scrapy to get data, and I want to use the Flask web framework to show the results on a web page. But I don't know how to call the spiders in the Flask app. I tried to use CrawlerProcess to call my spiders, but I got an error like this:

ValueError
ValueError: signal only works in main thread

Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My Scrapy code looks like this:

from scrapy import Item, Field, Spider, Request
from scrapy.selector import Selector


class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID": item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if url_list[0][i] == "#":
                if i + 1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My Flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods = ['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results= []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my Scrapy spiders when using the Flask web framework?

Recommended Answer

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.

If you are really limited to Flask and can't use anything else, the only way to integrate Scrapy with Flask is by launching an external process for every spider crawl, as the other answer recommends (note that your subprocess needs to be spawned in the proper Scrapy project directory).

The directory structure for all examples should look like this (I'm using the dirbot test project):

> tree -L 1                                                                                                                                                              

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here's a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    """
    Run the spider in another process and store the items in a file. Simply
    issue the command:

    > scrapy crawl dmoz -o "output.json"

    Wait for it to finish, then read output.json and return it to the client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py and visit localhost:5000; you should be able to see the items scraped.
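
One caveat with this subprocess approach: scrapy's -o option appends to an output file that already exists, so concurrent or repeated requests sharing output.json can corrupt it. Below is a minimal sketch of one way around that (a variant of the example above, not a definitive implementation), giving each request its own temporary output file:

# server.py (variant) - a sketch: each request crawls into its own
# temp file, so parallel requests never append to the same output.json
import os
import subprocess
import tempfile

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    # unique path for this request's items; delete the empty file right
    # away because scrapy's -o appends to files that already exist
    fd, output_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    os.unlink(output_path)
    try:
        subprocess.check_output(
            ["scrapy", "crawl", "dmoz", "-o", output_path])
        with open(output_path) as items_file:
            return items_file.read()
    finally:
        if os.path.exists(output_path):
            os.unlink(output_path)

if __name__ == '__main__':
    app.run(debug=True)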

Another, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein allows you to run your spiders asynchronously, in the same process as your web server. It's better in that it won't block on every request, and it allows you to simply return Scrapy/Twisted Deferreds from your HTTP route handlers.

The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects items and returns them to the caller. This option is a bit more advanced: you're running Scrapy spiders in the same process as the Python server, and items are kept in memory rather than written to a file (so there is no disk writing/reading as in the previous example). Most importantly, it's asynchronous, and everything runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create a Twisted Deferred that launches the crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when the crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above in a file server.py and place it in your Scrapy project directory, then open localhost:8080; it will launch the dmoz spider and return the scraped items to the browser as JSON.
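
To sanity-check the Klein server, you can fetch the endpoint from a separate script. A quick sketch, assuming Python 2 (to match the traceback above) and that server.py is already running:

# check_klein.py - a quick client-side check, assuming server.py is running
import json
import urllib2  # Python 2 stdlib, matching the environment in the question

response = urllib2.urlopen("http://localhost:8080/")
items = json.loads(response.read())
print "fetched %d items" % len(items)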

Some problems arise when you try to put an HTTP app in front of your spiders. For example, you sometimes need to handle spider logs (you may need them in some cases), and you need to handle spider exceptions somehow. There are projects that let you add an HTTP API to spiders in an easier way, e.g. ScrapyRT: an app that adds an HTTP server to your Scrapy spiders and handles all these problems for you (logging, spider errors, etc.).

So after installing ScrapyRT you only need to do:

> scrapyrt 

in your Scrapy project directory, and it will launch an HTTP server listening for requests. You then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider, crawling the given url.
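
If you want to keep serving pages from Flask, your Flask view can simply delegate the crawl to the running ScrapyRT server over HTTP. A minimal sketch, assuming the requests library is installed and ScrapyRT is listening on its default port 9080:

# a sketch: Flask view that delegates crawling to a running ScrapyRT server
import requests
from flask import Flask, jsonify

app = Flask(__name__)

SCRAPYRT_URL = "http://localhost:9080/crawl.json"  # ScrapyRT's default endpoint

@app.route('/crawl')
def crawl():
    # ScrapyRT runs the spider and returns the scraped items as JSON
    params = {"spider_name": "dmoz", "url": "http://alfa.com"}
    response = requests.get(SCRAPYRT_URL, params=params)
    return jsonify(response.json())

if __name__ == '__main__':
    app.run(debug=True)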

Disclaimer: I'm one of the authors of ScrapyRT.
