How to integrate Flask & Scrapy?


Problem description



I'm using Scrapy to get data and I want to use the Flask web framework to show the results on a web page. But I don't know how to call the spiders in the Flask app. I tried using CrawlerProcess to call my spiders, but I got an error like this:

ValueError
ValueError: signal only works in main thread

Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
_SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My Scrapy code looks like this:

class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID":item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My Flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods = ['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results= []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my Scrapy spiders from the Flask web framework?

Solution

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.

1. Python subprocess

If you are really limited to Flask and can't use anything else, the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as the other answer recommends (note that the subprocess needs to be spawned in the proper Scrapy project directory).

The directory structure for all examples should look like this; I'm using the dirbot test project:

> tree -L 1                                                                                                                                                              

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here's a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file. Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and read output.json back to the client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()

if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py and visit localhost:5000; you should see the scraped items.
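One practical caveat, tying back to the note about the project directory: the scrapy crawl subprocess only works when it runs from the directory containing scrapy.cfg, and a single shared output.json can be clobbered by concurrent requests. The sketch below is a variation of my own, not part of the original answer; PROJECT_DIR is an assumed path you would set yourself, and the temporary-file handling is likewise an assumption:

# server.py (variation) - run the crawl from an explicit project directory
import os
import subprocess
import tempfile

from flask import Flask

app = Flask(__name__)

# assumption: path to the directory that contains scrapy.cfg, adjust for your layout
PROJECT_DIR = "/path/to/scrapy/project"

@app.route('/')
def hello_world():
    # per-request temporary output file, so concurrent requests don't overwrite each other
    fd, output_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    os.remove(output_path)  # only the unique name is needed; let Scrapy create the file
    try:
        # cwd makes sure 'scrapy crawl' finds scrapy.cfg even if Flask was started elsewhere
        subprocess.check_output(
            ["scrapy", "crawl", "dmoz", "-o", output_path],
            cwd=PROJECT_DIR)
        with open(output_path) as items_file:
            return items_file.read()
    finally:
        if os.path.exists(output_path):
            os.remove(output_path)

if __name__ == '__main__':
    app.run(debug=True)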

2. Twisted-Klein + Scrapy

Another, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein lets you run your spiders asynchronously in the same process as your web server. It's better in that it won't block on every request, and it allows you to simply return Scrapy/Twisted deferreds from an HTTP route request handler.

The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects the items and returns them to the caller. This option is a bit more advanced: you're running Scrapy spiders in the same process as the Python server, and items are kept in memory rather than written to a file (so there is no disk writing/reading as in the previous example). Most importantly, it's asynchronous and it all runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - call return_items when the crawl is done
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above in a file named server.py, place it in your Scrapy project directory, and open localhost:8080; it will launch the dmoz spider and return the scraped items to the browser as JSON.
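On the serializer hint inside return_spider_output: plain json.dumps will fail on values that are not JSON-native (datetimes, Decimals, and so on). A small sketch of the alternative mentioned in that comment, using Scrapy's own encoder (ScrapyJSONEncoder lives in scrapy.utils.serialize in current Scrapy releases; verify against the version you use):

from scrapy.utils.serialize import ScrapyJSONEncoder

_encoder = ScrapyJSONEncoder()

def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: JSON string with the list of items
    """
    # ScrapyJSONEncoder knows how to serialize Item objects, datetimes, Decimals, etc.
    return _encoder.encode(output)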

3. ScrapyRT

Some problems arise when you try to put an HTTP app in front of your spiders. For example, you sometimes need to handle spider logs (you may need them in some cases), you need to handle spider exceptions somehow, and so on. There are projects that let you add an HTTP API to spiders in an easier way, e.g. ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all of these problems for you (e.g. logging, spider errors, etc.).
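ScrapyRT is published on PyPI, so installation is typically just:

> pip install scrapyrt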

So after installing ScrapyRT you only need to do:

> scrapyrt 

in your Scrapy project directory, and it will launch an HTTP server listening for requests. You can then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider to crawl the given URL.
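If you still want Flask in front, your Flask view can simply delegate to that ScrapyRT endpoint instead of driving the crawler itself. A rough sketch, assuming the third-party requests package and ScrapyRT running locally on port 9080 (the spider name and target URL are just the ones from the example above):

# flask_front.py - Flask view that delegates crawling to ScrapyRT
import requests
from flask import Flask, jsonify

app = Flask(__name__)

SCRAPYRT_ENDPOINT = "http://localhost:9080/crawl.json"  # default ScrapyRT address

@app.route('/crawl')
def crawl():
    # ScrapyRT runs the spider and returns the scraped items as JSON; pass the result through
    resp = requests.get(SCRAPYRT_ENDPOINT,
                        params={"spider_name": "dmoz", "url": "http://alfa.com"},
                        timeout=120)
    resp.raise_for_status()
    return jsonify(resp.json())

if __name__ == '__main__':
    app.run(debug=True)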

Disclaimer: I'm one of the authors of ScrapyRt.
