Starting Scrapy from a Django view


Problem description

My experience with Scrapy is limited, and every time I use it, it's through terminal commands. How can I get my form data (a URL to be scraped) from my Django template over to Scrapy to start a crawl? So far, the only approach I've thought of is to take the form's returned data in a Django view and then reach into spider.py in the Scrapy project directory to add the form's URL to the spider's start_urls. From there, I don't really know how to trigger the actual crawling, since I'm used to doing it strictly from my terminal with commands like "scrapy crawl dmoz". Thanks.

Tiny edit: just discovered scrapyd... I think I may be headed in the right direction with this.

Answer

You've actually answered it with your edit. The best option is to set up a scrapyd service and make an API call to schedule.json to trigger a scraping job.

To make that HTTP API call, you can either use urllib2/requests directly, or use a wrapper around the scrapyd API, python-scrapyd-api:

from scrapyd_api import ScrapydAPI

# Point the wrapper at the scrapyd instance (default port is 6800)
scrapyd = ScrapydAPI('http://localhost:6800')
# Schedule a crawl job; returns the job id
scrapyd.schedule('project_name', 'spider_name')
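For completeness, a minimal sketch of calling the schedule.json endpoint directly with the standard library (urllib.request is the Python 3 successor to the urllib2 mentioned above); the host, project, and spider names here are illustrative placeholders, not values from the original answer:

```python
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # assumed default scrapyd address

def build_payload(project, spider, **spider_args):
    """scrapyd expects form-encoded fields; extra keys become spider arguments."""
    fields = {"project": project, "spider": spider}
    fields.update(spider_args)
    return urllib.parse.urlencode(fields).encode()

def schedule_crawl(project, spider, **spider_args):
    """POST to schedule.json; scrapyd responds with a JSON body containing the job id."""
    request = urllib.request.Request(
        SCRAPYD_URL, data=build_payload(project, spider, **spider_args)
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode()
```

A Django view would then call `schedule_crawl("project_name", "spider_name", url=form_url)`, passing the form's URL through as a spider argument.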

If we put scrapyd aside and try to run the spider from the view, it will block the request until the Twisted reactor stops, so it is not really an option.

You can, though, start using celery (in tandem with django_celery): define a task that runs your Scrapy spider and call that task from your Django view. That way you put the task on a queue and don't leave a user waiting for the crawl to finish.
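A sketch of that celery approach, under stated assumptions: the task shells out to `scrapy crawl` in a subprocess, which sidesteps the Twisted-reactor-blocking problem inside a long-lived worker. The spider name, the `-a url=` argument, and the view code are illustrative, not from the original answer:

```python
import subprocess

try:
    from celery import shared_task
except ImportError:  # keep the sketch importable even without celery installed
    def shared_task(fn):
        return fn

def crawl_command(spider_name, start_url):
    """Build the `scrapy crawl` invocation; -a passes a named spider argument."""
    return ["scrapy", "crawl", spider_name, "-a", f"url={start_url}"]

@shared_task
def run_spider(spider_name, start_url):
    """Run the crawl in a separate process so the worker process itself
    never starts (and never blocks on) the Twisted reactor."""
    subprocess.run(crawl_command(spider_name, start_url), check=True)

# In a Django view, queue the job instead of running it inline:
# def scrape(request):
#     run_spider.delay("dmoz", request.POST["url"])
#     return HttpResponse("Crawl queued")
```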

Alternatively, take a look at the django-dynamic-scraper package:

Django Dynamic Scraper (DDS) is an app for Django built on top of the scraping framework Scrapy. While preserving many of the features of Scrapy it lets you dynamically create and manage spiders via the Django admin interface.
