Scrapy crawlers not running simultaneously from Python script
Question
I was just wondering why this might be occurring. Here is my Python script to run them all:
from scrapy import cmdline

file = open('cityNames.txt', 'r')
cityNames = file.read().splitlines()

for city in cityNames:
    url = "http://" + city + ".website.com"
    output = city + ".json"
    cmdline.execute(['scrapy', 'crawl', 'backpage_tester', '-a', "start_url=" + url, '-o', output])
cityNames.txt:
chicago
sanfran
boston
It runs through the first city fine, but then stops after that. It doesn't run sanfran or boston - only chicago. Any thoughts? Thank you!
Answer
Your method is using synchronous calls. You should either use asynchronous calls in Python (asyncio?) or use a bash script that iterates over a text file of your URLs:
cat urls.txt | xargs -I{} scrapy crawl spider_name -a start_url={}
This should issue one scrapy process per URL. Be warned, however: if the crawls are extensive and deep on each site, and your spiders are not properly configured, this could easily overload your system.
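A detail worth noting about the original script: scrapy.cmdline.execute hands control to Scrapy and ends by calling sys.exit, so the for loop never reaches its second iteration. A minimal pure-Python sketch that avoids this, using only the standard library's subprocess module (spider name and file names are taken from the question; subprocess.run waits for each crawl, so the cities are crawled one after another):

```python
# Sketch: launch one `scrapy crawl` subprocess per city instead of calling
# scrapy.cmdline.execute, which calls sys.exit and never returns to the loop.
import subprocess


def build_command(city):
    """Build the `scrapy crawl` argv for one city (same flags as the question)."""
    url = "http://" + city + ".website.com"
    output = city + ".json"
    return ["scrapy", "crawl", "backpage_tester",
            "-a", "start_url=" + url, "-o", output]


def crawl_all(cities):
    for city in cities:
        # subprocess.run blocks until this crawl exits, then the loop continues;
        # check=True raises if a crawl fails instead of silently moving on.
        subprocess.run(build_command(city), check=True)


# Usage (requires Scrapy installed and the backpage_tester spider available):
# with open("cityNames.txt") as f:
#     crawl_all(f.read().splitlines())
```

To run the crawls in parallel rather than sequentially, subprocess.Popen could launch all of them first and wait on each afterwards; within a single process, Scrapy's CrawlerProcess can also schedule several spiders on one Twisted reactor.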