CrawlerRunner not crawling pages with Crochet

Problem description

I am trying to launch a Scrapy spider from a script using CrawlerRunner() so it can run in AWS Lambda.

I found a solution on Stack Overflow that uses the crochet library, but it doesn't work for me.

Links: StackOverflow 1, StackOverflow 2

This is the code:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging

# From response in Stackoverflow: https://stackoverflow.com/questions/41495052/scrapy-reactor-not-restartable
from crochet import setup
setup()

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]

        print ('Scrapped page n', page)


    def closed(self, reason):
        print ('Closed Spider: ', reason)


def run_spider():

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})

    crawler = CrawlerRunner(get_project_settings())
    crawler.crawl(QuotesSpider)        


run_spider()

When I execute the script, it returns this log:

INFO: Overridden settings: {}
2019-01-28 16:49:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2019-01-28 16:49:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-28 16:49:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-28 16:49:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-28 16:49:52 [scrapy.core.engine] INFO: Spider opened
2019-01-28 16:49:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-28 16:49:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

Why doesn't the crawler run the spider? I am running on Mac with Python 3.7.1.

Any help? I really appreciate your support.

Answer

I ran your code and can see the spider is running, but I can't see anything printed from the parse function.

I added

==your code end===
time.sleep(10)

at the end of your code, and then I can see the output of the parse function.

So the reason might be that the main process ends before the crawl ever reaches parse.
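
If a fixed sleep feels fragile, a more deterministic variant (just a sketch, not from the original answer, assuming crochet's wait_for decorator and reusing the imports and QuotesSpider from the question) is to block until the Deferred returned by crawler.crawl() fires:

from crochet import setup, wait_for
setup()

@wait_for(timeout=60.0)
def run_spider():
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    crawler = CrawlerRunner(get_project_settings())
    # crawl() returns a Deferred; returning it makes crochet block
    # the calling thread until the crawl finishes or the timeout expires.
    return crawler.crawl(QuotesSpider)

run_spider()  # blocks here, so the main process no longer exits early

This way the script waits exactly as long as the crawl needs instead of guessing a sleep duration.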
