How to run a Scrapy spider from AWS Lambda?

Question

I'm trying to run a scrapy spider from within AWS Lambda. Here is what my current script looks like, which is scraping test data.

import boto3
import scrapy
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = [
        'http://books.toscrape.com/'
    ]

    def parse(self, response):
        # Follow the link to each book's detail page on the listing page
        for link in response.xpath('//article[@class="product_pod"]/div/a/@href').extract():
            yield response.follow(link, callback=self.parse_detail)
        # Then continue through the paginated listing, if there is a next page
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Extract the fields of interest from a single book's detail page
        title = response.xpath('//div[contains(@class, "product_main")]/h1/text()').extract_first()
        price = response.xpath('//div[contains(@class, "product_main")]/'
                               'p[@class="price_color"]/text()').extract_first()
        availability = response.xpath('//div[contains(@class, "product_main")]/'
                                      'p[contains(@class, "availability")]/text()').extract()
        availability = ''.join(availability).strip()
        upc = response.xpath('//th[contains(text(), "UPC")]/'
                             'following-sibling::td/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'availability': availability,
            'upc': upc
        }

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(BookSpider)
    process.start() # the script will block here until the crawling is finished

    # Upload the scraped results to S3
    data = open('result.json', 'rb')
    s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)
    print('All done.')

if __name__ == "__main__":
    main('', '')

I first tested this script locally, and it ran as expected: it scraped the data, saved it to result.json, and then uploaded that file to my S3 bucket.

Then, I configured my AWS Lambda function by following the guide here: https://serverless.com/blog/serverless-python-packaging/ and it successfully imports the Scrapy library within AWS Lambda for execution.
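
A minimal serverless.yml in the spirit of that guide might look something like the sketch below. The service and handler names here are just placeholders, and serverless-python-requirements is the plugin the guide uses to bundle dependencies such as Scrapy:

service: scrapy-crawler        # placeholder service name

provider:
  name: aws
  runtime: python3.8
  timeout: 300                 # a crawl will usually outlive the 6-second default

functions:
  crawl:
    handler: handler.main      # assumes the script above is saved as handler.py

plugins:
  - serverless-python-requirements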

However, when the script runs on AWS Lambda, it does not scrape any data and simply throws an error saying that result.json does not exist.

Any advice from anyone who has configured Scrapy to run on Lambda, has a workaround, or can point me in the right direction would be highly appreciated.

Thanks.

Answer

Just came across this whilst looking for something else, but off the top of my head...

Lambdas provide temp storage in /tmp, so I would suggest setting

'FEED_URI': '/tmp/result.json'

and then

data = open('/tmp/result.json', 'rb')

There are likely all sorts of best practices around using temp storage in lambdas, so I'd suggest spending a bit of time reading up on those.
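
As one example of such a practice, here is a minimal sketch of the handler with both changes applied, plus one precaution: Lambda reuses containers between invocations, so a result.json left in /tmp by a previous run should be removed first (older Scrapy versions append to an existing feed file, which would corrupt the JSON output). BUCKET, BookSpider, and the s3 client are the names from the question:

import os

def main(event, context):
    # Remove any feed file left over from a previous (warm) invocation,
    # since Scrapy may append to an existing file rather than overwrite it
    if os.path.exists('/tmp/result.json'):
        os.remove('/tmp/result.json')

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/result.json',  # /tmp is the only writable path in Lambda
    })

    process.crawl(BookSpider)
    process.start()  # blocks until the crawl is finished

    # Upload the finished feed to S3
    with open('/tmp/result.json', 'rb') as data:
        s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)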
