How to run a Scrapy spider from AWS Lambda?

Question

I'm trying to run a scrapy spider from within AWS Lambda. Here is what my current script looks like, which is scraping test data.

import boto3
import scrapy
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = [
        'http://books.toscrape.com/'
    ]

    def parse(self, response):
        # Follow the link to each book's detail page on the listing page
        for link in response.xpath('//article[@class="product_pod"]/div/a/@href').extract():
            yield response.follow(link, callback=self.parse_detail)
        # Then continue through the paginated listing, if there is a next page
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        # Extract the fields of interest from a single book's detail page
        title = response.xpath('//div[contains(@class, "product_main")]/h1/text()').extract_first()
        price = response.xpath('//div[contains(@class, "product_main")]/'
                               'p[@class="price_color"]/text()').extract_first()
        availability = response.xpath('//div[contains(@class, "product_main")]/'
                                      'p[contains(@class, "availability")]/text()').extract()
        availability = ''.join(availability).strip()
        upc = response.xpath('//th[contains(text(), "UPC")]/'
                             'following-sibling::td/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'availability': availability,
            'upc': upc
        }

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(BookSpider)
    process.start() # the script will block here until the crawling is finished

    # Upload the scraped results to S3
    data = open('result.json', 'rb')
    s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)
    print('All done.')

if __name__ == "__main__":
    main('', '')

I first tested this script locally, and it ran as expected: it scraped the data, saved it to result.json, and then uploaded that file to my S3 bucket.

Then, I configured my AWS Lambda function by following the guide here: https://serverless.com/blog/serverless-python-packaging/ and it successfully imports the Scrapy library within AWS Lambda for execution.
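
A minimal serverless.yml in the spirit of that guide might look something like the sketch below. The service and handler names here are just placeholders, and serverless-python-requirements is the plugin the guide uses to bundle dependencies such as Scrapy:

service: scrapy-crawler        # placeholder service name

provider:
  name: aws
  runtime: python3.8
  timeout: 300                 # a crawl will usually outlive the 6-second default

functions:
  crawl:
    handler: handler.main      # assumes the script above is saved as handler.py

plugins:
  - serverless-python-requirements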

However, when the script runs on AWS Lambda, it does not scrape any data and simply throws an error saying that result.json does not exist.

Any advice from anyone who has configured Scrapy to run on Lambda, has a workaround, or can point me in the right direction would be highly appreciated.

Thanks.

Answer

Just came across this whilst looking for something else, but off the top of my head...

Lambdas provide temp storage in /tmp, so I would suggest setting

'FEED_URI': '/tmp/result.json'

and then

data = open('/tmp/result.json', 'rb')

There are likely all sorts of best practices around using temp storage in lambdas, so I'd suggest spending a bit of time reading up on those.
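
As one example of such a practice, here is a minimal sketch of the handler with both changes applied, plus one precaution: Lambda reuses containers between invocations, so a result.json left in /tmp by a previous run should be removed first (older Scrapy versions append to an existing feed file, which would corrupt the JSON output). BUCKET, BookSpider, and the s3 client are the names from the question:

import os

def main(event, context):
    # Remove any feed file left over from a previous (warm) invocation,
    # since Scrapy may append to an existing file rather than overwrite it
    if os.path.exists('/tmp/result.json'):
        os.remove('/tmp/result.json')

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': '/tmp/result.json',  # /tmp is the only writable path in Lambda
    })

    process.crawl(BookSpider)
    process.start()  # blocks until the crawl is finished

    # Upload the finished feed to S3
    with open('/tmp/result.json', 'rb') as data:
        s3.put_object(Bucket=BUCKET, Key='result.json', Body=data)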
