input/output for scrapyd instance hosted on an Amazon EC2 linux instance


Question

Recently I began working on building web scrapers using scrapy. Originally I had deployed my scrapy projects locally using scrapyd.

The scrapy project I built relies on accessing data from a CSV file in order to run:

    def search(self, response):
        # Read subscriber IDs from the local CSV and schedule one request per row
        with open('data.csv', 'rb') as fin:
            reader = csv.reader(fin)
            for row in reader:
                subscriberID = row[0]
                newEffDate = datetime.datetime.now()
                counter = 0
                yield scrapy.Request(
                    url="https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                    callback=self.find_term,
                    meta={
                        'ID': subscriberID,
                        'newDate': newEffDate,
                        'counter': counter
                    }
                )

It outputs scraped data to another CSV file:

    for x in data:
        # Append one row of scraped results to the output CSV
        with open('missing.csv', 'ab') as fout:
            csvwriter = csv.writer(fout, delimiter=',')
            csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
            return

We are in the initial stages of building an application that needs to access and run these scrapy spiders. I decided to host my scrapyd instance on an AWS EC2 linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).

How do I input/output scraped data to/from the scrapyd instance running on an AWS EC2 linux instance?

I'm assuming passing a file would look like:

curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path
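
For context on how scrapyd handles this: any extra -d parameter on schedule.json is simply forwarded to the spider as a spider argument (a string); the file itself is not uploaded, so the path has to already be readable on the EC2 instance and the spider has to read the argument itself. A minimal sketch of that, using a hypothetical input_file argument instead of in (which is a Python keyword):

    import scrapy

    class Spider2B(scrapy.Spider):
        name = 'spider2b'

        def __init__(self, input_file=None, *args, **kwargs):
            super(Spider2B, self).__init__(*args, **kwargs)
            # scrapyd passes extra schedule.json parameters (e.g. -d input_file=/path/on/ec2.csv) in here
            self.input_file = input_file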

Is this correct? How would I grab the output from this spider run? Does this approach have security issues?

Answer

Is S3 an option? I'm asking because you're already using EC2. If that's the case, you could read/write from/to S3.

I'm a bit confused because you mentioned both CSV and JSON formats. If you're reading CSV, you could use CSVFeedSpider. Either way, you could also use boto to read from S3 in your spider's __init__ or start_requests method.
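
A minimal sketch of that start_requests idea, using boto3 (the successor to boto) with a hypothetical bucket name and object key; the class name and the shortened URL are placeholders adapted from the question:

    import csv
    import io

    import boto3
    import scrapy

    class SubscriberSpider(scrapy.Spider):
        name = 'spider2b'

        def start_requests(self):
            # Pull the input CSV from S3 instead of a local data.csv
            s3 = boto3.client('s3')  # credentials come from the environment or the EC2 instance role
            obj = s3.get_object(Bucket='my-bucket', Key='input/data.csv')  # hypothetical bucket/key
            rows = csv.reader(io.StringIO(obj['Body'].read().decode('utf-8')))
            for row in rows:
                subscriberID = row[0]
                yield scrapy.Request(
                    url="https://www.healthnet.com/portal/provider/protected/patient/results.action?subscriberId=" + subscriberID,
                    callback=self.find_term,
                    meta={'ID': subscriberID},
                )

        def find_term(self, response):
            # parsing logic from the original project goes here
            pass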

Regarding the output, the Scrapy feed exports documentation explains how to write the output of a crawl to S3.

The relevant settings (a short settings.py sketch follows the list):

  • FEED_URI
  • FEED_FORMAT
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
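
A minimal settings.py sketch along those lines, with a hypothetical bucket and path; the key values are placeholders (on EC2 you may prefer an instance role over putting keys in settings):

    # settings.py: write scraped items to S3 as CSV via feed exports
    FEED_URI = 's3://my-bucket/scrapy-output/%(name)s/%(time)s.csv'   # hypothetical bucket/prefix
    FEED_FORMAT = 'csv'

    # placeholder credentials used by the S3 feed storage
    AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'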
