input/output for scrapyd instance hosted on an Amazon EC2 linux instance


Question

Recently I began working on building web scrapers using scrapy. Originally I had deployed my scrapy projects locally using scrapyd.

The scrapy project I built relies on accessing data from a CSV file in order to run:

    def search(self, response):
        # Read subscriber IDs from the local CSV and schedule one request per row
        with open('data.csv', 'rb') as fin:
            reader = csv.reader(fin)
            for row in reader:
                subscriberID = row[0]
                newEffDate = datetime.datetime.now()
                counter = 0
                yield scrapy.Request(
                    url="https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                    callback=self.find_term,
                    meta={
                        'ID': subscriberID,
                        'newDate': newEffDate,
                        'counter': counter
                    }
                )

It outputs scraped data to another CSV file:

    for x in data:
        # Append one row of scraped results to the output CSV
        with open('missing.csv', 'ab') as fout:
            csvwriter = csv.writer(fout, delimiter=',')
            csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
            return

We are in the initial stages of building an application that needs to access and run these scrapy spiders. I decided to host my scrapyd instance on an AWS EC2 linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).

How do I input/output scraped data to/from the scrapyd instance running on an AWS EC2 linux instance?

I'm assuming passing a file would look like:

curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path
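
For context on how scrapyd handles this: any extra -d parameter on schedule.json is simply forwarded to the spider as a spider argument (a string); the file itself is not uploaded, so the path has to already be readable on the EC2 instance and the spider has to read the argument itself. A minimal sketch of that, using a hypothetical input_file argument instead of in (which is a Python keyword):

    import scrapy

    class Spider2B(scrapy.Spider):
        name = 'spider2b'

        def __init__(self, input_file=None, *args, **kwargs):
            super(Spider2B, self).__init__(*args, **kwargs)
            # scrapyd passes extra schedule.json parameters (e.g. -d input_file=/path/on/ec2.csv) in here
            self.input_file = input_file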

Is this correct? How would I grab the output from this spider run? Does this approach have security issues?

Answer

Is S3 an option? I'm asking because you're already using EC2. If that's the case, you could read/write from/to S3.

I'm a bit confused because you mentioned both CSV and JSON formats. If you're reading CSV, you could use CSVFeedSpider. Either way, you could also use boto to read from S3 in your spider's __init__ or start_requests method.
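
A minimal sketch of that start_requests idea, using boto3 (the successor to boto) with a hypothetical bucket name and object key; the class name and the shortened URL are placeholders adapted from the question:

    import csv
    import io

    import boto3
    import scrapy

    class SubscriberSpider(scrapy.Spider):
        name = 'spider2b'

        def start_requests(self):
            # Pull the input CSV from S3 instead of a local data.csv
            s3 = boto3.client('s3')  # credentials come from the environment or the EC2 instance role
            obj = s3.get_object(Bucket='my-bucket', Key='input/data.csv')  # hypothetical bucket/key
            rows = csv.reader(io.StringIO(obj['Body'].read().decode('utf-8')))
            for row in rows:
                subscriberID = row[0]
                yield scrapy.Request(
                    url="https://www.healthnet.com/portal/provider/protected/patient/results.action?subscriberId=" + subscriberID,
                    callback=self.find_term,
                    meta={'ID': subscriberID},
                )

        def find_term(self, response):
            # parsing logic from the original project goes here
            pass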

Regarding the output, the Scrapy feed exports documentation explains how to write the output of a crawl to S3.

The relevant settings (a short settings.py sketch follows the list):

  • FEED_URI
  • FEED_FORMAT
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
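
A minimal settings.py sketch along those lines, with a hypothetical bucket and path; the key values are placeholders (on EC2 you may prefer an instance role over putting keys in settings):

    # settings.py: write scraped items to S3 as CSV via feed exports
    FEED_URI = 's3://my-bucket/scrapy-output/%(name)s/%(time)s.csv'   # hypothetical bucket/prefix
    FEED_FORMAT = 'csv'

    # placeholder credentials used by the S3 feed storage
    AWS_ACCESS_KEY_ID = 'YOUR_ACCESS_KEY_ID'
    AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET_ACCESS_KEY'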
