input/output for scrapyd instance hosted on an Amazon EC2 linux instance
Question
Recently I began working on building web scrapers using scrapy. Originally I had deployed my scrapy projects locally using scrapyd.
The scrapy project I built relies on accessing data from a CSV file in order to run:
def search(self, response):
    with open('data.csv', 'rb') as fin:
        reader = csv.reader(fin)
        for row in reader:
            subscriberID = row[0]
            newEffDate = datetime.datetime.now()
            counter = 0
            yield scrapy.Request(
                url = "https://www.healthnet.com/portal/provider/protected/patient/results.action?__checkbox_viewCCDocs=true&subscriberId=" + subscriberID + "&formulary=formulary",
                callback = self.find_term,
                meta = {
                    'ID': subscriberID,
                    'newDate': newEffDate,
                    'counter' : counter
                }
            )
It outputs the scraped data to another CSV file:
for x in data:
    with open('missing.csv', 'ab') as fout:
        csvwriter = csv.writer(fout, delimiter = ',')
        csvwriter.writerow([oldEffDate.strftime("%m/%d/%Y"), subscriberID, ipa])
return
We are in the initial stages of building an application that needs to access and run these scrapy spiders. I decided to host my scrapyd instance on an AWS EC2 linux instance. Deploying to AWS was straightforward (http://bgrva.github.io/blog/2014/04/13/deploy-crawler-to-ec2-with-scrapyd/).
How do I pass input to, and collect the scraped output from, a scrapyd instance running on an AWS EC2 linux instance?
I'm assuming passing a file would look like:
curl http://my-ec2.amazonaws.com:6800/schedule.json -d project=projectX -d spider=spider2b -d in=file_path
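One caveat on that guess: scrapyd's schedule.json endpoint does not accept a file upload, but it does forward any extra `-d` parameters to the spider as keyword arguments, so a path (or S3 key) can be passed and the spider can read the file itself. Below is a dependency-free sketch of how the spider would receive such an argument; the argument name `infile` is an assumption (note that `in` is a reserved word in Python, so it would be a poor choice of parameter name):

```python
class Spider2b:
    """Dependency-free stand-in for a scrapy.Spider subclass.

    scrapyd forwards any extra -d parameters on schedule.json to the
    spider constructor as keyword arguments, e.g.:
      curl http://my-ec2.amazonaws.com:6800/schedule.json \
           -d project=projectX -d spider=spider2b -d infile=data.csv
    """
    name = "spider2b"

    def __init__(self, infile=None, **kwargs):
        # 'infile' is a hypothetical argument name chosen for this sketch
        self.infile = infile


spider = Spider2b(infile="data.csv")
print(spider.infile)  # data.csv
```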
Is this correct? How would I grab the output from this spider run? Does this approach have security issues?
Answer
Is S3 an option? I'm asking because you're already using EC2. If that's the case, you could read/write from/to S3.
I'm a bit confused because you mentioned both CSV and JSON formats. If you're reading CSV, you could use CSVFeedSpider. Either way, you could also use boto to read from S3 in your spider's __init__ or start_requests method.
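As a sketch of that suggestion (the bucket and key names are assumptions, and the calls are the classic boto API rather than boto3), the S3 read in start_requests could look like the commented outline below; the CSV parsing is pulled into a small helper so it stands on its own:

```python
import csv
import io


def subscriber_rows(csv_text):
    """Parse a CSV body (e.g. one fetched from S3) into a list of rows."""
    return list(csv.reader(io.StringIO(csv_text)))


# Inside the spider, roughly (classic boto API; bucket/key names are made up):
#
#   import boto
#
#   def start_requests(self):
#       conn = boto.connect_s3()  # picks up AWS credentials from env/config
#       body = conn.get_bucket('my-bucket').get_key('data.csv').get_contents_as_string()
#       for row in subscriber_rows(body.decode('utf-8')):
#           subscriberID = row[0]
#           yield scrapy.Request(
#               url="https://www.healthnet.com/...",  # same URL as in the question
#               callback=self.find_term,
#               meta={'ID': subscriberID},
#           )
```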
Regarding the output, the Scrapy feed exports documentation explains how to write the output of a crawl to S3.
Relevant settings:
- FEED_URI
- FEED_FORMAT
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
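Put together, a minimal settings.py fragment for exporting the feed to S3 might look like this (the bucket name and path are placeholders; `%(name)s` and `%(time)s` are filled in by Scrapy at export time):

```python
# settings.py -- feed export to S3; bucket/path are placeholders
FEED_URI = 's3://my-bucket/crawls/%(name)s-%(time)s.csv'
FEED_FORMAT = 'csv'

# Credentials (the setting is spelled AWS_SECRET_ACCESS_KEY in Scrapy);
# these can also come from the environment instead of settings.py
AWS_ACCESS_KEY_ID = 'YOUR-ACCESS-KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR-SECRET-KEY'
```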