How to stream data from Amazon SQS to files in Amazon S3

Problem Description

How can I quickly create a mechanism that reads JSON data from Amazon SQS and saves it as Avro files (or possibly another format) in an S3 bucket, partitioned by date and by the value of a given field in the JSON message?

Recommended Answer

You can write an AWS Lambda function that gets triggered by a message being sent to an Amazon SQS queue. You are responsible for writing that code, so the answer is that it depends on your coding skill.

However, if each message is processed individually, you will end up with one Amazon S3 object per SQS message, which is quite inefficient to process. The fact that the file is in Avro format is irrelevant because each file will be quite small. This will add a lot of overhead when processing the files.
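As an illustration of that per-message approach, here is a minimal Lambda handler sketch. The bucket name and partition field are hypothetical, and it writes plain JSON rather than Avro to keep the example short; notice how each message becomes its own tiny S3 object, which is exactly the inefficiency described above.

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

BUCKET = "my-output-bucket"      # hypothetical bucket name
PARTITION_FIELD = "customer_id"  # hypothetical field to partition by

def handler(event, context):
    # SQS-triggered Lambda: the batch of messages arrives in event["Records"].
    for record in event["Records"]:
        body = json.loads(record["body"])
        now = datetime.now(timezone.utc)
        # One object per message, keyed by date and the field's value.
        key = (
            f"date={now:%Y-%m-%d}/"
            f"{PARTITION_FIELD}={body.get(PARTITION_FIELD, 'unknown')}/"
            f"{record['messageId']}.json"
        )
        s3.put_object(Bucket=BUCKET, Key=key,
                      Body=json.dumps(body).encode("utf-8"))
    return {"written": len(event["Records"])}
```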

An alternative could be to send the messages to an Amazon Kinesis Data Firehose delivery stream, which can aggregate messages together by size (e.g. every 5 MB) or time (e.g. every 5 minutes) before delivering them to S3. This will result in fewer, larger objects in S3, but they will not be partitioned, nor in Avro format.
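A rough sketch of that setup with boto3 follows; the stream name, role ARN, and bucket ARN are hypothetical, and the BufferingHints values correspond to the 5 MB / 5 minute thresholds mentioned above.

```python
import json

import boto3

firehose = boto3.client("firehose")

# One-time setup: a delivery stream that buffers incoming records and
# flushes a single object to S3 every 5 MB or every 5 minutes,
# whichever comes first. Role and bucket ARNs are hypothetical.
firehose.create_delivery_stream(
    DeliveryStreamName="sqs-json-to-s3",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3",
        "BucketARN": "arn:aws:s3:::my-output-bucket",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
    },
)

# Producers (e.g. whatever currently feeds SQS) then push individual
# JSON messages into the stream instead.
firehose.put_record(
    DeliveryStreamName="sqs-json-to-s3",
    Record={"Data": (json.dumps({"customer_id": "42", "value": 1.0}) + "\n").encode("utf-8")},
)
```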

To get the best performance out of a format like Avro, combine the data into larger files that are more efficient to process. So, for example, you could use Kinesis Data Firehose to collect the data, then run a daily Amazon EMR job to combine those files into partitioned Avro files.
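As a sketch of such a daily compaction job, a PySpark script along these lines could run as an EMR step. The paths and partition column are hypothetical, and writing Avro from Spark assumes the spark-avro package is available (it ships with recent EMR releases).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("compact-to-avro").getOrCreate()

# Read the day's many small JSON objects (e.g. what Firehose delivered).
df = spark.read.json("s3://my-output-bucket/raw/2020-01-01/")

# Write far fewer, larger Avro files, partitioned by date and by the
# chosen field from the message ("customer_id" is hypothetical).
(df.withColumn("date", F.lit("2020-01-01"))
   .repartition("date", "customer_id")
   .write.format("avro")
   .partitionBy("date", "customer_id")
   .mode("overwrite")
   .save("s3://my-output-bucket/avro/"))
```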

So, the answer is: "It's pretty easy, but you probably don't want to do it."

Your question does not define how the data gets into SQS. If, rather than processing messages as soon as they arrive, you are willing to let the data accumulate in SQS for some period of time (e.g. 1 hour or 1 day), you could write a program that reads all of the messages and outputs them into partitioned Avro files. This uses SQS as a temporary holding area, allowing data to accumulate before being processed. However, it would lose any real-time reporting aspect.
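A minimal sketch of such a batch program using boto3 and fastavro, assuming a hypothetical queue URL, message schema, and partition field: it drains the queue, groups messages by (processing) date and field value, and writes one Avro file per partition. Uploading the resulting files to S3 is omitted for brevity.

```python
import json
import os
from collections import defaultdict
from datetime import datetime, timezone

import boto3
from fastavro import parse_schema, writer

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical
PARTITION_FIELD = "customer_id"  # hypothetical field from the JSON message

# Avro requires a schema; this one matches the hypothetical message shape.
SCHEMA = parse_schema({
    "name": "Message",
    "type": "record",
    "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "value", "type": "double"},
    ],
})

sqs = boto3.client("sqs")
groups = defaultdict(list)
today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

# Drain the queue: SQS returns at most 10 messages per call,
# so keep polling until a call comes back empty.
while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2
    )
    messages = resp.get("Messages", [])
    if not messages:
        break
    for m in messages:
        body = json.loads(m["Body"])
        groups[(today, body[PARTITION_FIELD])].append(body)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

# Write one Avro file per (date, field value) partition; these local
# files could then be uploaded to S3 under the same key layout.
for (date, value), records in groups.items():
    path = f"date={date}/{PARTITION_FIELD}={value}.avro"
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as out:
        writer(out, SCHEMA, records)
```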
