Kafka Storm HDFS/S3 data flow


Problem description


It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.

I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real-time processing. The output of the Storm aggregations/analysis will be stored in Cassandra. I see some implementations that flow all data from Kafka into Storm and then write two outputs from Storm. However, I'd like to eliminate the dependency on Storm for raw data storage.

Is this possible? Are you aware of any documentation/examples/implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?

Thanks -- I appreciate it!

Solution

Kafka actually retains events for a configurable period of time -- events are not purged immediately upon consumption the way they are in other message or queue systems. This allows you to have multiple consumers that can each read from Kafka either from the beginning (bounded by the configured retention time) or from an offset.
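As a minimal sketch of that multiple-independent-consumers idea (this uses the newer Kafka Java client rather than anything from the original discussion; the broker address, topic name, and group ids are placeholders), each distinct group.id tracks its own offsets, so an HDFS/S3 loader and a Storm feed can each read the full stream on their own:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class IndependentReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");            // placeholder broker
            // Each distinct group.id keeps its own committed offsets, so e.g. an
            // "hdfs-loader" group and a "storm-feed" group both see the full stream.
            props.put("group.id", args.length > 0 ? args[0] : "hdfs-loader");
            props.put("auto.offset.reset", "earliest");                  // start at the beginning,
                                                                         // bounded by retention
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }

Running it once with group id hdfs-loader and once with, say, storm-feed gives two independent reads of the same topic.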

For the use case described, you would use Camus to batch-load events into Hadoop, and Storm to read events off the same Kafka topic(s). Just ensure both processes read new events before the configured retention time expires.
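On the Storm side, a sketch of a topology reading off that same topic with the old storm-kafka spout might look like the following (the ZooKeeper address, topic name, and consumer id are placeholders, and the bolt just logs instead of aggregating into Cassandra):

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaStormTopology {

        /** Stand-in bolt for the real aggregation / Cassandra-writing logic. */
        public static class LogBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println("got event: " + input.getString(0));
            }
            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt, no output stream
            }
        }

        public static void main(String[] args) throws Exception {
            // ZooKeeper connect string, topic, and consumer id are placeholders.
            ZkHosts zkHosts = new ZkHosts("localhost:2181");
            SpoutConfig spoutConfig =
                    new SpoutConfig(zkHosts, "events", "/kafka-storm", "storm-consumer");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
            builder.setBolt("process", new LogBolt(), 2).shuffleGrouping("kafka-spout");

            new LocalCluster().submitTopology("kafka-storm-example", new Config(),
                    builder.createTopology());
        }
    }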

Regarding Camus, ggupta1612 answered this aspect best:

A scheduler that launches the job should work. What they use at LinkedIn is Azkaban; you can look at that too.

If one run launches before the previous one finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
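One way to guard against that overlap (not part of the original answer; the lock-file path, jar name, and properties file are illustrative and must match your own Camus build) is to have cron call a small wrapper that skips the run while a previous one still holds a file lock:

    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class SingleRunCamusLauncher {
        public static void main(String[] args) throws Exception {
            // Placeholder lock-file path; any path writable by the cron user works.
            try (RandomAccessFile raf = new RandomAccessFile("/tmp/camus-loader.lock", "rw");
                 FileChannel channel = raf.getChannel();
                 FileLock lock = channel.tryLock()) {
                if (lock == null) {
                    System.out.println("Previous Camus run still in progress, skipping this run.");
                    return;
                }
                // The command line is illustrative; jar name, main class, and
                // properties file depend on your Camus build and configuration.
                int exitCode = new ProcessBuilder(
                        "hadoop", "jar", "camus-example.jar",
                        "com.linkedin.camus.etl.kafka.CamusJob",
                        "-P", "camus.properties")
                        .inheritIO()
                        .start()
                        .waitFor();
                System.out.println("Camus run finished with exit code " + exitCode);
            }
        }
    }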
