Kafka Storm HDFS/S3 data flow


Question

It is unclear if you can do a fan-out (duplication) in Kafka like you can in Flume.

I'd like to have Kafka save data to HDFS or S3 and send a duplicate of that data to Storm for real time processing. The output of Storm aggregations/analysis will be stored in Cassandra. I see some implementations flowing all data from Kafka into Storm and then two outputs from Storm. However, I'd like to eliminate the dependency of Storm for the raw data storage.

Is this possible? Are you aware of any documentation/examples/implementations like this?

Also, does Kafka have good support for S3 storage?

I saw Camus for storing to HDFS -- do you just run this job via cron to continually load data from Kafka to HDFS? What happens if a second instance of the job starts before the previous has finished? Finally, would Camus work with S3?

Thanks -- I appreciate it!

Answer

Kafka actually retains events for a configurable period of time -- events are not purged immediately upon consumption like other message or queue systems. This allows you to have multiple consumers that can read from Kafka either at the beginning (per the configurable retention time) or from an offset.
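
For example, two independent consumer groups can each receive a full copy of the same topic. Below is a minimal sketch using the Kafka Java consumer client (which is newer than this answer); the broker address, topic name "events", and group ids are placeholder assumptions.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class IndependentConsumerDemo {
        public static void main(String[] args) {
            // Run once with "hdfs-loader" and once with "storm-analytics": each
            // group id tracks its own offsets, so both get a full copy of the topic.
            String groupId = args.length > 0 ? args[0] : "hdfs-loader";

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("group.id", groupId);
            // Start from the oldest retained message when this group has no committed offset.
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("group=%s offset=%d value=%s%n",
                                groupId, record.offset(), record.value());
                    }
                }
            }
        }
    }

Messages are removed only by the broker's retention policy (for example log.retention.hours), not by consumption, which is what makes this kind of fan-out possible.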

For the use case described, you would use Camus to batch load events to hadoop, and Storm to read events off the same Kafka output. Just ensure both processes read new events before the configurable retention time expires.
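
For the Storm half, a minimal topology sketch along these lines is shown below, using the storm-kafka KafkaSpout. The ZooKeeper address, topic name, parallelism hints, and the placeholder bolt (where the aggregation and Cassandra writes would go) are assumptions, not part of the original answer. The spout keeps its offsets under its own ZooKeeper path, independent of the offsets Camus tracks for the HDFS load.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaStormTopology {

        // Placeholder bolt: in the real topology this is where the
        // aggregation/analysis and the write to Cassandra would happen.
        public static class AggregateBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String event = tuple.getString(0);
                System.out.println("processing: " + event);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt, no output stream
            }
        }

        public static void main(String[] args) {
            // ZooKeeper address, topic name, and spout id are placeholders.
            SpoutConfig spoutConfig = new SpoutConfig(
                    new ZkHosts("localhost:2181"), "events", "/kafka-storm", "storm-analytics");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
            builder.setBolt("aggregate", new AggregateBolt(), 4).shuffleGrouping("kafka-spout");

            new LocalCluster().submitTopology("kafka-storm-demo", new Config(), builder.createTopology());
        }
    }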

Regarding Camus, ggupta1612 answered this aspect best:

A scheduler that launches the job should work. At LinkedIn they use Azkaban; you can look at that too.
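
As an illustration (not from the original answer), an Azkaban job definition that launches Camus on a schedule might look like the following sketch. The jar name and paths are placeholders, and the command assumes Camus's documented "hadoop jar ... com.linkedin.camus.etl.kafka.CamusJob -P <properties>" invocation.

    # camus-load.job -- minimal Azkaban job definition (jar name and paths are placeholders)
    type=command
    command=hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /etc/camus/camus.properties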

If one launches before the other finishes, some amount of data will be read twice, since the second job will start reading from the same offsets used by the first one.
