HDFS sink: "clever" folder routing


Problem Description





I am new to Flume (and to HDFS), so I hope my question is not stupid.

I have a multi-tenant application (about 100 different customers for now) and 16 different data types.

(In production, we have approx. 15 million messages/day through our RabbitMQ)

I want to write all my events to HDFS, separated by tenant, data type, and date, like this:

/data/{tenant}/{data_type}/2014/10/15/file-08.csv

Is this possible with one sink definition? I don't want to duplicate configuration, and new clients arrive every week or so.

In documentation, I see

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%Y/%m/%d/%H/

Is this possible?

agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/events/%tenant/%type/%Y/%m/%d/%H/

I want to write to different folders according to my incoming data.

Solution

Yes, this is indeed possible. You can use either the event metadata (headers) or some field in the incoming data to route the output.

For example, in my case I receive different types of log data and want to store each type in its own folder. In my case the first word of each log line is the file name. Here is the config snippet for that.

Interceptor:

dataplatform.sources.source1.interceptors = i3
dataplatform.sources.source1.interceptors.i3.type = regex_extractor
dataplatform.sources.source1.interceptors.i3.regex = ^(\\w*)\t.*
dataplatform.sources.source1.interceptors.i3.serializers = s1
dataplatform.sources.source1.interceptors.i3.serializers.s1.name = filename
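
With this interceptor in place, the regex captures the first word of each log line (everything before the first tab) into an event header named filename. As a purely hypothetical illustration: a log line that starts with "webapp" followed by a tab would get the header filename=webapp, which the sink path below picks up via %{filename}.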

HDFS Sink:

dataplatform.sinks.sink1.type = hdfs
dataplatform.sinks.sink1.hdfs.path = hdfs://server/events/provider=%{filename}/years=%Y/months=%Y%m/days=%Y%m%d/hours=%H
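
Note (standard Flume HDFS sink behaviour, not specific to this answer): %{filename} only resolves if that header is actually present on the event, and the %Y/%m/%d/%H escapes need a timestamp, either a timestamp header on the event (for example from a timestamp interceptor) or hdfs.useLocalTimeStamp = true on the sink. For the hypothetical filename=webapp event above, written on 2014-10-15 at 08:xx, the file would land under hdfs://server/events/provider=webapp/years=2014/months=201410/days=20141015/hours=08.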

Hope this helps.
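
Applied to the layout asked about in the question, the same mechanism works with a single sink, provided every event carries tenant and type headers; how those headers get set is an assumption here (for example via interceptors, or by mapping RabbitMQ message properties to headers in the source). A minimal sketch under that assumption, not a tested configuration:

# hypothetical sketch: assumes every event already has "tenant" and "type" headers
agent1.sinks.hdfs-sink1.type = hdfs
agent1.sinks.hdfs-sink1.hdfs.path = hdfs://server/data/%{tenant}/%{type}/%Y/%m/%d/%H
agent1.sinks.hdfs-sink1.hdfs.fileSuffix = .csv
agent1.sinks.hdfs-sink1.hdfs.useLocalTimeStamp = true

Because the directory is derived from the event itself, one sink definition covers every tenant, including customers added later; the exact "file-08.csv" naming from the question could be approximated with hdfs.filePrefix / hdfs.fileSuffix, or by keeping the hour in the path as above.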
