Apache Flume - send only new file contents


Question

I am very new to Flume, so please treat me as an absolute beginner. I am having a minor issue configuring Flume for a particular use case and was hoping you could assist. Note that I am not using HDFS, which is why this question differs from others you may have seen on forums.

I have two virtual machines (VMs) connected to each other through an internal network on Oracle VirtualBox. My goal is to have one VM watch a particular directory that will only ever contain one file. When the file changes, I want Flume to send only the new lines/data. I want the other VM to receive this data and append it to a single file in a particular directory on that VM.

So far, I have this process very close to working. Whenever changes are made on VM1, they show up on VM2. However, the entire file on VM1 is sent to VM2 every time, not just the new lines. For example, if I wrote "Test1" to the file on VM1 and a while later wrote "Test2" underneath it, the output on VM2 would be:

Test1

Test1

Test2

What I would like to see is:

            Test1

            Test2

I am not sure how to implement this, and am sending this email after thoroughly examining the Flume user guide and the most relevant posts on Stack Overflow/Stack Exchange. For your reference, below are the current configurations (they work in the manner described above).

VM1 configuration

VM2 configuration

I realize another solution would be to keep the configuration on VM1 as it is and overwrite the file on VM2 every time new contents are detected. However, I am also unsure how to implement this.

Any assistance you could provide is greatly appreciated!

Recommended answer

Use the TAILDIR source provided in Flume. It periodically writes the last position it has read to a position file, which makes it more reliable than the exec source: if the agent crashes or is stopped for some reason, it resumes reading from the last position saved in the position file.

agent1.sources.src1.type = TAILDIR
agent1.sources.src1.channels = ch1
agent1.sources.src1.filegroups = f1
# Set f1 to the absolute path of the log file (placeholder shown here)
agent1.sources.src1.filegroups.f1 = /path/to/log/file
agent1.sources.src1.maxBackoffSleep = 10000

Set maxBackoffSleep as needed. It is the maximum time, in milliseconds, that the agent waits before polling the log file for changes again when the previous attempt found no new data.
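For context, here is a rough end-to-end sketch of how the TAILDIR source could fit into the two-VM setup described in the question. The asker's actual configurations are not shown above, so everything beyond the source settings in the answer is an assumption: the agent, channel, and sink names, the address 192.168.56.102, port 4545, and all paths are hypothetical placeholders, and the choice of an Avro sink/source pair with a file_roll sink is just one plausible way to move events between two agents without HDFS.

```properties
# ---- VM1 (hypothetical agent name: agent1) ----
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# TAILDIR source: emits only lines appended since the last saved position
agent1.sources.src1.type = TAILDIR
agent1.sources.src1.channels = ch1
agent1.sources.src1.filegroups = f1
# Placeholder path; replace with the absolute path of the watched file
agent1.sources.src1.filegroups.f1 = /path/to/log/file
# Placeholder location of the JSON file where read offsets are stored
agent1.sources.src1.positionFile = /path/to/taildir_position.json
agent1.sources.src1.maxBackoffSleep = 10000

agent1.channels.ch1.type = memory

# Avro sink forwards events to the agent on VM2 (address and port are assumptions)
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hostname = 192.168.56.102
agent1.sinks.sink1.port = 4545

# ---- VM2 (hypothetical agent name: agent2) ----
agent2.sources = avroSrc
agent2.channels = ch2
agent2.sinks = fileSink

# Avro source listens for events sent by VM1
agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.channels = ch2
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4545

agent2.channels.ch2.type = memory

# file_roll sink appends events to a file in a local directory;
# rollInterval = 0 disables rolling so output stays in a single file
agent2.sinks.fileSink.type = file_roll
agent2.sinks.fileSink.channel = ch2
agent2.sinks.fileSink.sink.directory = /path/to/output/dir
agent2.sinks.fileSink.sink.rollInterval = 0
```

With a setup along these lines, VM2 would only receive the lines that TAILDIR has not yet recorded as read, so appending "Test2" after "Test1" should produce a single "Test1" followed by "Test2" on VM2. A memory channel is used here only for brevity; it loses buffered events if the agent stops, so a file channel may be a better fit where that matters.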
