Real-time log processing using Apache Spark Streaming


Problem description


I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused whether I should use something like Kafka or Flume to pass the logs to Spark Streaming, or whether I should pass the logs using sockets. I have gone through the sample program in the Spark Streaming documentation (Spark streaming examples), but I would be grateful if someone could point me to a better way to pass logs to Spark Streaming. It's kind of new turf for me.

Solution

Apache Flume can help read the logs in real time. Flume collects the logs and transports them to the application where Spark Streaming analyzes the required information.

1. Download Apache Flume from the official site or follow the instructions from here.

2. Set up and run Flume. Modify flume-conf.properties.template in the directory where Flume is installed (FLUME_INSTALLATION_PATH\conf); here you need to provide the log source, channel and sink (output). More details about the setup are here.

Here is an example of launching Flume that collects log information from a ping command running on a Windows host and writes it to a file:

flume-conf.properties

# Declare the source, channel and sink for the agent named "agent"
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

# Source: run an endless ping loop via PowerShell and capture its output
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = for() { ping google.com }
agent.sources.seqGenSrc.channels = memoryChannel

# Sink: write the collected lines to files under D:\TMP\flu\
# (rollInterval = 0 disables time-based file rolling)
agent.sinks.loggerSink.type = file_roll
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0

# Channel: buffer up to 100 events in memory between source and sink
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

To run this example, go to FLUME_INSTALLATION_PATH and execute:

java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent

Alternatively, you can create your own Java application that has the Flume libraries on its classpath and invoke org.apache.flume.node.Application from it, passing the corresponding arguments.
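For illustration, a minimal sketch of such an embedded launch (assumptions: the Flume jars are on the classpath, the flume-conf.properties shown above sits in conf/, and the class name EmbeddedFlumeLauncher is hypothetical):

import org.apache.flume.node.Application;

public class EmbeddedFlumeLauncher {
    public static void main(String[] args) throws Exception {
        // Same arguments as the command-line invocation above:
        // -f <path to the config file>, -n <agent name from the config>
        Application.main(new String[] {
                "-f", "conf/flume-conf.properties",
                "-n", "agent"
        });
    }
}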

How to set up Flume to collect and transport logs?

You can use a script for gathering logs from a specified location:

agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here
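For example (a hypothetical configuration; the path D:\logs\app.log is an assumption for illustration), PowerShell's Get-Content with the -Wait flag follows a growing file much like tail -f on Unix:

agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = Get-Content -Path 'D:\\logs\\app.log' -Wait -Tail 0

Here -Tail 0 skips the file's existing content, so only lines appended after the agent starts are collected.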

Instead of a Windows script you can also launch a Java application (put 'java path_to_main_class arguments' in the field) that provides smarter log collection. For example, if the file is modified in real time you can use Tailer from Apache Commons IO, as sketched below. To configure Flume to transport the log information, read this article.
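A minimal sketch of the Tailer approach (the log path and the 500 ms polling delay are assumptions for illustration):

import java.io.File;
import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class LogTailer {
    public static void main(String[] args) {
        // Called for every new line appended to the file (like `tail -f`);
        // replace the println with whatever forwarding logic you need.
        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                System.out.println(line);
            }
        };
        // Poll the file for new content every 500 ms on a background thread.
        Tailer tailer = Tailer.create(new File("D:\\logs\\app.log"), listener, 500);
        // Stop tailing cleanly when the JVM shuts down.
        Runtime.getRuntime().addShutdownHook(new Thread(tailer::stop));
    }
}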

3. Get the Flume stream in your source code and analyze it with Spark. Take a look at the code sample from GitHub: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
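Adapted from that example, a minimal sketch of the receiving side (assumptions: spark-streaming-flume is on the classpath, and the Flume agent is configured with an avro sink pointing at localhost:41414 instead of the file_roll sink shown above):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeEventCount {
    public static void main(String[] args) throws Exception {
        // Host/port where Spark listens for Flume's avro sink.
        String host = "localhost";
        int port = 41414;

        SparkConf conf = new SparkConf().setAppName("FlumeEventCount");
        // Process the incoming events in 2-second batches.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(2000));

        // Receive events pushed by the Flume avro sink.
        JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
                FlumeUtils.createStream(ssc, host, port);

        // Count events per batch; replace this with your own log analysis.
        flumeStream.count()
                .map(cnt -> "Received " + cnt + " flume events.")
                .print();

        ssc.start();
        ssc.awaitTermination();
    }
}

From here, flumeStream can be transformed with the usual DStream operations (map, filter, window, etc.) to do the actual log processing.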

