Real time log processing using Apache Spark Streaming

Problem description

I want to create a system where I can read logs in real time and use Apache Spark to process them. I am confused about whether I should use something like Kafka or Flume to pass the logs to Spark Streaming, or whether I should pass them using sockets. I have gone through the sample program in the Spark Streaming documentation (Spark stream example), but I would be grateful if someone could guide me to a better way to pass logs to Spark Streaming. It's kind of new turf for me.

Recommended answer

Apache Flume may help to read the logs in real time. Flume collects log data and transports it to the application, where Spark Streaming is then used to analyze the required information.

1. Download Apache Flume from the official site or follow the instructions from here.

2. Set up and run Flume: modify flume-conf.properties.template in the Flume installation directory (FLUME_INSTALLATION_PATH\conf); here you need to define the log source, the channel, and the sinks (output). More details about the setup here.

Here is an example of launching Flume that collects log information from a ping command running on a Windows host and writes it to a file:

flume-conf.properties

# Name the source, channel and sink of this agent
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

# Source: run an endless ping loop under PowerShell and capture its output
agent.sources.seqGenSrc.type = exec
agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = for() { ping google.com }
agent.sources.seqGenSrc.channels = memoryChannel

# Sink: roll the collected lines into files under D:\TMP\flu\
agent.sinks.loggerSink.type = file_roll
agent.sinks.loggerSink.channel = memoryChannel
agent.sinks.loggerSink.sink.directory = D:\\TMP\\flu\\
agent.sinks.loggerSink.serializer = text
agent.sinks.loggerSink.appendNewline = false
agent.sinks.loggerSink.rollInterval = 0

# Channel: buffer events in memory, holding at most 100 events
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

To run the example, go to FLUME_INSTALLATION_PATH and execute:

java -Xmx20m -Dlog4j.configuration=file:///%CD%\conf\log4j.properties -cp .\lib\* org.apache.flume.node.Application -f conf\flume-conf.properties -n agent

Alternatively, you may create your own Java application that has the Flume libraries on its classpath and invoke org.apache.flume.node.Application from that application, passing the corresponding arguments; a sketch of this follows.
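
A minimal sketch of that embedded launch, assuming the Flume JARs are on the classpath and reusing the config file path and agent name from the command above:

// EmbeddedFlumeLauncher.java -- starts a Flume agent in-process.
// Assumes flume-ng-node and its dependencies are on the classpath;
// the config path and agent name mirror the standalone launch above.
import org.apache.flume.node.Application;

public class EmbeddedFlumeLauncher {
    public static void main(String[] args) {
        // Same arguments as the command-line launch:
        // -f <config file>, -n <agent name from the properties file>
        Application.main(new String[] {
            "-f", "conf/flume-conf.properties",
            "-n", "agent"
        });
    }
}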

How to set up Flume to collect and transport logs?

You can use a script for gathering logs from the specified location:

agent.sources.seqGenSrc.shell = powershell -Command
agent.sources.seqGenSrc.command = your script here
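
For instance, one possible variant (the log path C:\logs\app.log is hypothetical) follows a growing file using PowerShell's Get-Content -Wait:

agent.sources.seqGenSrc.shell = powershell -Command
# Hypothetical path; Get-Content -Wait keeps emitting lines as the file grows
agent.sources.seqGenSrc.command = Get-Content -Path C:\logs\app.log -Wait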

Instead of a Windows script, you can also launch a Java application that provides smarter log collection (put 'java path_to_main_class arguments' in the command field). For example, if the file is modified in real time, you can use Tailer from Apache Commons IO, as sketched below. To configure Flume to transport the log information, read this article.
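
A minimal sketch of such a collector using Tailer, assuming Commons IO is on the classpath; the log path is hypothetical, and a real collector would hand each line to Flume rather than print it:

// LogTailer.java -- follows a growing log file, similar to 'tail -f'.
import java.io.File;
import org.apache.commons.io.input.Tailer;
import org.apache.commons.io.input.TailerListenerAdapter;

public class LogTailer {
    public static void main(String[] args) throws InterruptedException {
        // Callback invoked for every new line appended to the file
        TailerListenerAdapter listener = new TailerListenerAdapter() {
            @Override
            public void handle(String line) {
                System.out.println(line); // a real collector would forward this to Flume
            }
        };
        // Hypothetical path; poll every 500 ms, starting from the end of the file
        Tailer tailer = Tailer.create(new File("C:/logs/app.log"), listener, 500, true);
        // Keep the JVM alive while the tailer thread runs
        Thread.sleep(Long.MAX_VALUE);
    }
}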

3. Get the Flume stream in your source code and analyze it with Spark. Take a look at the code sample from GitHub: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java
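
A minimal sketch along the lines of that example, assuming the spark-streaming-flume artifact is on the classpath and that Flume is configured with an Avro sink; the host and port below are assumptions and must match that sink:

// FlumeEventCounter.java -- counts Flume events per batch, as in JavaFlumeEventCount.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

public class FlumeEventCounter {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("FlumeEventCounter");
        // Process the stream in 2-second batches
        JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(2000));

        // Receive events pushed by Flume's Avro sink to localhost:41414 (assumed)
        JavaReceiverInputDStream<SparkFlumeEvent> stream =
            FlumeUtils.createStream(jssc, "localhost", 41414);

        // Print the number of events received in each batch
        stream.count()
              .map(count -> "Received " + count + " flume events.")
              .print();

        jssc.start();
        jssc.awaitTermination();
    }
}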
