Apache Spark logging within Scala


Problem description

I am looking for a solution that lets me log additional data when executing code on Apache Spark nodes, to help investigate issues that might appear during execution. Trying a traditional solution such as com.typesafe.scalalogging.LazyLogging fails, because the log instance cannot be serialized in a distributed environment like Apache Spark.
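
For context, a minimal sketch of the kind of attempt described above (class and method names are illustrative, and the exact behavior depends on the scala-logging version): because the logger is a member of the enclosing class, the whole instance has to be shipped with the closure, which is what triggers the serialization failure.

import com.typesafe.scalalogging.LazyLogging
import org.apache.spark.{SparkConf, SparkContext}

class LazyLoggingExample extends LazyLogging {
  def run(): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("LazyLoggingExample")
    val sc = new SparkContext(conf)
    sc.parallelize(List(1, 2, 3)).foreach { element =>
      // `logger` comes from the LazyLogging trait, so this closure references `this`,
      // and the whole LazyLoggingExample instance must be serializable.
      logger.info(s"$element will be processed")
    }
    sc.stop()
  }
}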

I've investigated this problem, and for now the solution I found was to use the org.apache.spark.Logging trait like this:

class SparkExample extends Logging {
  val someRDD = ...
  someRDD.map { rddElement =>
    logInfo(s"$rddElement will be processed.")
    doSomething(rddElement)
  }
}

However, the Logging trait does not look like a permanent solution for Apache Spark, because it is marked as @DeveloperApi and the class documentation mentions:

This will likely be changed or removed in future releases.

I am wondering: is there any known logging solution I can use that will allow me to log data when the RDDs are executed on Apache Spark nodes?

@Later Edit: Some of the comments below suggest using Log4J. I've tried Log4J, but I'm still having issues when using a logger from a Scala class (and not a Scala object). Here is my full code:

import org.apache.log4j.Logger
import org.apache.spark._

object Main {
  def main(args: Array[String]) {
    new LoggingTestWithRDD().doTest()
  }
}

class LoggingTestWithRDD extends Serializable {

  // The Logger instance is a field of this class, so it is captured by the
  // closure below and Spark tries (and fails) to serialize it.
  val log = Logger.getLogger(getClass.getName)

  def doTest(): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("LogTest")
    val spark = new SparkContext(conf)

    val someRdd = spark.parallelize(List(1, 2, 3))
    someRdd.map { element =>
      log.info(s"$element will be processed")
      element + 1
    }
    spark.stop()
  }
}

The exception that I see is:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable -> Caused by: java.io.NotSerializableException: org.apache.log4j.Logger

Recommended answer

You can use Akhil's solution proposed in https://www.mail-archive.com/user@spark.apache.org/msg29010.html. I have used it myself and it works.

Akhil Das Mon, 25 May 2015 08:20:40 -0700
Try this way:

object Holder extends Serializable {
  // @transient + lazy: the logger is not serialized with the task; it is
  // re-created on first use inside each executor JVM.
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

val someRdd = spark.parallelize(List(1, 2, 3)).foreach { element =>
  Holder.log.info(element)
}
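
Applied to the code from the question, the whole thing looks roughly like this (a sketch under the same local[4] setup as above): the class no longer holds a Logger field, so nothing non-serializable is captured by the closure.

import org.apache.log4j.Logger
import org.apache.spark._

object Holder extends Serializable {
  @transient lazy val log = Logger.getLogger(getClass.getName)
}

class LoggingTestWithRDD extends Serializable {
  def doTest(): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("LogTest")
    val spark = new SparkContext(conf)

    spark.parallelize(List(1, 2, 3)).foreach { element =>
      // Resolved lazily on the executor instead of being serialized from the driver.
      Holder.log.info(s"$element will be processed")
    }
    spark.stop()
  }
}

Because Holder is a top-level object, the lazy val is initialized independently in each JVM, so the driver and every executor end up with their own Logger instance.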

