Spark losing println() on stdout

Problem description

I have the following code:

val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
  for (value <- data.getValues()) {
    if (value.getEnum() == DataEnum.BLUE) {
      blueCount += 1
      println("Enum = BLUE : " + value.toString())
    }
  }
  data
}.persist(StorageLevel.MEMORY_ONLY_SER)

output.saveAsTextFile("myOutput")

Then the blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!

Recommended answer

This is a conceptual question...

Imagine you have a big cluster composed of many workers, say n of them, and those workers store the partitions of an RDD or DataFrame. Now imagine you start a map task across that data, and inside that map you have a print statement. First of all:

  • Where would that data be printed?
  • Which node has priority, and which partition?
  • If all the nodes are running in parallel, who is going to print first?
  • How would such a print queue be created?

Those are too many open questions, so the designers/maintainers of apache-spark logically decided to drop any support for print statements inside any map-reduce operation (this includes accumulators and even broadcast variables).
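
To make the distinction concrete, here is a minimal sketch (assuming a Spark 2.x SparkContext named sc; the accumulator name "evenCount" is just for illustration): the accumulator updated inside the map is readable on the driver once an action runs, but the println() inside the same closure never shows up on the driver's console.

// Minimal sketch, Spark 2.x API assumed: the accumulator value is visible on
// the driver after an action, the println() inside the closure is not.
val evenCount = sc.longAccumulator("evenCount")

val doubled = sc.parallelize(1 to 100).map { n =>
  if (n % 2 == 0) evenCount.add(1)
  println(s"processing $n")  // does not appear on the driver's console
  n * 2
}

doubled.count()                             // an action triggers the job
println(s"evenCount = ${evenCount.value}")  // prints "evenCount = 50" on the driver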

This also makes sense because Spark is a framework designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD, because they are built to hold millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?

To prove this, you can run the following Scala code, for example:

// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)

def printStuff(x: Int): Int = {
  println(x)
  x + 1
}

// It doesn't print anything on the driver, because of the design limitation above
// (and map is lazy, so nothing even runs here until an action is called)
rdd.map(printStuff)

// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
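
Applied back to the code in the question, a minimal sketch of the same idea could look like this (DataEnum, getValues() and getEnum() come from the asker's own classes and are assumed here to behave as in the question, with getValues() returning a Scala collection): keep counting with the accumulator as before, but bring only a small sample of the BLUE values back to the driver and print them there, instead of calling println() inside the map.

// Hypothetical adaptation of the question's code: DataEnum, getValues() and
// getEnum() are assumed from the asker's own classes.
val blueValues = input.flatMap { data =>
  for (value <- data.getValues() if value.getEnum() == DataEnum.BLUE) yield value
}

// take() brings a small sample back to the driver, where println() is visible.
blueValues.take(10).foreach(v => println("Enum = BLUE : " + v))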
