Spark: how to get the number of written rows?
Question
I'm wondering if there is a way to know the number of rows written by a Spark save operation. I know that it's enough to count the RDD before writing it, but I'd like to know if there is a way to get the same information without doing that.
Thank you, Marco
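For reference, the baseline the question mentions (counting before writing) would look something like the sketch below; the RDD contents and the output path are placeholders. Its drawback is that `count()` triggers a separate Spark job, so the data is evaluated twice:

```scala
// Baseline approach: count the RDD before saving it.
val data = sc.parallelize(1 to 10, 2)
val n = data.count()                 // first job: a full pass just to count
data.saveAsTextFile("/tmp/foobar")   // second job: the actual write
println(s"wrote $n rows")
```

If the RDD is expensive to compute, caching it before the `count()` avoids recomputing it for the save, at the cost of memory.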
Answer
If you really want to, you can add a custom listener and extract the number of written rows from `outputMetrics`. A very simple example can look like this:
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var recordsWrittenCount = 0L

sc.addSparkListener(new SparkListener() {
  // Called once per finished task; accumulate the rows each task reported as written.
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    synchronized {
      recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
    }
  }
})

sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar")
recordsWrittenCount
// Long = 10
Note, however, that this part of the API is intended for internal use.
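Because the listener above keeps accumulating across every subsequent job, you may want to reset the counter between writes or detach the listener once you are done. A hedged sketch, assuming the listener instance is kept in a variable so it can be passed to `SparkContext.removeSparkListener` (available in recent Spark versions; the path `/tmp/foobar2` is a placeholder):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

var recordsWrittenCount = 0L

// Keep a reference to the listener so it can be removed later.
val listener = new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
  }
}
sc.addSparkListener(listener)

sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar2")

// Detach once the measurement is no longer needed, so later jobs
// don't keep incrementing the counter.
sc.removeSparkListener(listener)
```

Resetting `recordsWrittenCount = 0L` before each monitored write is the simpler alternative if you want per-write counts with a single long-lived listener.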