Spark: how to get the number of written rows?


Problem Description

I'm wondering if there is a way to know the number of lines written by a Spark save operation. I know that it's enough to do a count on the RDD before writing it, but I'd like to know if there is a way to have the same info without doing it.

Thank you, Marco
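
For reference, the straightforward baseline the question wants to avoid looks like this: calling count() before the write runs an extra job that scans the data once more. A minimal sketch (the output path is illustrative):

val rdd = sc.parallelize(1 to 10, 2)
val n = rdd.count()                 // extra pass over the data just to count it
rdd.saveAsTextFile("/tmp/example")  // hypothetical output path
// n = 10, at the cost of one additional job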

Recommended Answer

If you really want to, you can add a custom listener and extract the number of written rows from outputMetrics. A very simple example can look like this:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates the records written by every task in the application.
var recordsWrittenCount = 0L

sc.addSparkListener(new SparkListener() {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // onTaskEnd fires on the driver after each task finishes; tasks can
    // finish concurrently, so guard the shared counter with synchronized.
    synchronized {
      recordsWrittenCount += taskEnd.taskMetrics.outputMetrics.recordsWritten
    }
  }
})

sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar")
recordsWrittenCount
// Long = 10

Note, however, that this part of the API is intended for internal use.
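
Also note that listener events are delivered asynchronously, so the counter can lag slightly behind the action, and the listener stays registered for the lifetime of the SparkContext, so later jobs keep incrementing it. A minimal variant, assuming Spark 2.x or later where SparkContext.removeSparkListener is available, holds a reference to the listener so it can be detached after the write (the class name CountingListener and the /tmp/foobar2 path are illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Named listener class so the counter can be read back without
// relying on structural (reflective) member access.
class CountingListener extends SparkListener {
  var recordsWritten = 0L
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    recordsWritten += taskEnd.taskMetrics.outputMetrics.recordsWritten
  }
}

val listener = new CountingListener
sc.addSparkListener(listener)
sc.parallelize(1 to 10, 2).saveAsTextFile("/tmp/foobar2")
sc.removeSparkListener(listener)  // detach so later jobs are not counted
listener.recordsWritten
// Long = 10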
