Apache Spark: Get number of records per partition


Problem description

I want to check how we can get information about each partition, such as the total number of records in each partition, on the driver side when the Spark job is submitted with deploy mode as yarn-cluster, in order to log or print it on the console.

Recommended answer

You can get the number of records per partition like this:

// .toDF on an RDD of tuples requires `import spark.implicits._` from the active SparkSession
df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }  // one (partitionId, rowCount) pair per partition
  .toDF("partition_number", "number_of_records")
  .show
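
As an alternative sketch that stays in the DataFrame API (assuming the same DataFrame `df`), the built-in spark_partition_id function can be grouped on to get the same kind of per-partition count; like the snippet above, it triggers a job:

import org.apache.spark.sql.functions.spark_partition_id

// Group rows by the physical partition they live in and count them
df
  .groupBy(spark_partition_id().alias("partition_number"))
  .count()
  .show()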

But this will also launch a Spark job by itself (because the file must be read by Spark to get the number of records).
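Since the question mentions logging on the driver side in yarn-cluster mode, here is a minimal sketch of doing that with the same approach; `df` and the log4j logger `log` are placeholder names assumed to exist in your driver code:

// The (partitionId, count) pairs are tiny, so collecting them to the driver is cheap
val partitionCounts = df.rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .collect()
partitionCounts.foreach { case (i, n) => log.info(s"Partition $i contains $n records") }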

Spark could also read Hive table statistics, but I don't know how to display those metadata.
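If the data comes from a Hive table, one way to populate and inspect those statistics from Spark SQL is sketched below; `spark` is the SparkSession and `my_table` is a placeholder name, and note these are table-level (or Hive-partition-level) statistics rather than per-Spark-partition counts:

// Gather row count / size statistics for the table into the metastore
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")
// The "Statistics" field of the extended description shows the stored numbers
spark.sql("DESCRIBE EXTENDED my_table").show(100, false)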
