Apache Spark: Get number of records per partition


Problem description

I want to find out how to get information about each partition, such as the total number of records per partition, on the driver side when a Spark job is submitted in yarn-cluster deploy mode, so that the counts can be logged or printed to the console.

Recommended answer

You can get the number of records per partition like this:

// toDF below needs the implicits of the active SparkSession (assumed here to be named spark)
import spark.implicits._

df
  .rdd
  // pair each partition index with the number of rows it contains
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show()

But this will also launch a Spark job of its own, because Spark has to read the data to count the records.
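
Since the question asks for the counts to be logged on the driver when the job runs in yarn-cluster mode, here is a minimal sketch (not part of the original answer) that collects the per-partition counts to the driver and prints them; it assumes the same DataFrame df:

// collect the (partition index, record count) pairs to the driver;
// the result is tiny (one entry per partition), so collect() is safe here
val counts = df.rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .collect()

counts.foreach { case (partition, n) =>
  println(s"partition $partition contains $n records")
}

If you prefer to stay in the DataFrame API, grouping by org.apache.spark.sql.functions.spark_partition_id and counting gives the same numbers, and it triggers a job for the same reason.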

Spark may also be able to read Hive table statistics, but I don't know how to display that metadata.
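
As an aside not covered in the original answer: if the data lives in a Hive table, table-level statistics such as the row count can be computed and then read back through Spark SQL. The table name my_table below is hypothetical, and spark is assumed to be the SparkSession:

// compute catalog statistics (row count, size in bytes) for the table
spark.sql("ANALYZE TABLE my_table COMPUTE STATISTICS")

// the "Statistics" entry of the extended description shows the stored values
spark.sql("DESCRIBE EXTENDED my_table").show(100, false)

Note that these are catalog statistics for the table as a whole (Hive partitions can be analyzed separately with ANALYZE TABLE ... PARTITION(...)), not counts for the in-memory Spark partitions that the code above reports.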
