Fast Parquet row count in Spark


Problem description

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151).

I tried this in spark-shell:

sqlContext.read.load("x.parquet").count

And Spark ran two stages, showing various aggregation steps in the DAG. I figure this means it reads through the file normally instead of using the row counts. (I could be wrong.)
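One way to check this, rather than eyeballing the DAG, is to print the planned query. This is a minimal sketch for spark-shell; groupBy().count() is just a way to get a plannable DataFrame out of the count, and explain(true) prints the logical and physical plans:

sqlContext.read.load("x.parquet").groupBy().count().explain(true)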

The question is: Is Spark already using the row count fields when I run count? Is there another API to use those fields? Is relying on those fields a bad idea for some reason?

Answer

That is correct, Spark is already using the row count field when you are running count.

Diving into the details a bit, SpecificParquetRecordReaderBase.java references the "Improve Parquet scan performance when using flat schemas" commit as part of [SPARK-11787] Speed up parquet reader for flat schemas. Note that this commit was included as part of the Spark 1.6 branch.

If the query is a row count, it pretty much works the way you described it (i.e. reading the metadata). If the predicates are fully satisfied by the min/max values, that should work as well, though that path is not as fully verified. It's not a bad idea to use those Parquet fields, but as implied in the previous statement, the key issue is to ensure that the predicate filtering matches the metadata so you are doing an accurate count.
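As a side note, the same per-block row counts can be read straight off the Parquet footers, without going through Spark at all. The sketch below uses the parquet-hadoop API that ships with Spark; the package name (org.apache.parquet.hadoop, i.e. Parquet 1.7+) and the assumption that "x.parquet" is a directory of part files are mine, so treat it as illustrative rather than a drop-in utility:

import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val conf = sc.hadoopConfiguration
val dir  = new Path("x.parquet")
val fs   = dir.getFileSystem(conf)

// Sum the row counts recorded in every block (row group) of every part file.
val totalRows = fs.listStatus(dir)
  .filter(_.getPath.getName.endsWith(".parquet"))
  .map(status => ParquetFileReader.readFooter(conf, status.getPath))
  .flatMap(_.getBlocks.asScala)
  .map(_.getRowCount)
  .sum

println(s"row count from footers: $totalRows")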

To help understand why there are two stages, here's the DAG created when running the count() statement.

When digging into the two stages, notice that the first one (Stage 25) is running the file scan while the second stage (Stage 26) runs the shuffle for the count.

Thanks to Nong Li (the author of the SpecificParquetRecordReaderBase.java commit) for validating!

 

To provide additional context on the bridge between Dataset.count and Parquet, the flow of the internal logic surrounding this is:

  • Spark does not read any Parquet columns to calculate the count.
  • The Parquet schema passed to the VectorizedParquetRecordReader is actually an empty Parquet message.
  • Computing the count using the metadata stored in the Parquet file footers involves wrapping the above within an iterator that returns an InternalRow per InternalRow.scala.

To work with the Parquet file format internally, Apache Spark wraps the logic with an iterator that returns an InternalRow; more information can be found in InternalRow.scala. Ultimately, the count() aggregate function interacts with the underlying Parquet data source using this iterator. By the way, this is true for both the vectorized and non-vectorized Parquet readers.
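The following is a conceptual sketch only, in plain Scala and not Spark's actual internals: the idea is that the reader emits one empty row per footer-counted row, so the downstream count() aggregate sees the right number of rows without any column data ever being materialized.

// Conceptual stand-in: Spark emits an InternalRow with no columns; an empty
// Array[Any] plays that role here purely for illustration.
def metadataOnlyRows(rowCount: Long): Iterator[Array[Any]] = new Iterator[Array[Any]] {
  private var emitted = 0L
  private val emptyRow = Array.empty[Any]
  def hasNext: Boolean = emitted < rowCount
  def next(): Array[Any] = { emitted += 1; emptyRow }
}

metadataOnlyRows(1000000L).size  // 1000000 "rows" counted, no column values read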

Therefore, to bridge Dataset.count() with the Parquet reader, the path is:

  • The Dataset.count() call is planned into an aggregate operator with a single count() aggregate function.
  • Java code is generated at planning time for the aggregate operator as well as the count() aggregate function (see the sketch after this list).
  • The generated Java code interacts with the underlying data source ParquetFileFormat through a RecordReaderIterator, which is used internally by the Spark data source API.
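To see that plan and the generated Java for yourself, here is a hedged sketch: explain() is a stable DataFrame method, while debugCodegen() assumes a Spark 2.x shell, where it is provided by org.apache.spark.sql.execution.debug.

val counted = spark.read.parquet("x.parquet").groupBy().count()
counted.explain()        // shows the count() aggregate planned over the Parquet scan

import org.apache.spark.sql.execution.debug._
counted.debugCodegen()   // dumps the Java code generated for the aggregate and scan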

For more information, please refer to Parquet Count Metadata Explanation.
