Efficient reading of a nested Parquet column in Spark


Problem description

I have the following (simplified) schema:

root
 |-- event: struct (nullable = true)
 |    |-- spent: struct (nullable = true)
 |    |    |-- amount: decimal(34,3) (nullable = true)
 |    |    |-- currency: string (nullable = true)
 |    |
 |    | ... ~ 20 other struct fields on "event" level

I'm trying to sum over a nested field:

spark.sql("select sum(event.spent.amount) from event")

According to the Spark metrics, I'm reading 18 GB from disk and it takes 2.5 minutes.

However, when I select the top-level field:

spark.sql("select sum(amount) from event")

I read only 2 GB in 4 seconds.

From the physical plan I can see that, in the case of the nested structure, the whole event struct with all its fields is read from Parquet, which is wasteful.
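For reference, what is actually read can be checked from the physical plan; a minimal sketch (the ReadSchema entry of the FileScan parquet node lists the columns and nested fields Spark will read):

// Print the extended plans; look for ReadSchema in the
// FileScan parquet node of the physical plan.
spark.sql("select sum(event.spent.amount) from event").explain(true)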

The Parquet format should be able to provide the desired column from a nested structure without reading it all (which is the point of a columnar store). Is there some way to do this efficiently in Spark?

Accepted answer

Solution:

spark.sql("set spark.sql.optimizer.nestedSchemaPruning.enabled=true")
spark.sql("select sum(amount) from (select event.spent.amount as amount from event_archive)")

The query must be written in sub-select fashion: you can't wrap the selected column in an aggregate function directly. The following query will break schema pruning:

select sum(event.spent.amount) as amount from event

See SPARK-4502.
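The same sub-select shape can also be written with the DataFrame API; a minimal sketch, assuming the table is registered as event_archive as in the accepted query:

import org.apache.spark.sql.functions.{col, sum}

// Project the nested field into a top-level column first,
// then aggregate, mirroring the sub-select form above.
val amounts = spark.table("event_archive")
  .select(col("event.spent.amount").as("amount"))
amounts.agg(sum(col("amount"))).show()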

A dirty workaround can also be specifying a "projected schema" at load time:

import org.apache.spark.sql.types._

// Projected schema containing only the fields we need.
// The declared type must match what is stored in the files
// (the schema above declares amount as decimal(34,3)).
val amountType = DataTypes.createDecimalType(34, 3)
val schema = StructType(
  StructField("event", StructType(
    StructField("spent", StructType(
      StructField("amount", amountType, true) :: Nil
    ), true) :: Nil
  ), true) :: Nil
)
val df = spark.read.format("parquet").schema(schema).load(<path>)
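With the pruned schema in place, the aggregation only touches the single nested column; a minimal usage sketch (the view name event is an assumption):

// Register the narrowed DataFrame and run the aggregation;
// only event.spent.amount is present in the loaded schema.
df.createOrReplaceTempView("event")
spark.sql("select sum(event.spent.amount) from event").show()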

