Why does the query performance differ with nested columns in Spark SQL?


Question

I write some data in the Parquet format using Spark SQL where the resulting schema looks like the following:

root
|-- stateLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- countryLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- global: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)

I can also transform the same data into a more flat schema that looks like this:

root
|-- stateLevelCount1: integer (nullable = false)
|-- stateLevelCount2: integer (nullable = false)
|-- stateLevelCount3: integer (nullable = false)
|-- stateLevelCount4: integer (nullable = false)
|-- stateLevelCount5: integer (nullable = false)
|-- countryLevelCount1: integer (nullable = false)
|-- countryLevelCount2: integer (nullable = false)
|-- countryLevelCount3: integer (nullable = false)
|-- countryLevelCount4: integer (nullable = false)
|-- countryLevelCount5: integer (nullable = false)
|-- globalCount1: integer (nullable = false)
|-- globalCount2: integer (nullable = false)
|-- globalCount3: integer (nullable = false)
|-- globalCount4: integer (nullable = false)
|-- globalCount5: integer (nullable = false)
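The mapping between the two schemas is mechanical: each leaf of a top-level struct becomes a camelCase top-level column. A minimal plain-Python sketch of that transformation (the schema dicts here are stand-ins for the Spark schemas above, not Spark API objects):

```python
def flatten_schema(nested):
    """Flatten a {struct_name: {field_name: type}} schema into flat
    camelCase column names, e.g. global.count1 -> globalCount1."""
    flat = {}
    for struct_name, fields in nested.items():
        for field_name, field_type in fields.items():
            # stateLevel + count1 -> stateLevelCount1
            flat[struct_name + field_name[0].upper() + field_name[1:]] = field_type
    return flat

nested = {
    "stateLevel": {f"count{i}": "integer" for i in range(1, 6)},
    "countryLevel": {f"count{i}": "integer" for i in range(1, 6)},
    "global": {f"count{i}": "integer" for i in range(1, 6)},
}

flat = flatten_schema(nested)  # 15 top-level integer columns
```

In Spark itself the same effect is usually achieved with a `select` that aliases each nested field, e.g. selecting `global.count1` as `globalCount1`.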

Now when I run a query on the first data set against a column like global.count1, it takes a lot longer than querying globalCount1 in the second data set. Conversely, writing the first data set to Parquet takes much less time than writing the second. I know that my data is stored in a columnar fashion because of Parquet, but I was under the impression that nested columns would still each be stored individually. In the first data set, for instance, it seems that the whole 'global' struct is stored together, as opposed to the 'global.count1', 'global.count2', etc. values each being stored together. Is this expected behavior?
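For context on the storage question: the Parquet format does shred nested records Dremel-style, so each leaf field gets its own column chunk on disk; 'global.count1' and 'global.count2' are separate columns, and the extra cost on read is reassembling the struct. The leaf-path enumeration can be sketched in plain Python (a conceptual illustration, not Parquet's actual API):

```python
def leaf_paths(schema, prefix=()):
    """Enumerate dotted leaf paths of a nested schema dict; each leaf
    corresponds to its own column chunk in a Parquet file."""
    for name, node in schema.items():
        if isinstance(node, dict):   # struct: recurse into its children
            yield from leaf_paths(node, prefix + (name,))
        else:                        # primitive leaf type
            yield ".".join(prefix + (name,))

nested = {"global": {"count1": "integer", "count2": "integer"}}
print(list(leaf_paths(nested)))  # ['global.count1', 'global.count2']
```

So the slowdown is less likely to come from how the data is laid out than from how it is read back, which the answer below digs into.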

Answer

Interesting. "It takes a lot longer than querying..." Can you please share how much longer? Thanks.

Looking at the code at https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/RecordReaderImplementation.java#L248, it seems that reading from structures might have some overhead. Judging from the Parquet code alone, though, it shouldn't be "a lot longer".

I think the bigger problem is how Spark can push down predicates in such cases. For example, it may not be able to use bloom filters in such cases. Can you please share how you query the data in both cases, along with timings? Which versions of Spark, Parquet, Hadoop, etc.?

Parquet 1.5 had the issue https://issues.apache.org/jira/browse/PARQUET-61, which in some such cases could cause a 4-5x slowdown.
