Why does the query performance differ with nested columns in Spark SQL?


Question

I write some data in the Parquet format using Spark SQL where the resulting schema looks like the following:

root
|-- stateLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- countryLevel: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)
|-- global: struct (nullable = true)
|    |-- count1: integer (nullable = false)
|    |-- count2: integer (nullable = false)
|    |-- count3: integer (nullable = false)
|    |-- count4: integer (nullable = false)
|    |-- count5: integer (nullable = false)

I can also transform the same data into a more flat schema that looks like this:

root
|-- stateLevelCount1: integer (nullable = false)
|-- stateLevelCount2: integer (nullable = false)
|-- stateLevelCount3: integer (nullable = false)
|-- stateLevelCount4: integer (nullable = false)
|-- stateLevelCount5: integer (nullable = false)
|-- countryLevelCount1: integer (nullable = false)
|-- countryLevelCount2: integer (nullable = false)
|-- countryLevelCount3: integer (nullable = false)
|-- countryLevelCount4: integer (nullable = false)
|-- countryLevelCount5: integer (nullable = false)
|-- globalCount1: integer (nullable = false)
|-- globalCount2: integer (nullable = false)
|-- globalCount3: integer (nullable = false)
|-- globalCount4: integer (nullable = false)
|-- globalCount5: integer (nullable = false)
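The mapping between the two schemas is mechanical: each leaf of a top-level struct becomes a camelCase top-level column. A minimal plain-Python sketch of that transformation (the schema dicts here are stand-ins for the Spark schemas above, not Spark API objects):

```python
def flatten_schema(nested):
    """Flatten a {struct_name: {field_name: type}} schema into flat
    camelCase column names, e.g. global.count1 -> globalCount1."""
    flat = {}
    for struct_name, fields in nested.items():
        for field_name, field_type in fields.items():
            # stateLevel + count1 -> stateLevelCount1
            flat[struct_name + field_name[0].upper() + field_name[1:]] = field_type
    return flat

nested = {
    "stateLevel": {f"count{i}": "integer" for i in range(1, 6)},
    "countryLevel": {f"count{i}": "integer" for i in range(1, 6)},
    "global": {f"count{i}": "integer" for i in range(1, 6)},
}

flat = flatten_schema(nested)  # 15 top-level integer columns
```

In Spark itself the same effect is usually achieved with a `select` that aliases each nested field, e.g. selecting `global.count1` as `globalCount1`.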

Now when I run a query on the first data set against a column like global.count1, it takes a lot longer than querying globalCount1 in the second data set. Conversely, writing the first data set to Parquet takes much less time than writing the second. I know that my data is stored in a columnar fashion because of Parquet, but I was under the impression that nested columns would still each be stored individually. In the first data set, for instance, it seems that the whole 'global' struct is stored together, as opposed to the 'global.count1', 'global.count2', etc. values each being stored together. Is this expected behavior?
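For context on the storage question: the Parquet format does shred nested records Dremel-style, so each leaf field gets its own column chunk on disk; 'global.count1' and 'global.count2' are separate columns, and the extra cost on read is reassembling the struct. The leaf-path enumeration can be sketched in plain Python (a conceptual illustration, not Parquet's actual API):

```python
def leaf_paths(schema, prefix=()):
    """Enumerate dotted leaf paths of a nested schema dict; each leaf
    corresponds to its own column chunk in a Parquet file."""
    for name, node in schema.items():
        if isinstance(node, dict):   # struct: recurse into its children
            yield from leaf_paths(node, prefix + (name,))
        else:                        # primitive leaf type
            yield ".".join(prefix + (name,))

nested = {"global": {"count1": "integer", "count2": "integer"}}
print(list(leaf_paths(nested)))  # ['global.count1', 'global.count2']
```

So the slowdown is less likely to come from how the data is laid out than from how it is read back, which the answer below digs into.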

Answer

Interesting. "It takes a lot longer than querying..." Can you please share how much longer? Thanks.

Looking at the code at https://github.com/Parquet/parquet-mr/blob/master/parquet-column/src/main/java/parquet/io/RecordReaderImplementation.java#L248, it seems that reading from structures might have some overhead. Judging from the Parquet code alone, though, it shouldn't be "a lot longer".

I think the bigger problem is how Spark can push down predicates in such cases. For example, it may not be able to use bloom filters in such cases. Can you please share how you query the data in both cases, along with timings? Which versions of Spark, Parquet, Hadoop, etc.?

Parquet 1.5 had the issue https://issues.apache.org/jira/browse/PARQUET-61, which in some such cases could cause a 4-5x slowdown.
