Why are Spark Parquet files for an aggregate larger than the original?


Problem description

I am trying to create an aggregate file for end users to utilize, to avoid having them process multiple sources with much larger files. To do that I: A) iterate through all source folders, stripping out the 12 most commonly requested fields and spinning out Parquet files in a new location where these results are co-located; B) go back through the files created in step A and re-aggregate them, grouping by the 12 fields to reduce the data to a summary row for each unique combination.

What I'm finding is that step A reduces the payload 5:1 (roughly 250 gigs becomes 48.5 gigs). Step B, however, instead of reducing this further, increases it by 50% over step A. My row counts match, though.

This is using Spark 1.5.2.
My code, modified only to replace the field names with field1...field12 to make it more readable, is below along with the results I've noted.

While I don't necessarily expect another 5:1 reduction, I don't know what I'm doing incorrectly that increases the storage size for fewer rows with the same schema. Can anyone help me understand what I did wrong?

Thanks!

//for each eventName found in separate source folders, do the following:
//spit out one row with key fields from the original dataset for quicker availability to clients 
//results in a 5:1 reduction in size
val sqlStatement = "Select field1, field2, field3, field4, field5, field6, field7, field8, field9, field10, field11, field12, cast(1 as bigint) as rCount from table"
sqlContext.sql(sqlStatement).coalesce(20).write.parquet("<aws folder>" + dt + "/" + eventName + "/")
//results in over 700 files with a total of  16,969,050,506 rows consuming 48.65 gigs of storage space in S3, compressed 

//after all events are processed, aggregate the results
val sqlStatement = "Select field1, field2, field3, field4, field5, field6, field7, field8, field9, field10, field11, field12, sum(rCount) as rCount from results group by field1, field2, field3, field4, field5, field6, field7, field8, field9, field10, field11, field12"
//Use a wildcard to search all sub-folders created above
sqlContext.read.parquet("<aws folder>" + dt + "/*/").registerTempTable("results")
sqlContext.sql(sqlStatement).coalesce(20).saveAsParquetFile("<a new aws folder>" + dt + "/")
//This results in  3,295,206,761 rows with an aggregate value of 16,969,050,506 for rCount but consumes 79.32 gigs of storage space in S3, compressed

//The parquet schemas created (both tables match):
 |-- field1: string (nullable = true) (10 characters)
 |-- field2: string (nullable = true) (15 characters)
 |-- field3: string (nullable = true) (50 characters max)
 |-- field4: string (nullable = true) (10 characters)
 |-- field5: string (nullable = true) (10 characters)
 |-- field6: string (nullable = true) (10 characters)
 |-- field7: string (nullable = true) (16 characters)
 |-- field8: string (nullable = true) (10 characters)
 |-- field9: string (nullable = true)  (15 characters)
 |-- field10: string (nullable = true) (20 characters)
 |-- field11: string (nullable = true) (14 characters)
 |-- field12: string (nullable = true) (14 characters)
 |-- rCount: long (nullable = true)   
 |-- dt: string (nullable = true)

Answer

In general, columnar storage formats like Parquet are highly sensitive to data distribution (how the data is organized) and to the cardinality of individual columns. The more organized the data and the lower the cardinality, the more efficient the storage.

An aggregation like the one you apply has to shuffle the data. When you check the execution plan you'll see that it uses a hash partitioner, which means that after aggregation the distribution can be less efficient than that of the original data. At the same time, sum can reduce the number of rows but increases the cardinality of the rCount column.
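For example, a minimal way to confirm the shuffle, using the aggregation query from the question (the variable name aggregated is just for illustration):

// Sketch: inspect the physical plan of the GROUP BY query before writing it out.
// In Spark 1.5.2 the plan should contain an Exchange step with hash partitioning,
// which is the shuffle described above.
val aggregated = sqlContext.sql(sqlStatement)
aggregated.explain()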

You can try different tools to correct for that, but not all of them are available in Spark 1.5.2 (a rough sketch of the first two options follows the list):

  • Sort the complete dataset by columns with low cardinality (quite expensive due to the full shuffle) or use sortWithinPartitions.
  • Use the partitionBy method of DataFrameWriter to partition the data by low-cardinality columns.
  • Use the bucketBy and sortBy methods of DataFrameWriter (Spark 2.0.0+) to improve data distribution using bucketing and local sorting.
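Here is a minimal sketch of the first two options, reusing the query and placeholder paths from the question. Which of the 12 fields actually have low cardinality is not stated, so field1 and field2 below are stand-ins; also note that sortWithinPartitions requires Spark 1.6+, which is why a plain sort is shown here.

val aggregated = sqlContext.sql(sqlStatement)

// Option 1: globally sort by (assumed) low-cardinality columns before writing,
// so Parquet's run-length/dictionary encoding sees long runs of repeated values.
aggregated.sort("field1", "field2").write.parquet("<a new aws folder>" + dt + "/")

// Option 2: partition the output by an (assumed) low-cardinality column;
// each field1 value gets its own directory and no longer has to be stored per row.
// (In practice this would target a separate output location.)
aggregated.write.partitionBy("field1").parquet("<a new aws folder>" + dt + "/")

Either approach trades extra shuffle and write time for better-organized output files; bucketBy and sortBy would only become an option after upgrading to Spark 2.0+.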

