Spark DataFrames with Parquet and Partitioning

Problem Description

I have not been able to find much information on this topic, but let's say we use a DataFrame to read in a Parquet file that spans 10 blocks; Spark will naturally create 10 partitions. But when the DataFrame reads the file in to process it, won't it be processing a large data-to-partition ratio? If it were processing the file uncompressed, the data would span many more blocks, and hence many more partitions.

So let me clarify: with compressed Parquet (these numbers are not fully accurate), 1 GB of Parquet = 5 blocks = 5 partitions, which might decompress to 5 GB, i.e. 25 blocks / 25 partitions. But unless you repartition the 1 GB Parquet file, you will be stuck with just 5 partitions, when optimally it would be 25 partitions? Or is my logic wrong?

Would it make sense to repartition to increase speed? Or am I thinking about this wrong? Can anyone shed some light on this?

Assumptions:


  • 1 block = 1 partition in Spark

  • 1 core operates on 1 partition
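
For concreteness, a minimal sketch of how the partition count can be checked and changed (the SparkContext `sc`, the path, and the target of 25 partitions are only placeholders for this example):

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext `sc`; path and numbers are hypothetical.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.parquet("/data/events.parquet")

    // Spark creates roughly one partition per HDFS block of the compressed file,
    // e.g. 5 partitions for a 1 GB / 5-block Parquet file.
    println(s"partitions after read: ${df.rdd.partitions.length}")

    // Explicitly repartition if more parallelism is wanted downstream.
    val repartitioned = df.repartition(25)
    println(s"partitions after repartition: ${repartitioned.rdd.partitions.length}")

Note that repartition(25) triggers a full shuffle, so it only pays off when the downstream work per partition is heavy enough to justify it.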

Recommended Answer

A Spark DataFrame doesn't load Parquet files into memory. It uses the Hadoop/HDFS API to read them during each operation, so the optimal number of partitions depends on the HDFS block size (which is different from the Parquet block size!).
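
The two sizes the answer distinguishes both live in the Hadoop configuration that Spark carries around. A quick way to inspect them might look like the sketch below (key names as in stock Hadoop/Parquet; the 128 MB value is only a fallback shown when the key is unset):

    // Sketch: inspect the two different "block sizes" mentioned above.
    val hadoopConf = sc.hadoopConfiguration

    // HDFS block size: how the file is physically split into blocks on HDFS.
    val hdfsBlockSize = Option(hadoopConf.get("dfs.blocksize")).getOrElse("134217728 (default)")

    // Parquet block size: the row-group size Parquet files are written with.
    val parquetBlockSize = Option(hadoopConf.get("parquet.block.size")).getOrElse("134217728 (default)")

    println(s"dfs.blocksize = $hdfsBlockSize, parquet.block.size = $parquetBlockSize")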

Spark 1.5 DataFrames partition a Parquet file as follows:


  • 1 partition per HDFS block

  • If the HDFS block size is smaller than the Parquet block size configured in Spark, a partition will be created for multiple HDFS blocks, so that the total size of the partition is no less than the Parquet block size
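As a rough illustration of the second rule, one common approach (a sketch, not part of the original answer) is to write Parquet with a row-group size that matches the HDFS block size, so that the read-side rules above yield one reasonably sized partition per block. The 128 MB figure and the path below are assumptions:

    // Sketch: align the Parquet row-group size with the HDFS block size when writing,
    // so that "1 partition per HDFS block" produces well-sized partitions on read.
    val blockSize = 128 * 1024 * 1024  // assumed 128 MB HDFS block size

    sc.hadoopConfiguration.setInt("dfs.blocksize", blockSize)
    sc.hadoopConfiguration.setInt("parquet.block.size", blockSize)

    df.write.parquet("/data/events_aligned.parquet")   // `df` and `sqlContext` as in the earlier sketch

    // Reading it back should give roughly one partition per 128 MB HDFS block.
    val reread = sqlContext.read.parquet("/data/events_aligned.parquet")
    println(s"partitions: ${reread.rdd.partitions.length}")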
