Spark: difference when read in .gz and .bz2
Problem Description
I normally read and write files in Spark using .gz, where the number of files should be the same as the number of RDD partitions. I.e. one giant .gz file will read into a single partition. However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?
Also, how do I know how many partitions there would be when Hadoop reads in one bz2 file? Thanks!
Answer
However, if I read in one single .bz2, would I still get one single giant partition? Or will Spark automatically split one .bz2 into multiple partitions?
If you specify n partitions to read a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is set to sc.defaultParallelism. The number of partitions is the second argument in the call to textFile (docs).
one giant .gz file will read into a single partition.
Please note that you can always do

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

to get the desired number of partitions after the file has been read.
Also, how do I know how many partitions there would be when Hadoop reads in one bz2 file?
That would be yourRDD.partitions.size for the Scala API or yourRDD.getNumPartitions() for the Python API.