Spark: difference when reading in .gz and .bz2


Question

I normally read and write files in Spark using .gz, in which case the number of files should match the number of RDD partitions; i.e. one giant .gz file is read into a single partition. However, if I read in a single .bz2 file, would I still get one single giant partition, or will Spark automatically split one .bz2 into multiple partitions?

Also, how do I know how many partitions there will be when Hadoop reads in a single bz2 file? Thanks!

Answer

    However, if I read in a single .bz2 file, would I still get one single giant partition, or will Spark automatically split one .bz2 into multiple partitions?

If you specify n partitions when reading a bzip2 file, Spark will spawn n tasks to read the file in parallel. The default value of n is sc.defaultParallelism. The number of partitions is passed as the second argument in the call to textFile (docs).
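
For example, a minimal sketch in Scala, assuming a live SparkContext named sc and a placeholder HDFS path:

// the second argument is the minimum number of partitions; it defaults to sc.defaultParallelism
val lines = sc.textFile("hdfs:///data/input.bz2", 8)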

    one giant .gz file is read into a single partition.

Note that you can always do

sc.textFile(myGiantGzipFile).repartition(desiredNumberOfPartitions)

to get the desired number of partitions after the file has been read.
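
Keep in mind that repartition triggers a full shuffle of the data across the cluster. With a gzip file this is unavoidable if you want parallelism downstream, since the gzip codec is not splittable and the initial read always lands in a single partition.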


    Also, how do I know how many partitions there will be when Hadoop reads in a single bz2 file?

That would be yourRDD.partitions.size for the Scala API or yourRDD.getNumPartitions() for the Python API.
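
For instance, continuing the sketch above (the exact count may exceed the hint, since the second argument to textFile is only a minimum):

// inspect how many partitions Spark actually created
println(lines.partitions.size)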
