Spark textFile vs wholeTextFiles


Question

I understand the basic theory: textFile generates partitions for each file, while wholeTextFiles generates a pair RDD in which the key is the path of each file and the value is the content of that file.

Now, from a technical point of view, what is the difference between the following two snippets?

val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions

val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions

In both cases I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what is its benefit over textFile?

Recommended answer

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a pair RDD with the file path as the key and the file content as the value. If there is no need to separate the data by file, simply use textFile.
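To make the difference in element types concrete, here is a minimal sketch. It assumes a local master, an app name chosen for illustration, and the `my/path/*.csv` glob from the question; `nonEmptyLineCount` is a hypothetical helper, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ElementTypes {
  // Hypothetical helper: number of non-empty lines in one file's content.
  def nonEmptyLineCount(content: String): Int =
    content.split("\r?\n").count(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name, chosen only for a quick experiment.
    val conf = new SparkConf().setAppName("element-types").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // One element per line; the originating file is not part of the element.
      val lines: RDD[String] = sc.textFile("my/path/*.csv", 8)

      // One element per file: (file path, entire file content).
      val files: RDD[(String, String)] = sc.wholeTextFiles("my/path/*.csv", 8)

      println(s"total lines: ${lines.count()}")
      println(s"total files: ${files.count()}")

      // A per-file statistic is straightforward because the path is the key.
      files.mapValues(nonEmptyLineCount).collect().foreach {
        case (path, n) => println(s"$path: $n non-empty lines")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With textFile alone, the per-file count at the end would need extra work, because the elements carry no file information.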

When reading uncompressed files, textFile splits the data into chunks of roughly 32 MB (the exact split size depends on the file system's block size and the minPartitions hint). This is advantageous from a memory perspective, since no task has to hold a whole file at once. It also means that the lines of one file can be spread across several partitions and are no longer tied to their source file; if each file must be processed as a unit with its line order intact, use wholeTextFiles.
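One way to check how textFile actually split the input is to count the lines in each partition. The sketch below assumes a local master and the `my/path/*.csv` glob from the question; `describe` is a hypothetical formatting helper. Note that the second argument to textFile is a minimum, so the reported partition count may be higher than 8.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSizes {
  // Hypothetical helper: human-readable summary of one partition.
  def describe(index: Int, lineCount: Int): String =
    s"partition $index: $lineCount lines"

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name, chosen only for a quick experiment.
    val conf = new SparkConf().setAppName("partition-sizes").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The second argument is minPartitions, a lower bound rather than an exact count.
      val lines = sc.textFile("my/path/*.csv", 8)
      println(s"number of partitions: ${lines.getNumPartitions}")

      // Count the lines held by each partition without moving the data itself.
      val sizes = lines
        .mapPartitionsWithIndex((index, it) => Iterator((index, it.size)))
        .collect()

      sizes.foreach { case (index, n) => println(describe(index, n)) }
    } finally {
      sc.stop()
    }
  }
}
```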

wholeTextFiles reads the complete content of each file at once; a file's content cannot be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file sits on a single machine, which makes it harder to distribute the load.
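As an illustration of why whole-file records help when order matters, the following sketch tags every line with its source file and its position in that file. It assumes a local master and the `my/path/*.csv` glob from the question; `numberLines` is a hypothetical helper. Because each file arrives as a single string, it only suits files that fit comfortably in one executor's memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerFileOrder {
  // Hypothetical helper: tag each line with its file path and 0-based position in that file.
  def numberLines(path: String, content: String): Seq[(String, Int, String)] =
    content.split("\r?\n").toSeq.zipWithIndex.map { case (line, i) => (path, i, line) }

  def main(args: Array[String]): Unit = {
    // Assumption: local master and app name, chosen only for a quick experiment.
    val conf = new SparkConf().setAppName("per-file-order").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Each record is (file path, entire file content), so the whole file is in memory at once.
      val files: RDD[(String, String)] = sc.wholeTextFiles("my/path/*.csv", 8)

      // Expand every file into (path, line number, line) records.
      val numbered: RDD[(String, Int, String)] =
        files.flatMap { case (path, content) => numberLines(path, content) }

      numbered.take(5).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```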
