Spark textFile vs wholeTextFiles
Question
I understand the basic theory: textFile generates a partition for each file, while wholeTextFiles generates an RDD of pairs, where the key is the path of each file and the value is the content of that file.
Now, from a technical point of view, what's the difference between:
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
Accepted answer
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data by file, simply use textFile.
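To make the element types concrete, here is a minimal sketch. It assumes an existing SparkContext named sc and reuses the question's hypothetical glob "my/path/*.csv"; it is illustrative, not a definitive implementation:

```scala
// Sketch only: assumes an existing SparkContext `sc` and the
// hypothetical glob "my/path/*.csv" from the question.

// textFile: RDD[String] — one element per line, pooled across all files,
// with no record of which file a line came from.
val lines = sc.textFile("my/path/*.csv", 8)
lines.take(3).foreach(println)

// wholeTextFiles: RDD[(String, String)] — one element per file,
// keyed by the file's path, with the full file content as the value.
val files = sc.wholeTextFiles("my/path/*.csv", 8)
files.take(1).foreach { case (path, content) =>
  println(s"$path -> ${content.length} chars")
}
```

The key difference shows up in the element type: `RDD[String]` versus `RDD[(String, String)]`.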
When reading uncompressed files with textFile, Spark splits the data into chunks of 32 MB, which is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, wholeTextFiles should be used.
wholeTextFiles will read the complete content of each file at once; it won't be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file sits on a single machine, making it harder to distribute the load.
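The per-file trade-off above can be sketched as follows. This is a hedged example (again assuming `sc` and the question's glob): because each file arrives as one whole String, line order within a file is preserved, but every file must fit in a single executor's memory:

```scala
// Sketch: per-file processing with wholeTextFiles.
// Assumes an existing SparkContext `sc`; the glob is the question's example.
val perFile = sc.wholeTextFiles("my/path/*.csv")
  .mapValues(_.split("\n"))      // lines of one file, original order kept
  .map { case (path, fileLines) =>
    (path, fileLines.length)     // e.g. a per-file line count
  }

perFile.collect().foreach { case (path, n) =>
  println(s"$path: $n lines")
}
```

Since the whole content of a file is one value on one machine, this pattern suits many small files; for a few very large files, textFile distributes the work better.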