Spark textFile vs wholeTextFiles
Question
I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file and the value is the content of that file.
Now, from a technical point of view, what's the difference between:
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
Answer
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
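A minimal sketch of the difference in element types (the paths and example values are hypothetical):

// textFile: RDD[String], one element per line across all matched files
val lines = sc.textFile("my/path/*.csv")
lines.first()                          // e.g. "col1,col2" -- a single line

// wholeTextFiles: RDD[(String, String)], one (path, content) pair per file
val files = sc.wholeTextFiles("my/path/*.csv")
files.keys.first()                     // e.g. "hdfs://.../my/path/a.csv"
files.mapValues(_.length).collect()    // content size per file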
When reading uncompressed files with textFile, it will split the data into chunks of 32MB. This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.
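Because each file arrives as a single (path, content) record, its lines can be split and numbered in their original order. A sketch under that assumption (paths hypothetical):

val numberedLines = sc.wholeTextFiles("my/path/*.csv").flatMap {
  case (path, content) =>
    // splitting the whole file keeps line order; zipWithIndex records it
    content.split("\n").zipWithIndex.map { case (line, idx) => (path, idx, line) }
}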
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
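One case where that cost pays off is per-file parsing that textFile cannot do reliably, such as dropping each file's own header row. A sketch (the header-skipping use case is an illustration, not part of the answer):

// skip the first line of every file; with textFile the header lines
// would be indistinguishable from ordinary records once the files merge
val noHeaders = sc.wholeTextFiles("my/path/*.csv").flatMap {
  case (_, content) => content.split("\n").drop(1)
}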