Spark textFile vs wholeTextFiles
Question
I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file and the value is the content of that file.
Now, from a technical point of view, what's the difference between:
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
Answer
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
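A minimal sketch of the difference in element types (the paths and example values are hypothetical):

// textFile: RDD[String], one element per line across all matched files
val lines = sc.textFile("my/path/*.csv")
lines.first()                          // e.g. "col1,col2" -- a single line

// wholeTextFiles: RDD[(String, String)], one (path, content) pair per file
val files = sc.wholeTextFiles("my/path/*.csv")
files.keys.first()                     // e.g. "hdfs://.../my/path/a.csv"
files.mapValues(_.length).collect()    // content size per file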
When reading uncompressed files with textFile, it will split the data into chunks of 32MB. This is advantageous from a memory perspective. It also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.
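Because each file arrives as a single (path, content) record, its lines can be split and numbered in their original order. A sketch under that assumption (paths hypothetical):

val numberedLines = sc.wholeTextFiles("my/path/*.csv").flatMap {
  case (path, content) =>
    // splitting the whole file keeps line order; zipWithIndex records it
    content.split("\n").zipWithIndex.map { case (line, idx) => (path, idx, line) }
}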
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
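One case where that cost pays off is per-file parsing that textFile cannot do reliably, such as dropping each file's own header row. A sketch (the header-skipping use case is an illustration, not part of the answer):

// skip the first line of every file; with textFile the header lines
// would be indistinguishable from ordinary records once the files merge
val noHeaders = sc.wholeTextFiles("my/path/*.csv").flatMap {
  case (_, content) => content.split("\n").drop(1)
}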