How does the number of partitions affect `wholeTextFiles` and `textFile`?


Question

In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure when to use which. Here is what I know so far:


  • When dealing with files that are not meant to be split by line, use wholeTextFiles; otherwise use textFile.

I would think that, by default, wholeTextFiles partitions by file and textFile partitions by line. However, both of them let you change the minPartitions parameter.

So, how does changing the partitions affect how these are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with wholeTextFiles and 100 partitions, and processing it with textFile (which partitions it line by line) using 100 partitions as well?

What is the difference between these?

Recommended answer

For reference, wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat (https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html).

A couple of notes on wholeTextFiles:

  • Each record in the RDD returned by wholeTextFiles has the file name and the entire contents of the file. This means that a file cannot be split (at all).
  • Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition.
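
To make the two record shapes concrete, here is a plain-Python analogy. It does not use Spark at all, and the helper names `whole_file_records` and `line_records` are made up for this illustration. It only mimics what each API hands back: one (path, content) pair per file versus one record per line.

```python
import os
import tempfile


def whole_file_records(directory):
    """Mimic the record shape of wholeTextFiles: one (path, content) pair per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            records.append((path, handle.read()))
    return records


def line_records(directory):
    """Mimic the record shape of textFile: one record per line, file names dropped."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as handle:
            records.extend(handle.read().splitlines())
    return records


def demo():
    """Write two small files to a temp directory and return both views of them."""
    with tempfile.TemporaryDirectory() as directory:
        for name, text in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
            with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
                handle.write(text)
        return whole_file_records(directory), line_records(directory)


whole, lines = demo()
print(len(whole), "whole-file records;", len(lines), "line records")
```

Because a whole-file record carries the entire file, there is nothing smaller than a file to hand to a separate partition, which is why the number of files caps the useful parallelism.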

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead.

Now if I were to set minPartitions=3, I would still get back two partitions, because the contract for wholeTextFiles is that each record in the RDD contains an entire file. By the same reasoning, a single very large file read with wholeTextFiles ends up as one record in one partition, however many partitions you ask for.
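
The two-file scenario can be sketched with a small pure-Python model. This is a simplification under stated assumptions, not Spark's or Hadoop's actual code: it assumes the size cap per partition is the total input size divided by minPartitions (rounded up), that files are packed in order until the cap is reached, and it ignores node and rack locality. The function name `pack_whole_files` and the byte sizes are hypothetical.

```python
import math


def pack_whole_files(file_sizes, min_partitions):
    """Simplified model: group unsplittable files into partitions.

    Assumption: the size cap is ceil(total_bytes / min_partitions). Files are
    added to the current partition until its size reaches the cap; a file is
    never split across partitions.
    """
    total = sum(file_sizes)
    cap = math.ceil(total / max(min_partitions, 1))
    partitions, current, current_size = [], [], 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= cap:
            # Cap reached: close this partition and start a new one.
            partitions.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files that never reached the cap form a final partition.
        partitions.append(current)
    return partitions


two_small_files = [10, 10]  # hypothetical file sizes in bytes
for requested in (1, 2, 3):
    result = pack_whole_files(two_small_files, requested)
    print(requested, "->", len(result), "partition(s)")
```

In this model, asking for 1, 2, and 3 partitions gives 1, 2, and 2 partitions respectively, matching the behaviour described above. On a real cluster you can check the actual count with something like `sc.wholeTextFiles(path, minPartitions=3).getNumPartitions()`, where `path` stands for your own input directory.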
