What are the differences between slices and partitions of RDDs?


Problem Description


I am using Spark's Python API and running Spark 0.8.

I am storing a large RDD of floating point vectors and I need to perform calculations of one vector against the entire set.

Is there any difference between slices and partitions in an RDD?

When I create the RDD, I pass it 100 as a parameter, which causes it to store the RDD as 100 slices and to create 100 tasks when performing the calculations. I want to know whether partitioning the data would improve performance beyond the slicing by enabling the system to process the data more efficiently (i.e., is there a difference between performing operations over a partition versus just operating over every element in the sliced RDD?).

For example, is there any significant difference between these two pieces of code?

rdd = sc.textFile('demo.txt', 100)

vs

rdd = sc.textFile('demo.txt')
rdd.partitionBy(100)

Solution

I believe slices and partitions are the same thing in Apache Spark.

However, there is a subtle but potentially significant difference between the two pieces of code you posted.

This code will attempt to load demo.txt directly into 100 partitions using 100 concurrent tasks:

rdd = sc.textFile('demo.txt', 100)

For uncompressed text, it will work as expected. But if instead of demo.txt you had a demo.gz, you would end up with an RDD with only 1 partition, because reads against gzipped files cannot be parallelized.
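For example, a minimal sketch of the difference (assuming both files exist; the counts are what you would expect to observe, not guaranteed exact values):

plain = sc.textFile('demo.txt', 100)    # splittable uncompressed text: roughly 100 partitions
gzipped = sc.textFile('demo.gz', 100)   # gzip is not splittable: loads as a single partition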

On the other hand, the following code will first open demo.txt into an RDD with the default number of partitions, then it will explicitly repartition the data into 100 partitions that are roughly equal in size.

rdd = sc.textFile('demo.txt')
rdd = rdd.repartition(100)

So in this case, even with a demo.gz you will end up with an RDD with 100 partitions.
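As a quick check (purely illustrative, using the partition-count call described further down, available from Spark 1.1.0):

rdd = sc.textFile('demo.gz')    # loads as a single partition
rdd = rdd.repartition(100)      # shuffles the data into 100 roughly equal partitions
print(rdd.getNumPartitions())   # 100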

As a side note, I replaced your partitionBy() with repartition() since that's what I believe you were looking for. partitionBy() requires the RDD to be an RDD of tuples. Since repartition() is not available in Spark 0.8.0, you should instead be able to use coalesce(100, shuffle=True).
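A minimal sketch of that Spark 0.8.0 workaround (shuffle=True forces a full redistribution of the data, which is essentially what the later repartition() does):

rdd = sc.textFile('demo.txt')
rdd = rdd.coalesce(100, shuffle=True)   # redistributes into 100 partitions even when increasing the count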

Spark can run 1 concurrent task for every partition of an RDD, up to the number of cores in your cluster. So if you have a cluster with 50 cores, you want your RDDs to have at least 50 partitions (and probably 2-3x that).
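As a rough illustration of that rule of thumb (sc.defaultParallelism is assumed here to reflect the total cores available to the application, which depends on your cluster configuration):

# Aim for roughly 2-3x as many partitions as available cores
target = 3 * sc.defaultParallelism
rdd = rdd.repartition(target)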

As of Spark 1.1.0, you can check how many partitions an RDD has as follows:

rdd.getNumPartitions()  # Python API
rdd.partitions.size     // Scala API

Before 1.1.0, the way to do this with the Python API was rdd._jrdd.splits().size().
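A small version-tolerant helper, purely as a sketch (the hasattr check is an assumption about how you might detect the newer API):

def num_partitions(rdd):
    # Prefer the public API (Spark >= 1.1.0); otherwise fall back to
    # the internal JVM handle used before 1.1.0.
    if hasattr(rdd, 'getNumPartitions'):
        return rdd.getNumPartitions()
    return rdd._jrdd.splits().size()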
