Why filter does not preserve partitioning?


Question

This is a quote from jaceklaskowski.gitbooks.io:

Some operations, e.g. map, flatMap, filter, don't preserve partitioning. map, flatMap, filter operations apply a function to every partition.

I don't understand why filter does not preserve partitioning. It's just getting a subset of each partition that satisfies a condition, so I think the partitioning can be preserved. Why isn't that the case?

Answer

You are of course right. The quote is simply incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm:

val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
  new org.apache.spark.HashPartitioner(11)
)

rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

val filteredRDD = rdd.filter(_._1 == 3)
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

rdd.partitioner == filteredRDD.partitioner
// Boolean = true

This is in contrast to operations like map, which don't preserve partitioning (the Partitioner is dropped):

rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
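The Partitioner is dropped here because the function passed to map may change the keys, so Spark can no longer trust the existing key placement. When the keys are left untouched, mapValues (or mapPartitions with preservesPartitioning = true) does keep it; a minimal sketch, reusing the rdd defined above:

// mapValues transforms only the values, so Spark knows the keys
// (and therefore the partitioning) are unchanged:
rdd.mapValues(_ => 1).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

// mapPartitions can also assert that the keys are untouched:
rdd.mapPartitions(_.map(identity), preservesPartitioning = true).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)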

Datasets are a bit more subtle, as filters are normally pushed down, but overall the behavior is similar.
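Datasets don't expose a partitioner field to inspect, but explain() makes the push-down visible. A minimal sketch, assuming a spark-shell session (so spark and the $ interpolator are in scope); the /tmp path is only a placeholder, and the exact plan text varies by Spark version:

val path = "/tmp/filter_demo"  // placeholder location
spark.range(0, 10).selectExpr("id % 3 as key", "id as value")
  .write.mode("overwrite").parquet(path)

spark.read.parquet(path).filter($"key" === 1).explain()
// In the FileScan node of the physical plan, look for something like:
// PushedFilters: [IsNotNull(key), EqualTo(key,1)]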

