spark - filter within map
Question
I am trying to filter inside a map function. Basically, the way I'd do this in classic MapReduce is that the mapper doesn't write anything to the context when the filter criteria are met. How can I achieve something similar with Spark? I can't return null from the map function, as it fails in the shuffle step. I could use the filter function, but that seems like an unnecessary iteration over the data set when I could perform the same task during the map. I could also try to output null with a dummy key, but that's a bad workaround.
Answer
A few options:
rdd.flatMap: flattens a Traversable collection into the RDD. To pick elements, you'll typically return an Option as the result of the transformation.
rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
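As a minimal sketch, the same Option-based pattern can be shown on a plain Scala collection, since an RDD's flatMap behaves analogously; isEven and double below are illustrative stand-ins for the filter predicate and the mapping function f:

```scala
// Hypothetical predicate and transformation standing in for filter/f above.
def isEven(n: Int): Boolean = n % 2 == 0
def double(n: Int): Int = n * 2

val nums = List(1, 2, 3, 4, 5)

// flatMap flattens the Option results: None drops the element, Some keeps it.
val result = nums.flatMap(n => if (isEven(n)) Some(double(n)) else None)
// result == List(4, 8)
```

Only the even elements survive, already transformed, in a single pass.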
rdd.collect(pf: PartialFunction): lets you provide a partial function that both filters and transforms elements of the original RDD. You can use the full power of pattern matching with this method.
rdd.collect{case t if (cond(t)) => f(t)}
rdd.collect{case t:GivenType => f(t)}
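collect with a partial function also exists on plain Scala collections, so the pattern can be sketched without a SparkContext (the sample data and transformations below are made up for illustration). Note that on an RDD, collect(pf) is a transformation, distinct from the no-argument collect action that pulls the RDD's contents to the driver:

```scala
val mixed: List[Any] = List(1, "two", 3, "four")

// Type-based match: keep only the Ints, transform them, drop the rest.
val ints = mixed.collect { case n: Int => n * 10 }
// ints == List(10, 30)

// Guarded pattern: filter by a condition and transform in one pass.
val bigs = List(1, 5, 10, 20).collect { case n if n > 5 => n + 1 }
// bigs == List(11, 21)
```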
As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) may be just as good, and even faster, than the more 'terse' options mentioned above.
where f is the transformation (or map) function.
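To address the "unnecessary iteration" worry from the question: Spark pipelines narrow transformations like map and filter within a single stage, so chaining them does not mean a second pass over the data. A sketch with hypothetical f and cond, shown on a plain Scala list since the combinators are the same:

```scala
// Hypothetical transformation and predicate for illustration.
def f(n: Int): Int = n * 2
def cond(n: Int): Boolean = n > 4

// On an RDD this would be rdd.map(f).filter(cond); Spark fuses both
// steps into one pass over each partition within the stage.
val out = List(1, 2, 3, 4).map(f).filter(cond)
// out == List(6, 8)
```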