spark - filter within map

Question

I am trying to filter inside a map function. Basically, the way I would do it in classic MapReduce is that the mapper writes nothing to the context when the filter criteria are met. How can I achieve something similar with Spark? I can't seem to return null from the map function, as it fails in the shuffle step. I could use the filter function, but that seems like an unnecessary iteration of the data set when I could perform the same task during the map. I could also try to output null with a dummy key, but that's a bad workaround.

Answer

A few options:

rdd.flatMap: rdd.flatMap flattens a Traversable collection into the RDD. To pick elements, you typically return an Option as the result of the transformation.

rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
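
As a minimal, self-contained sketch of this approach (assuming a local SparkContext; the even-number predicate and the squaring function are hypothetical stand-ins for the generic filter and f above):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("flatMap-filter").setMaster("local[*]"))
val rdd = sc.parallelize(1 to 10)

// None drops an element, Some keeps the transformed value, so the
// filtering and the mapping happen in a single pass over the data.
val evenSquares = rdd.flatMap(elem => if (elem % 2 == 0) Some(elem * elem) else None)

println(evenSquares.collect().mkString(", "))  // prints: 4, 16, 36, 64, 100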

rdd.collect(pf: PartialFunction) allows you to provide a partial function that can filter and transform elements from the original RDD in a single step. You can use the full power of pattern matching with this method.

rdd.collect { case t if cond(t) => f(t) }
rdd.collect { case t: GivenType => f(t) }
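
For illustration, here is what both forms look like in practice (reusing the SparkContext sc from the sketch above; the mixed-type input and the uppercasing step are invented for the example). Note that this is the collect(pf) transformation overload, which returns a new RDD, not the collect() action that fetches results to the driver:

// Guard-based filtering: keep the even numbers and square them.
val squares = sc.parallelize(1 to 10).collect { case n if n % 2 == 0 => n * n }

// Type-based filtering: keep only the Strings from a mixed RDD.
val mixed = sc.parallelize(Seq[Any](1, "two", 3, "four"))
val upper = mixed.collect { case s: String => s.toUpperCase }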

As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) might be just as good as, and even faster than, the more 'terse' options mentioned above.

Where f is a transformation (or map) function.
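
On the question's worry about an "unnecessary iteration of the data set": Spark pipelines narrow transformations such as map and filter within a single stage, so chaining them does not cause a second pass over the data. A short sketch (again reusing sc, with stand-in functions):

// map and filter are fused into one stage at execution time,
// so this still makes only a single pass over each partition.
val mappedThenFiltered = sc.parallelize(1 to 10).map(n => n * n).filter(_ % 2 == 0)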
