Create unique values for each key in a Spark RDD


Question

I want to create an RDD of key-value pairs where each key has a unique value. The purpose is to "remember" key indices for later use, since keys might get shuffled around the partitions, and to basically build a lookup table of sorts. I am vectorizing some text and need to create feature vectors, so I have to have a unique value for each key.

I tried zipping a second RDD onto my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
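For context, the zip attempt looked roughly like this (a minimal sketch; the keys and the way the index RDD is built are illustrative assumptions):

```python
from pyspark import SparkContext

sc = SparkContext("local", "zip-attempt")

# Illustrative keys; two partitions to make the pitfall concrete.
keys = sc.parallelize(["apple", "banana", "cherry", "date"], 2)

# Zip a second RDD of candidate indices onto the keys. zip() assumes
# both RDDs have the same number of partitions AND the same number of
# elements per partition; if the partitioning differs, elements get
# lost or the zip fails outright.
indices = sc.parallelize(range(keys.count()), 2)
key_index = keys.zip(indices)

print(key_index.collect())
# [('apple', 0), ('banana', 1), ('cherry', 2), ('date', 3)]
```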

My second attempt is to use a hash generator like the one used in scikit-learn, but I'm wondering if there is some other "spark-native" way of doing this? I'm using PySpark, not Scala...
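For reference, the hashing approach might look something like this sketch, assuming scikit-learn's murmurhash3_32 helper and an illustrative n_features size:

```python
from pyspark import SparkContext
from sklearn.utils import murmurhash3_32

sc = SparkContext("local", "hash-indices")
keys = sc.parallelize(["apple", "banana", "cherry", "date"])

# Map each key to a hashed index, as in the hashing trick scikit-learn
# uses for feature vectorization. Hash collisions are possible, so the
# resulting values are not guaranteed to be unique per key.
n_features = 2 ** 20
key_index = keys.map(lambda k: (k, murmurhash3_32(k, positive=True) % n_features))

print(key_index.collect())
```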

Solution

zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).
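A minimal sketch of how either method yields the desired key-to-index pairs (the keys here are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local", "unique-key-values")
keys = sc.parallelize(["apple", "banana", "cherry", "date"], 2)

# zipWithIndex() assigns consecutive indices 0..n-1. It first runs a
# Spark job to count the elements per partition, so the indices stay
# consecutive across partitions.
print(keys.zipWithIndex().collect())
# [('apple', 0), ('banana', 1), ('cherry', 2), ('date', 3)]

# zipWithUniqueId() skips that extra job: an element in partition k
# gets ids k, k + n, k + 2n, ... (n = number of partitions), so the
# ids are unique but not necessarily consecutive.
print(keys.zipWithUniqueId().collect())
# [('apple', 0), ('banana', 2), ('cherry', 1), ('date', 3)]
```

Either result can then be collected into a dict with collectAsMap() and broadcast to the workers as a lookup table during vectorization.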

If you're using an older version of Spark, you should be able to cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.
