create unique values for each key in a Spark RDD
Problem description
I want to create an RDD of key, value pairs in which each key has a unique value. The purpose is to "remember" key indices for later use, since keys may get shuffled around the partitions, essentially building a lookup table of sorts. I'm vectorizing some text and need feature vectors, so each key must have a unique value.
I tried zipping a second RDD to my RDD of keys, but the problem is that if the two RDDs are not partitioned in exactly the same way, you end up losing elements.
My second attempt is to use a hash generator like the one used in scikit-learn, but I'm wondering if there is some other "Spark-native" way of doing this? I'm using PySpark, not Scala...

Answer

zipWithIndex and zipWithUniqueId were just added to PySpark (https://github.com/apache/spark/pull/2092) and will be available in the forthcoming Spark 1.1.0 release (they're currently available in the Spark master branch).

If you're using an older version of Spark, you should be able to cherry-pick that commit in order to backport these functions, since I think it only adds lines to rdd.py.
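On Spark 1.1.0+ the call itself is just rdd.zipWithUniqueId(), which returns (element, id) pairs without triggering a Spark job. To make the ID scheme concrete, here is a small pure-Python sketch (no Spark required) of how zipWithUniqueId assigns IDs: with n partitions, the i-th element of partition k gets ID i*n + k, so IDs are unique across partitions even when partition sizes differ. The function name and the example data below are illustrative, not part of the Spark API.

```python
def zip_with_unique_id(partitions):
    """Simulate the ID scheme used by Spark's RDD.zipWithUniqueId:
    the i-th element of partition k receives ID i * n + k,
    where n is the total number of partitions."""
    n = len(partitions)
    return [
        [(item, i * n + k) for i, item in enumerate(part)]
        for k, part in enumerate(partitions)
    ]

# Three keys spread unevenly over two partitions
parts = [["apple", "banana"], ["cherry"]]
print(zip_with_unique_id(parts))
# -> [[('apple', 0), ('banana', 2)], [('cherry', 1)]]
```

Note that the resulting IDs are unique but not consecutive or ordered; if you need consecutive indices (0, 1, 2, ...), use zipWithIndex instead, at the cost of an extra pass to count elements per partition.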