How to assign unique contiguous numbers to elements in a Spark RDD
Question
I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm.
The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs.
Right now, I get the distinct users and SKUs, then assign numeric IDs to them outside of Spark.
I was wondering whether there was a better way of doing this. The one approach I've thought of is to write a custom RDD that essentially enumerates 1 through n, then call zip on the two RDDs.
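For reference, the manual "outside of Spark" approach described above can be sketched in plain Python (the data and helper name here are illustrative, not from the original post): collect the distinct strings, then map each one to a small integer.

```python
# Minimal sketch of the manual ID-assignment approach: build a lookup
# table from each distinct string to a contiguous integer, then
# translate the (user, product, review) triples into numeric form.
reviews = [
    ("alice", "SKU-1", 5.0),
    ("bob",   "SKU-2", 3.0),
    ("alice", "SKU-2", 4.0),
]

def assign_ids(values):
    """Map each distinct value to a contiguous integer ID, starting at 0."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

user_ids = assign_ids(u for u, _, _ in reviews)
sku_ids  = assign_ids(s for _, s, _ in reviews)

# Numeric triples, in the shape ALS expects: (userId, productId, rating).
ratings = [(user_ids[u], sku_ids[s], r) for u, s, r in reviews]
```

This works, but it forces the distinct values through the driver, which is exactly what the Spark-side methods in the answer below avoid.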
Answer
Starting with Spark 1.0 there are two methods you can use to solve this easily:
- RDD.zipWithIndex is just like Seq.zipWithIndex: it adds contiguous (Long) numbers. This needs to count the elements in each partition first, so your input will be evaluated twice. Cache your input RDD if you want to use this.
- RDD.zipWithUniqueId also gives you unique Long IDs, but they are not guaranteed to be contiguous. (They will only be contiguous if each partition has the same number of elements.) The upside is that this does not need to know anything about the input, so it will not cause double evaluation.