Creating combination of value list with existing key - Pyspark
Problem description
So my rdd consists of data looking like:
(k, [v1,v2,v3...])
I want to create all two-element combinations of the value part.
So the end map should look like:
(k1, (v1,v2))
(k1, (v1,v3))
(k1, (v2,v3))
I know that to get the value part, I would use something like
rdd.cartesian(rdd).filter(case (a,b) => a < b)
但是,这要求传递整个rdd(对吗?),而不仅仅是价值部分.我不确定如何到达自己想要的终点,我怀疑它是一群人.
However, that requires the entire rdd to be passed (right?) not just the value part. I am unsure how to arrive at my desired end, I suspect its a groupby.
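The full cartesian product is indeed unnecessary here: since each row already carries its own value list, the pairs can be generated per row. A minimal sketch of the pairing logic (the `rdd.flatMapValues(value_pairs)` usage shown in the comment is a hypothetical application, assuming the rdd holds `(k, [v1, v2, ...])` rows):

```python
import itertools

def value_pairs(values):
    # all unordered two-element combinations of one value list,
    # preserving the original order within each pair
    return list(itertools.combinations(values, 2))

# With Spark, this could be applied per key without a cartesian product:
# rdd.flatMapValues(value_pairs)   # yields (k, (v1, v2)) records
```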
Also, ultimately, I want to get to a k,v pair looking like
((k1,v1,v2),1)
I know how to get from what I am looking for to that, but maybe it's easier to go straight there?
Thanks.
Recommended answer
I think Israel's answer is incomplete, so I went a step further.
import itertools

a = sc.parallelize([
    (1, [1, 2, 3, 4]),
    (2, [3, 4, 5, 6]),
    (3, [-1, 2, 3, 4])
])

def combinations(row):
    # expand one (key, value_list) row into (key, (v_i, v_j)) pairs
    k = row[0]
    l = row[1]
    return [(k, v) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(3)
# [(1, (1, 2)), (1, (1, 3)), (1, (1, 4))]
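The answer stops at (k, (v1, v2)) records, while the question ultimately wants ((k, v1, v2), 1). One more mapping step gets there; a minimal sketch with a pure-Python helper (the `flatMap` usage in the comment is a hypothetical application to the `a` rdd above):

```python
import itertools

def pair_counts(row):
    # expand one (key, value_list) row directly into ((key, v_i, v_j), 1)
    # records, ready for e.g. reduceByKey-style counting
    k, values = row
    return [((k, v1, v2), 1) for v1, v2 in itertools.combinations(values, 2)]

# Applied to the rdd from the answer:
# a.flatMap(pair_counts).take(3)
```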