Pyspark RDD: find index of an element
Question
I am new to PySpark and I am trying to convert a Python list to an RDD, after which I need to find an element's index using that RDD. For the first part I am doing:
data = [[1, 2], [1, 4]]  # avoid naming this `list`, which would shadow the built-in
rdd = sc.parallelize(data).cache()
So now the RDD actually holds my list. The thing is that I want to find the index of any arbitrary element, something like the "index" function that works for Python lists. I am aware of a function called zipWithIndex which assigns an index to each element, but I could not find a proper example in Python (there are examples with Java and Scala).

Thanks.
Answer
Use filter and zipWithIndex:

(rdd.zipWithIndex()                       # pair each element with its index
    .filter(lambda kv: kv[0] == [1, 2])  # kv is the (item, index) tuple
    .map(lambda kv: kv[1])               # keep only the index
    .collect())
Note that [1, 2] here can easily be changed to a variable name, and this whole expression can be wrapped within a function.
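A minimal sketch of that wrapping; the helper name find_index is illustrative, not a PySpark API:

def find_index(rdd, element):
    # Return the indices of all occurrences of `element` in `rdd`,
    # using the zipWithIndex/filter/map pattern shown above.
    return (rdd.zipWithIndex()
               .filter(lambda kv: kv[0] == element)
               .map(lambda kv: kv[1])
               .collect())

# find_index(rdd, [1, 4])  ->  [1]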
zipWithIndex simply returns (item, index) pairs, like so:
rdd.zipWithIndex().collect()
> [([1, 2], 0), ([1, 4], 1)]
filter finds only those entries that match a particular criterion (in this case, that the key equals a specific sublist):
rdd.zipWithIndex().filter(lambda kv: kv[0] == [1, 2]).collect()
> [([1, 2], 0)]
map is fairly obvious; we can just get back the index:
(rdd.zipWithIndex()
    .filter(lambda kv: kv[0] == [1, 2])
    .map(lambda kv: kv[1])
    .collect())
> [0]
and then we can simply get the first element by indexing with [0] if you want.
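Putting it all together, a self-contained sketch; the local[*] master and the app name are assumptions for running this outside a shell where sc already exists:

from pyspark import SparkContext

sc = SparkContext("local[*]", "find-index-example")  # assumed local setup

data = [[1, 2], [1, 4]]
rdd = sc.parallelize(data).cache()

indices = (rdd.zipWithIndex()
              .filter(lambda kv: kv[0] == [1, 2])
              .map(lambda kv: kv[1])
              .collect())

print(indices[0])  # prints 0
sc.stop()

Using .first() instead of .collect()[0] also works and avoids pulling back every matching index.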