Pyspark RDD: find index of an element

Question

I am new to pyspark and I am trying to convert a Python list to an RDD, and then I need to find an element's index using the RDD. For the first part I am doing:

data = [[1, 2], [1, 4]]  # avoid naming this "list", which shadows the built-in
rdd = sc.parallelize(data).cache()  # sc is the SparkContext from the PySpark shell

So now the rdd is actually my list. The thing is that I want to find the index of an arbitrary element, something like the index() method that works for Python lists. I am aware of a function called zipWithIndex, which assigns an index to each element, but I could not find a proper example in Python (there are examples for Java and Scala).

Thanks.

Answer

Use filter and zipWithIndex:

(rdd.zipWithIndex()                      # pair each element with its index
    .filter(lambda kv: kv[0] == [1, 2])  # keep pairs whose element matches
    .map(lambda kv: kv[1])               # keep only the index
    .collect())

Note that [1,2] here can easily be changed to a variable name, and this whole expression can be wrapped within a function.
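For example, a minimal sketch of such a helper (the name find_index is just for illustration, not part of the original answer):

def find_index(rdd, element):
    # Pair each item with its position, keep the pairs whose item
    # matches, and return only the matching positions.
    return (rdd.zipWithIndex()
               .filter(lambda kv: kv[0] == element)
               .map(lambda kv: kv[1])
               .collect())

find_index(rdd, [1, 2])
> [0]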

zipWithIndex simply returns tuples of (item, index), like so:

rdd.zipWithIndex().collect()
> [([1, 2], 0), ([1, 4], 1)]

filter finds only those entries that match a particular criterion (in this case, that the key equals a specific sublist):

rdd.zipWithIndex().filter(lambda kv: kv[0] == [1, 2]).collect()
> [([1, 2], 0)]

map is fairly obvious; we can just get back the index:

(rdd.zipWithIndex()
    .filter(lambda kv: kv[0] == [1, 2])
    .map(lambda kv: kv[1])
    .collect())
> [0]

and then we can simply get the first element by indexing with [0] if we want a single value.
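Using the hypothetical find_index helper sketched above, since collect() returns an ordinary Python list:

find_index(rdd, [1, 2])[0]
> 0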
