How Can I Obtain an Element Position in Spark's RDD?


Problem description

I am new to Apache Spark, and I know that the core data structure is RDD. Now I am writing some apps which require element positional information. For example, after converting an ArrayList into a (Java)RDD, for each integer in RDD, I need to know its (global) array subscript. Is it possible to do it?

As far as I know, there is a take(int) function for RDD, so I believe the positional information is still maintained in the RDD.

Recommended answer

Essentially, RDD's zipWithIndex() method seems to do this, but it won't preserve the original ordering of the data the RDD was created from. At least you'll get a stable ordering.

val orig: RDD[String] = ...
val indexed: RDD[(String, Long)] = orig.zipWithIndex()
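
As a concrete sketch of the above (assumptions: a local SparkContext, the spark-core dependency on the classpath, and illustrative data values; none of these come from the original question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    // Local two-core master, purely for demonstration.
    val conf = new SparkConf().setAppName("zipWithIndex-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Split the data across 2 partitions to show that indices are assigned
    // partition by partition, as the API doc quoted below describes.
    val orig = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // Each element is paired with a Long index.
    val indexed = orig.zipWithIndex() // RDD[(String, Long)]

    indexed.collect().foreach(println)
    sc.stop()
  }
}
```

Note that zipWithIndex() triggers a Spark job when the RDD has more than one partition, because Spark must first count the elements in each partition to compute the index offsets.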

The reason you're unlikely to find something that preserves the order in the original data is buried in the API doc for zipWithIndex():

"Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions."

So it looks like the original order is discarded. If preserving the original order is important to you, it looks like you need to add the index before you create the RDD.
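
A minimal sketch of that approach: attach the index with plain Scala's zipWithIndex while the collection still has a well-defined order, and only then hand the pairs to Spark. (The sc.parallelize call assumes an existing SparkContext named sc, which is not shown here.)

```scala
// Index locally, before any partitioning can reorder anything.
val data = List("a", "b", "c")
val withIndex: List[(String, Int)] = data.zipWithIndex
// withIndex == List(("a", 0), ("b", 1), ("c", 2))

// Then create the RDD from the already-indexed pairs:
// val rdd = sc.parallelize(withIndex)  // assumes a SparkContext named sc
```

Because each element carries its index as data rather than deriving it from partition layout, the original position survives any subsequent shuffling or repartitioning.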
