Ordered union on Spark RDDs

Problem description

I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One thing I noticed in the documentation is that they sorted key-line-number pairs rather than key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:

// Key: first 10 characters; the value is split into two parts
val keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
val keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
val result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
val unionResult = result.union(keyValueRDD_2)
// Concatenate the values that share a key back into one record
val finalResult = unionResult.foldByKey("")(_ + _)

When I do a union of the result RDD and the keyValueRDD_2 RDD and print the output of unionResult, the contents of result and keyValueRDD_2 are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. When I then apply foldByKey, which combines the values of the same key into a single key-value pair, the sorted order is destroyed. I need the fold-by-key step in order to save the result as the original key-record pairs. Is there an alternative RDD function that could be used to achieve this?

Any tips or suggestions would be quite useful. Thanks

Recommended answer

The union method simply places two RDDs one after the other, unless they have the same partitioner. In that case it merges the corresponding partitions.
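The two behaviours of union described above can be observed by counting partitions. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `UnionSketch`, the helper `partitionCounts`, and the sample data are hypothetical and not part of the original question.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object UnionSketch {
  // Returns (partitions after a plain union, partitions after a union
  // of two RDDs that share the same partitioner).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val sorted   = sc.parallelize(Seq(("a", "1"), ("b", "2")), 2)
    val unsorted = sc.parallelize(Seq(("b", "y"), ("a", "x")), 2)

    // No shared partitioner: the partitions of the second RDD are
    // simply appended after those of the first.
    val plain = sorted.union(unsorted)

    // Same partitioner on both sides: corresponding partitions are merged,
    // so the partition count stays at the partitioner's size.
    val p     = new HashPartitioner(2)
    val aware = sorted.partitionBy(p).union(unsorted.partitionBy(p))

    (plain.partitions.length, aware.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-sketch")
    val sc = new SparkContext(conf)
    try println(partitionCounts(sc)) finally sc.stop()
  }
}
```

In neither case does union reorder records by key, which is why the sorted order of one input does not carry over to the other.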

What you want to do is impossible.

When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.

The existence of the sorted RDD does not help us sort the second RDD.

The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted. Each partition now covers a range of keys, but the records within each partition are not yet sorted.
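The shuffle-then-local-sort pattern can be expressed with the standard RDD API by keeping each record whole instead of splitting it into two RDDs. This is a sketch under assumptions, not the benchmark's implementation: it assumes each input line starts with a 10-character key, and `RangeSortSketch` and `sortRecords` are hypothetical names.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RangeSortSketch {
  // Assumption: the first 10 characters of each line are the key and
  // the remainder is the payload.
  def sortRecords(input: RDD[String], numPartitions: Int): RDD[(String, String)] = {
    val keyed = input.map(line => (line.substring(0, 10), line.substring(10)))
    // The range partitioner gives each partition a contiguous key range;
    // the sort itself then happens locally inside each partition.
    val partitioner = new RangePartitioner(numPartitions, keyed)
    keyed.repartitionAndSortWithinPartitions(partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("range-sort-sketch")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("cccccccccc-3", "aaaaaaaaaa-1", "bbbbbbbbbb-2"), 2)
      sortRecords(lines, 2).collect().foreach(println)
    } finally sc.stop()
  }
}
```

Calling `sortByKey` on the keyed RDD gives the same overall ordering; the explicit version only makes the two phases visible.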

Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
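To illustrate the idea of sorting small key-plus-position pairs instead of whole records, here is a plain Scala sketch for a single partition held in memory. It assumes a 10-character key prefix, and the names `KeyIndexSortSketch` and `sortPartition` are hypothetical. It only shows the indirection; boxed JVM tuples like these do not reproduce the cache behaviour of the packed binary layout a tuned sort would use.

```scala
object KeyIndexSortSketch {
  // Sort one partition's records by sorting compact (key, index) pairs
  // and then emitting the full records in the resulting order.
  def sortPartition(records: Array[String]): Array[String] = {
    // Pair each key prefix with the position of its record.
    val keyed: Array[(String, Int)] =
      records.zipWithIndex.map { case (rec, i) => (rec.substring(0, 10), i) }
    // Only the small pairs are moved around during the sort.
    val sorted = keyed.sortBy(_._1)
    // Follow each index back to its full record.
    sorted.map { case (_, i) => records(i) }
  }

  def main(args: Array[String]): Unit = {
    val partition = Array("bbbbbbbbbb-payload-2", "aaaaaaaaaa-payload-1")
    sortPartition(partition).foreach(println)
  }
}
```

In a Spark job, a function of this shape could be applied per partition after the range-partitioning shuffle, for example through `mapPartitions`.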
