Ordered union on Spark RDDs

Views: 284 · Published: 2016/5/22 16:36:00 · Tags: apache-spark, rdd

Problem description

I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark that Databricks used to break the sorting record. One thing I noticed in the documentation is that they sorted key-line-number pairs rather than key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:

// Key (first 10 characters) paired with character 12 of the record
val keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
// Same key paired with characters 14 to 97 of the record
val keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
val result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
val unionResult = result.union(keyValueRDD_2)
val finalResult = unionResult.foldByKey("")(_ + _)

When I union the result RDD with keyValueRDD_2 and print unionResult, the contents of result and keyValueRDD_2 are not interleaved. In other words, it looks like unionResult holds the keyValueRDD_2 contents followed by the result contents. When I then run foldByKey, which combines the values for the same key into a single key-value pair, the sorted order is destroyed. I need the fold-by-key step in order to save the output as the original key-record pairs. Is there an alternative RDD function that could be used to achieve this?

Any tips or suggestions would be quite useful. Thanks
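
For context on the behaviour described above: union simply concatenates the partitions of its two inputs, so it never interleaves by key, and foldByKey triggers a shuffle that by default hash-partitions the keys, which discards any earlier ordering. One possible direction, offered only as a sketch and not as a confirmed answer, is to keep each record whole, range-partition by key, and then order each partition by sorting small integer positions instead of the records themselves. The names KeyIndexSortSketch, sortByKeyIndex and sortRecords, the sample data, and the local master setting are all hypothetical, and this is not claimed to be how the Databricks benchmark was implemented.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object KeyIndexSortSketch {
  // Hypothetical helper: orders one partition by sorting positions, not records.
  def sortByKeyIndex(records: Array[(String, String)]): Iterator[String] = {
    // Sort only the integer positions, comparing the keys they point to.
    val order = records.indices.sortBy(i => records(i)._1)
    // Emit the full records in key order.
    order.iterator.map(i => records(i)._2)
  }

  // Hypothetical entry point: returns the input lines ordered by their 10-character key.
  def sortRecords(input: RDD[String], numPartitions: Int): RDD[String] = {
    // Keep each record whole; the key is its first 10 characters, as in the question.
    val pairs = input.map(line => (line.substring(0, 10), line))
    // Range partitioning places lower keys in lower-numbered partitions.
    val ranged = pairs.partitionBy(new RangePartitioner(numPartitions, pairs))
    // Order each partition locally; nothing is shuffled after this step.
    ranged.mapPartitions(iter => sortByKeyIndex(iter.toArray), preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master, for illustration only.
    val conf = new SparkConf().setAppName("key-index-sort-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample records; each line must be at least 10 characters long.
      val input = sc.parallelize(Seq(
        "cccccccccc  payload three",
        "aaaaaaaaaa  payload one",
        "bbbbbbbbbb  payload two"))
      sortRecords(input, 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the record never leaves its key, no union or foldByKey is needed to reassemble it, so there is no later shuffle to disturb the order. If the simpler route is acceptable, input.map(x => (x.substring(0, 10), x)).sortByKey() gives a globally ordered RDD as well, at the cost of comparing and moving full key-record pairs during the sort.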
