Order by Value in Spark pairRDD from (Key,Value) where the value is from spark-sql


Problem description

I have created a pair RDD with a map like this:

val b = a.map(x => (x(0), x))

Here b has the type

org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]

  1. How can I sort the PairRDD within each key using a field from the value Row?
  2. After that, I want to run a function that processes all the values for each key in isolation, in the previously sorted order. Is that possible? If so, can you please give an example?
  3. Is there anything I need to consider when partitioning the pair RDD?

Solution

Answering only your first question:

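val indexToSelect: Int = ??? // index of a column with a sortable type (has an Ordering or is Ordered)
// Row.apply returns Any, which has no Ordering, so read the column with a typed getter (Int is assumed here)
val sorted = rdd.sortBy(pair => pair._2.getInt(indexToSelect))

This takes the second element of each pair (pair._2), which is the Row, and reads the value at indexToSelect from it. The shorter pair._2(indexToSelect), or more verbosely pair._2.apply(indexToSelect), returns Any, so sortBy cannot find an Ordering for it; a typed getter such as getInt avoids that. Note that sortBy orders the whole RDD by that field, not the rows within each key.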
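
For the second question (processing each key's rows in isolation, in sorted order), one possible approach is to group by key and sort each group before processing it. The following is only a minimal sketch, not part of the original answer: the sample rows, the Int sort column at index 1, the local SparkSession, and the helper name sortRows are all assumptions made for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object SortWithinKey {
  // Hypothetical helper: sort one key's rows by the Int column at `idx`.
  def sortRows(rows: Iterable[Row], idx: Int): Seq[Row] =
    rows.toSeq.sortBy(_.getInt(idx))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sort-within-key")
      .getOrCreate()

    // Assumed sample data: column 0 is the key, column 1 is the Int sort field.
    val a: RDD[Row] = spark.sparkContext.parallelize(Seq(
      Row("k1", 3), Row("k2", 1), Row("k1", 1), Row("k1", 2)
    ))
    val b: RDD[(Any, Row)] = a.map(x => (x(0), x))

    val sortIdx = 1

    // groupByKey shuffles so that all rows of a key end up together;
    // each group is then sorted on its own.
    val sortedPerKey: RDD[(Any, Seq[Row])] =
      b.groupByKey().mapValues(rows => sortRows(rows, sortIdx))

    // Process every key's rows in isolation, in sorted order.
    // Here the "processing" just joins the sort field values into a string.
    val processed: RDD[(Any, String)] =
      sortedPerKey.mapValues(rows => rows.map(_.getInt(sortIdx)).mkString(","))

    processed.collect().foreach(println)
    spark.stop()
  }
}
```

On partitioning: groupByKey already co-locates each key's rows by hashing the key, so no manual partitioning is needed for this sketch. It does hold all rows of a key in memory at once, so for very large groups a secondary sort with repartitionAndSortWithinPartitions on a composite key may be a better fit.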
