How to perform Set transformations on RDDs with different numbers of columns?


Question


I have two RDDs. One RDD is of type RDD[(String, String, String)] and the second RDD is of type RDD[(String, String, String, String, String)]. Whenever I try to perform operations like union, intersection, etc., I get the error:

error: type mismatch;
found: org.apache.spark.rdd.RDD[(String, String, String, String, String, String)]
required: org.apache.spark.rdd.RDD[(String, String, String)]
   uid.union(uid1).first()


How can I perform the set operations in this case? If set operations are not possible at all, what can I do to get the same result as set operations without having the type mismatch problem?


Here's a sample of the first lines from both RDDs:

(" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502") 

(fb_100007609418328,-795000,r316079113_serv60i) 
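For reference, here is a minimal sketch of a setup that reproduces the mismatch, using the names uid and uid1 from the error message, an assumed existing SparkContext sc, and purely illustrative values:

import org.apache.spark.rdd.RDD

// uid holds 3-field tuples, uid1 holds 6-field tuples (values are illustrative only)
val uid: RDD[(String, String, String)] =
  sc.parallelize(Seq(("fb_100007609418328", "-795000", "r316079113_serv60i")))

val uid1: RDD[(String, String, String, String, String, String)] =
  sc.parallelize(Seq(("p69465323_serv80i", "7", "fb_406423006398063",
    "guest_861067032060185_android", "fb_100000829486587", "fb_100007900293502")))

// uid.union(uid1) // does not compile: the element types differ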

Answer


Several operations require two RDDs to have the same type.


Let's take union for example: union basically concatenates two RDDs. As you can imagine, it would be unsound to concatenate the following:

RDD1
(1, 2)
(3, 4)

RDD2
(5, 6, "string1")
(7, 8, "string2")


As you see, RDD2 has one extra column. One thing that you can do is to work on RDD1 so that its schema matches that of RDD2, for example by adding a default value:

RDD1
(1, 2)
(3, 4)

RDD1 (AMENDED)
(1, 2, "default")
(3, 4, "default")

RDD2
(5, 6, "string1")
(7, 8, "string2")

UNION
(1, 2, "default")
(3, 4, "default")
(5, 6, "string1")
(7, 8, "string2")


You can achieve this with the following code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // your SparkContext

val rdd1: RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (3, 4)))

val rdd2: RDD[(Int, Int, String)] =
  sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

// pad rdd1 with a default value so its shape matches rdd2
val amended: RDD[(Int, Int, String)] =
  rdd1.map(pair => (pair._1, pair._2, "default"))

val union: RDD[(Int, Int, String)] =
  amended.union(rdd2)

Now, if you print the contents with

union.foreach(println)


you will get what we ended up having in the above example.


Of course, the exact semantics of how you want the two RDDs to match depend on your problem.
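Conversely, you could project the wider RDD down to the narrower shape instead of padding, for example by dropping the extra column. Here is a minimal sketch reusing rdd1 and rdd2 from the snippet above; whether discarding the extra data is acceptable depends on your use case:

// drop the extra column from rdd2 so it matches rdd1's (Int, Int) shape
val trimmed: RDD[(Int, Int)] =
  rdd2.map(triple => (triple._1, triple._2))

// once the element types match, any set operation works
val combined: RDD[(Int, Int)] = rdd1.union(trimmed)
val overlap: RDD[(Int, Int)] = rdd1.intersection(trimmed)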

