How to perform set transformations on RDDs with different numbers of columns?
Question
I have two RDDs. One is of type RDD[(String, String, String)] and the second is of type RDD[(String, String, String, String, String, String)]. Whenever I try to perform operations like union, intersection, etc., I get the error:
error: type mismatch;
found: org.apache.spark.rdd.RDD[(String, String, String, String,String, String)]
required: org.apache.spark.rdd.RDD[(String, String, String)]
uid.union(uid1).first()
How can I perform the set operations in this case? If set operations are not possible at all, what can I do to get the same result as set operations without having the type mismatch problem?
Here's a sample of the first rows from both RDDs:
("p69465323_serv80i","7","fb_406423006398063","guest_861067032060185_android","fb_100000829486587","fb_100007900293502")
(fb_100007609418328,-795000,r316079113_serv60i)
Answer
Several operations, union and intersection among them, require both RDDs to have the same element type.
Let's take union as an example: union basically concatenates two RDDs. As you can imagine, it would be unsound to concatenate the following:
RDD1
(1, 2)
(3, 4)
RDD2
(5, 6, "string1")
(7, 8, "string2")
As you can see, RDD2 has one extra column. One thing you can do is transform RDD1 so that its schema matches that of RDD2, for example by adding a default value:
RDD1
(1, 2)
(3, 4)
RDD1 (AMENDED)
(1, 2, "default")
(3, 4, "default")
RDD2
(5, 6, "string1")
(7, 8, "string2")
UNION
(1, 2, "default")
(3, 4, "default")
(5, 6, "string1")
(7, 8, "string2")
You can achieve this with the following code:
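import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ??? // your SparkContext (spark-shell already provides one as `sc`)

val rdd1: RDD[(Int, Int)] =
  sc.parallelize(Seq((1, 2), (3, 4)))
val rdd2: RDD[(Int, Int, String)] =
  sc.parallelize(Seq((5, 6, "string1"), (7, 8, "string2")))

// Pad each pair with a default third field so both RDDs share one element type
val amended: RDD[(Int, Int, String)] =
  rdd1.map(pair => (pair._1, pair._2, "default"))
// Both sides are now RDD[(Int, Int, String)], so union type-checks
val union: RDD[(Int, Int, String)] =
  amended.union(rdd2)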
If you now print the contents
union.foreach(println)
you will get the rows shown in the UNION example above (the order in which they are printed may vary).
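If a sentinel string such as "default" could collide with real data, one variation is to mark the missing columns explicitly with Option. The following is only a sketch under assumptions: the aliases Narrow, Wide and Padded and the functions padNarrow and liftWide are hypothetical names, the first three columns of both row shapes are assumed to line up (the question does not say so), and plain Scala Seqs stand in for the RDDs so the snippet runs without Spark.

```scala
object OptionPadding {
  // Hypothetical aliases for the two row shapes from the question.
  type Narrow = (String, String, String)
  type Wide   = (String, String, String, String, String, String)
  // Common shape: three shared columns plus three optional extras.
  type Padded = (String, String, String, Option[String], Option[String], Option[String])

  // Pad a narrow row: the extra columns are absent.
  def padNarrow(r: Narrow): Padded =
    (r._1, r._2, r._3, None, None, None)

  // Lift a wide row: the extra columns are present.
  def liftWide(r: Wide): Padded =
    (r._1, r._2, r._3, Some(r._4), Some(r._5), Some(r._6))

  def main(args: Array[String]): Unit = {
    // Made-up rows; Seq stands in for RDD here.
    val narrow: Seq[Narrow] = Seq(("a", "1", "x"))
    val wide: Seq[Wide]     = Seq(("b", "2", "y", "p", "q", "r"))

    // With Spark the same functions would be passed to RDD.map:
    //   narrowRdd.map(padNarrow).union(wideRdd.map(liftWide))
    val all: Seq[Padded] = narrow.map(padNarrow) ++ wide.map(liftWide)
    all.foreach(println)
  }
}
```

Because both mapped collections share the element type Padded, the type mismatch from the question no longer arises, and a None can never be confused with a genuine column value.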
Of course, the exact semantics of how you want the two RDDs to match depend on your problem.
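To illustrate how the semantics can change the approach: padding works well for union, but a row padded with defaults will rarely equal a real row, so an intersection over padded rows tends to come out empty. One alternative is to compare only the columns both shapes share. The sketch below assumes, purely for illustration, that the first column identifies a row in both shapes; keyOfNarrow, keyOfWide and sharedKeys are hypothetical names, and plain Scala collections again stand in for the RDDs.

```scala
object SharedColumns {
  // Hypothetical aliases for the two row shapes from the question.
  type Narrow = (String, String, String)
  type Wide   = (String, String, String, String, String, String)

  // Assumption: column 1 identifies a row in both shapes.
  def keyOfNarrow(r: Narrow): String = r._1
  def keyOfWide(r: Wide): String     = r._1

  // Keys that occur in both collections.
  // With Spark: narrowRdd.map(keyOfNarrow).intersection(wideRdd.map(keyOfWide))
  def sharedKeys(narrow: Seq[Narrow], wide: Seq[Wide]): Set[String] =
    narrow.map(keyOfNarrow).toSet.intersect(wide.map(keyOfWide).toSet)

  def main(args: Array[String]): Unit = {
    // Made-up rows for illustration only.
    val narrow: Seq[Narrow] = Seq(("u1", "1", "x"), ("u2", "2", "y"))
    val wide: Seq[Wide] =
      Seq(("u2", "3", "z", "p", "q", "r"), ("u3", "4", "w", "s", "t", "v"))

    val keys = sharedKeys(narrow, wide)
    // Keep the narrow rows whose key also appears on the wide side.
    narrow.filter(r => keys.contains(keyOfNarrow(r))).foreach(println)
  }
}
```

In Spark, keying both RDDs with keyBy and then calling join would be one way to recover the full rows from both sides for the shared keys.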