Why would I want .union over .unionAll in Spark for SchemaRDDs?


Question

I'm trying to understand these two functions in the Spark SQL documentation:

    def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

    def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.
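For reference, the standard SQL semantics can be checked directly with Spark SQL. A minimal sketch, assuming a Spark 2.x SparkSession named spark (the setup is illustrative, not part of the original question):

    // SQL UNION ALL keeps duplicates: two rows of x = 1.
    spark.sql("SELECT 1 AS x UNION ALL SELECT 1 AS x").show()

    // SQL UNION deduplicates: a single row of x = 1.
    spark.sql("SELECT 1 AS x UNION SELECT 1 AS x").show()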

My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.

    scala> case class Person(name: String, age: Int)
    scala> import org.apache.spark.sql._
    scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
    scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2),  Person("Gamma", 3)))
    scala> val schemaString = "name age"
    scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
    scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
    scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
    res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
    scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
    res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
    

Why would I prefer one over the other?

Answer

In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
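To illustrate, here is a minimal Spark 2.x sketch of that behavior using DataFrames (which replaced SchemaRDDs); the session setup and variable names are illustrative, not from the original answer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("union-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val one = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
    val two = Seq(("Alpha", 1), ("Beta", 2), ("Gamma", 3)).toDF("name", "age")

    // union keeps duplicates, i.e. SQL UNION ALL semantics;
    // the deprecated unionAll alias behaves identically.
    one.union(two).show()             // 5 rows: [Alpha,1] and [Beta,2] appear twice

    // For SQL UNION semantics, deduplicate explicitly.
    one.union(two).distinct().show()  // 3 rows

If you want the deduplicating behavior that the SQL keyword UNION implies, calling .distinct() after union is the explicit way to get it.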

