Why would I want .union over .unionAll in Spark for SchemaRDDs?


Question

I'm trying to understand these two functions in the Spark SQL documentation:

    def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

    def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.
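For reference, the standard SQL semantics can be checked directly with Spark SQL. A minimal sketch, assuming a Spark 2.x SparkSession named spark (the setup is illustrative, not part of the original question):

    // SQL UNION ALL keeps duplicates: two rows of x = 1.
    spark.sql("SELECT 1 AS x UNION ALL SELECT 1 AS x").show()

    // SQL UNION deduplicates: a single row of x = 1.
    spark.sql("SELECT 1 AS x UNION SELECT 1 AS x").show()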

My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.

    scala> case class Person(name: String, age: Int)
    scala> import org.apache.spark.sql._
    scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
    scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2),  Person("Gamma", 3)))
    scala> val schemaString = "name age"
    scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
    scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
    scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
    res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
    scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
    res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
    

Why would I prefer one over the other?

Answer

In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
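To illustrate, here is a minimal Spark 2.x sketch of that behavior using DataFrames (which replaced SchemaRDDs); the session setup and variable names are illustrative, not from the original answer:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("union-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val one = Seq(("Alpha", 1), ("Beta", 2)).toDF("name", "age")
    val two = Seq(("Alpha", 1), ("Beta", 2), ("Gamma", 3)).toDF("name", "age")

    // union keeps duplicates, i.e. SQL UNION ALL semantics;
    // the deprecated unionAll alias behaves identically.
    one.union(two).show()             // 5 rows: [Alpha,1] and [Beta,2] appear twice

    // For SQL UNION semantics, deduplicate explicitly.
    one.union(two).distinct().show()  // 3 rows

If you want the deduplicating behavior that the SQL keyword UNION implies, calling .distinct() after union is the explicit way to get it.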

