Spark union of multiple RDDs

Views: 99  Published: 2020/9/3 23:36:48  Tags: python apache-spark pyspark rdd

Question

In my Pig code I do this:

all_combined = UNION relation1, relation2,
    relation3, relation4, relation5, relation6;

I want to do the same in Spark. Unfortunately, it looks like I have to do it pairwise:

first = rdd1.union(rdd2)
second = first.union(rdd3)
third = second.union(rdd4)
# .... and so on

Is there a union operator that lets me operate on multiple RDDs at once, for example:

union(rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)

This is purely a matter of convenience.

Recommended answer

If these are RDDs, you can use the SparkContext.union method:

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
rdd3 = sc.parallelize([7, 8, 9])

rdd = sc.union([rdd1, rdd2, rdd3])
rdd.collect()

## [1, 2, 3, 4, 5, 6, 7, 8, 9]
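
The question asked for a call shaped like union(rdd1, rdd2, ...). A minimal sketch of such a wrapper follows. The name union_rdds is hypothetical (it is not part of PySpark), and sc is assumed to be an active SparkContext. It relies only on SparkContext.union and SparkContext.emptyRDD, and returns an empty RDD when called with no inputs.

```python
def union_rdds(sc, *rdds):
    """Union any number of RDDs in a single call.

    `sc` is assumed to be an active SparkContext. With no RDDs given,
    an empty RDD is returned instead of delegating to sc.union.
    """
    if not rdds:
        return sc.emptyRDD()
    # SparkContext.union expects one list of RDDs, not varargs.
    return sc.union(list(rdds))


# Intended usage (requires a running SparkContext):
# all_combined = union_rdds(sc, rdd1, rdd2, rdd3, rdd4, rdd5, rdd6)
```

This keeps the call site close to the Pig statement while still producing a single union over all inputs rather than a chain of pairwise unions.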

There is no DataFrame equivalent, but it takes only a simple one-liner:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df1 = sqlContext.createDataFrame([(1, "foo1"), (2, "bar1")], ("k", "v"))
df2 = sqlContext.createDataFrame([(3, "foo2"), (4, "bar2")], ("k", "v"))
df3 = sqlContext.createDataFrame([(5, "foo3"), (6, "bar3")], ("k", "v"))

unionAll(df1, df2, df3).show()

## +---+----+
## |  k|   v|
## +---+----+
## |  1|foo1|
## |  2|bar1|
## |  3|foo2|
## |  4|bar2|
## |  5|foo3|
## |  6|bar3|
## +---+----+
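
One caveat with the reduce approach: DataFrame.unionAll (like DataFrame.union) matches columns by position, not by name. If the inputs might list the same columns in a different order, a variant built on DataFrame.unionByName (available since Spark 2.3) may be safer. The sketch below assumes every input has the same set of column names; the helper name union_all_by_name is hypothetical.

```python
from functools import reduce


def union_all_by_name(*dfs):
    """Fold DataFrames together, matching columns by name rather than position."""
    if not dfs:
        raise ValueError("union_all_by_name needs at least one DataFrame")
    # Each step calls the real DataFrame.unionByName method on the running result.
    return reduce(lambda left, right: left.unionByName(right), dfs)


# Intended usage (requires DataFrames with identical column names):
# union_all_by_name(df1, df2, df3).show()
```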

If the number of DataFrames is large, using SparkContext.union on the underlying RDDs and recreating the DataFrame may be a better choice, because it avoids the cost of preparing a very large execution plan:

def unionAll(*dfs):
    first, *_ = dfs  # Python 3.x, for 2.x you'll have to unpack manually
    return first.sql_ctx.createDataFrame(
        first.sql_ctx._sc.union([df.rdd for df in dfs]),
        first.schema
    )
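
The version above reaches the SparkContext through the private _sc attribute. As a hedged alternative, the same idea can be sketched with the session passed in explicitly, using only the public SparkSession.sparkContext, SparkContext.union, and SparkSession.createDataFrame calls. The function name union_all_via_rdd and the parameter name spark are assumptions for illustration, and all inputs are assumed to share the schema of the first DataFrame.

```python
def union_all_via_rdd(spark, *dfs):
    """Union DataFrames through their RDDs, reusing the first DataFrame's schema.

    `spark` is assumed to be an active SparkSession.
    """
    if not dfs:
        raise ValueError("union_all_via_rdd needs at least one DataFrame")
    # One flat union over all underlying RDDs instead of a chain of pairwise unions.
    combined = spark.sparkContext.union([df.rdd for df in dfs])
    return spark.createDataFrame(combined, dfs[0].schema)


# Intended usage (requires a running SparkSession named `spark`):
# union_all_via_rdd(spark, df1, df2, df3).show()
```

Because indexing with dfs[0] replaces the starred unpacking, this form also avoids the Python 3-only syntax noted in the comment above.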
