Can Dataframe joins in Spark preserve order?

Question
I'm currently trying to join two DataFrames together while retaining the original row order of one of them.
From Which operations preserve RDD order?, it seems (correct me if this is inaccurate, as I'm new to Spark) that joins do not preserve order: because the data lives in different partitions, rows "arrive" at the final dataframe in no specified order.
How could one perform a join of two DataFrames while preserving the order of one table?
For example,

+------+------+
| col1 | col2 |
+------+------+
| 0    | a    |
| 1    | b    |
+------+------+
joined with

+------+------+
| col2 | col3 |
+------+------+
| b    | x    |
| a    | y    |
+------+------+
on col2 should give

+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 0    | a    | y    |
| 1    | b    | x    |
+------+------+------+
I've heard some things about using coalesce or repartition, but I'm not sure. Any suggestions/methods/insights are appreciated.
Edit: would this be analogous to having one reducer in MapReduce? If so, what would that look like in Spark?
Answer

It can't. You can add monotonically_increasing_id and reorder the data after the join.