Multiple consecutive joins with PySpark
Problem description
I'm trying to join multiple DataFrames together. Because of how join works, I end up with the same column name duplicated across the result.
The documentation for join says: "When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key."
# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))
# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))
I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014."
What is the proper way of removing W from my dataset once successfully joined?
Answer
You can use an equi-join by passing the join key as a list of column names; the result then contains only a single UserId column:
minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
Or use aliases:
from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)
Or reference the parent DataFrames:
(minTime
    .join(maxTime, minTime["UserId"] == maxTime["UserId"])
    .join(sumTime, minTime["UserId"] == sumTime["UserId"]))
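With the last two approaches the result still carries a duplicate UserId column from each right-hand DataFrame. A minimal sketch of removing them, assuming a Spark version whose DataFrame.drop accepts Column arguments:

# Join, then drop the duplicate key columns by referencing them
# through the DataFrames they came from.
joined = (minTime
    .join(maxTime, minTime["UserId"] == maxTime["UserId"])
    .join(sumTime, minTime["UserId"] == sumTime["UserId"])
    .drop(maxTime["UserId"])
    .drop(sumTime["UserId"]))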
As a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.
Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without a join; see the sketch below.
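For illustration, a minimal sketch of that approach, assuming a hypothetical parent DataFrame named events with columns UserId and Duration, from which minTime, maxTime and sumTime were presumably derived:

from pyspark.sql import functions as F

# events is the hypothetical parent table: one row per event,
# with columns UserId and Duration.
aggregated = events.groupBy("UserId").agg(
    F.min("Duration").alias("minDuration"),
    F.max("Duration").alias("maxDuration"),
    F.sum("Duration").alias("sumDuration"),
)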