Multiple consecutive joins with PySpark


Problem Description

I'm trying to join multiple DataFrames together. Because of how joins work, I end up with the same column name duplicated all over the result.

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
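For reference, a minimal sketch of what that quoted RDD join returns, assuming a running SparkContext and keys/values invented purely for illustration:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

left = sc.parallelize([("u1", 10), ("u2", 20)])    # (K, V)
right = sc.parallelize([("u1", 99), ("u1", 100)])  # (K, W)

# Inner join on the key: every matching (V, W) pair is kept, so this prints
# [('u1', (10, 99)), ('u1', (10, 100))]
print(left.join(right).collect())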

# Join Min and Max to S1
joinned_s1 = (minTime.join(maxTime, minTime["UserId"] == maxTime["UserId"]))

# Join S1 and sum to s2
joinned_s2 = (joinned_s1.join(sumTime, joinned_s1["UserId"] == sumTime["UserId"]))

I got this error: "Reference 'UserId' is ambiguous, could be: UserId#1578, UserId#3014.;"

What is the proper way of removing W from my dataset once it is successfully joined?

Recommended Answer

You can use an equi-join:

 minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
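Joining on a list of column names deduplicates the join key, so UserId appears only once in the result. A quick illustrative check, assuming a SparkSession named spark and toy data made up for this example:

minTime = spark.createDataFrame([("u1", 1)], ["UserId", "minVal"])
maxTime = spark.createDataFrame([("u1", 9)], ["UserId", "maxVal"])
sumTime = spark.createDataFrame([("u1", 30)], ["UserId", "sumVal"])

joined = minTime.join(maxTime, ["UserId"]).join(sumTime, ["UserId"])
joined.printSchema()  # UserId is listed once, so later references to it are unambiguous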

or aliases:

from pyspark.sql.functions import col

minTime.alias("minTime").join(
    maxTime.alias("maxTime"),
    col("minTime.UserId") == col("maxTime.UserId")
)

or reference the parent table:

(minTime
  .join(maxTime, minTime["UserId"] == maxTime["UserId"])
  .join(sumTime, minTime["UserId"] == sumTime["UserId"]))

As a side note, you're quoting the RDD docs, not the DataFrame ones. These are different data structures and don't operate in the same way.

Also, it looks like you're doing something strange here. Assuming you have a single parent table, min, max and sum can be computed as simple aggregations without any join.
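A sketch of that aggregation approach, assuming a single parent DataFrame called events with placeholder columns UserId and Duration (names invented here, not taken from the question):

from pyspark.sql import functions as F

stats = events.groupBy("UserId").agg(
    F.min("Duration").alias("minTime"),
    F.max("Duration").alias("maxTime"),
    F.sum("Duration").alias("sumTime"),
)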

