Outer join Spark dataframe with non-identical join column and then merge join column


Problem description

Suppose I have the following dataframes in pySpark:

df1 = sqlContext.createDataFrame([Row(name='john', age=50), Row(name='james', age=25)])
df2 = sqlContext.createDataFrame([Row(name='john', weight=150), Row(name='mike', weight=115)])
df3 = sqlContext.createDataFrame([Row(name='john', age=50, weight=150), Row(name='james', age=25, weight=None), Row(name='mike', age=None, weight=115)])

Now suppose I want to create df3 by joining/merging df1 and df2.

I've tried

df1.join(df2, df1.name == df2.name, 'outer')

This doesn't quite work, because it produces two name columns. I then need to somehow combine the two name columns so that a name missing from one column is filled in by the value from the other.

How would I do that? Or is there a better way to create df3 from df1 and df2?

Recommended answer

You can use the coalesce function, which returns its first non-null argument.
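The semantics of coalesce (return the first argument that is not null) can be sketched in plain Python; this is an illustration of the behavior, not Spark code:

```python
def coalesce(*args):
    """Return the first argument that is not None, or None if all are None."""
    for arg in args:
        if arg is not None:
            return arg
    return None

print(coalesce(None, "john"))   # -> john
print(coalesce("james", None))  # -> james
print(coalesce(None, None))     # -> None
```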

from pyspark.sql.functions import coalesce

df1 = df1.alias("df1")
df2 = df2.alias("df2")

(df1.join(df2, df1.name == df2.name, 'outer')            # full outer join keeps rows from both sides
  .withColumn("name_", coalesce("df1.name", "df2.name"))  # merge the two name columns
  .drop("name")                                           # drop the original name columns
  .withColumnRenamed("name_", "name"))
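To see what the join-then-coalesce pipeline produces, here is a plain-Python simulation of the full outer join and merge on the question's sample data (pure Python, no Spark; the dict-based representation is just for illustration):

```python
df1 = {"john": 50, "james": 25}    # name -> age
df2 = {"john": 150, "mike": 115}   # name -> weight

# Full outer join on name: take every name seen on either side,
# and coalesce missing values to None.
merged = {
    name: {"age": df1.get(name), "weight": df2.get(name)}
    for name in df1.keys() | df2.keys()
}

for name in sorted(merged):
    print(name, merged[name])
```

This reproduces the rows of df3 from the question: james gets weight=None and mike gets age=None. In PySpark itself, an equi-join on a shared column name can also be written as `df1.join(df2, 'name', 'outer')`, which keeps a single name column and avoids the coalesce step entirely.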

