PySpark join and then column select is showing unexpected output


Question

I am not sure whether the long work day is getting to me, but I am seeing some unexpected behavior in Spark 2.2.0.

I created a toy example as follows:

toy_df = spark.createDataFrame([
['p1','a'],
['p1','b'],
['p1','c'],
['p2','a'],
['p2','b'],
['p2','d']],schema=['patient','drug']) 

I then create another DataFrame:

mdf = toy_df.filter(toy_df.drug == 'c')

As expected, mdf is:

 mdf.show()
+-------+----+
|patient|drug|
+-------+----+
|     p1|   c|
+-------+----+ 

Now, if I do this:

toy_df.join(mdf,["patient"],"left").select(toy_df.patient.alias("P1"),toy_df.drug.alias('D1'),mdf.patient,mdf.drug).show()

I am surprised by the result:

+---+---+-------+----+
| P1| D1|patient|drug|
+---+---+-------+----+
| p2|  a|     p2|   a|
| p2|  b|     p2|   b|
| p2|  d|     p2|   d|
| p1|  a|     p1|   a|
| p1|  b|     p1|   b|
| p1|  c|     p1|   c|
+---+---+-------+----+

But if I use:

toy_df.join(mdf,["patient"],"left").show()

I do see the expected behavior:

+-------+----+----+
|patient|drug|drug|
+-------+----+----+
|     p2|   a|null|
|     p2|   b|null|
|     p2|   d|null|
|     p1|   a|   c|
|     p1|   b|   c|
|     p1|   c|   c|
+-------+----+----+

And if I use an alias on one of the DataFrames, I also get the expected behavior:

toy_df.join(mdf.alias('D'),on=["patient"],how="left").select(toy_df.patient.alias("P1"),toy_df.drug.alias("D1"),'D.drug').show()

+---+---+----+
| P1| D1|drug|
+---+---+----+
| p2|  a|null|
| p2|  b|null|
| p2|  d|null|
| p1|  a|   c|
| p1|  b|   c|
| p1|  c|   c|
+---+---+----+

So my question is: what is the best way to select columns after a join, and is this behavior normal?

Edit: as per user8371915, this is the same as the question titled
"Spark SQL performing carthesian join instead of inner join".

However, my question involves two DataFrames that share the same lineage. The join itself is performed correctly when show is invoked, but selecting columns after the join behaves differently.

Recommended answer

The best way is to use aliases:

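from pyspark.sql.functions import col

# Alias both sides so each drug column can be referenced unambiguously by name.
toy_df.alias("toy_df") \
    .join(mdf.alias("mdf"), ["patient"], "left") \
    .select(
        col("patient").alias("P1"),
        col("toy_df.drug").alias("D1"),
        col("patient").alias("patient"),
        col("mdf.drug").alias("drug")
    ) \
    .show()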

The problem is that mdf is derived from toy_df, so toy_df.drug and mdf.drug both refer to the same underlying column. When you pass them to select, Spark therefore returns values from that same column for both.
