在2个数据框列上进行scala-外部联接不会显示存在空值的行 [英] scala- Outer join on 2 dataframe columns doesnt show rows where there are null values
问题描述
我这样加入2个数据帧:
val joinCols = Array( first_name, last_name)
val df_subset_joined = df1_subset.as( a)。join(df2_subset.as( b),joinCols, full_outer)
df_subset_joined.show()
Im joining 2 dataframes like so: val joinCols = Array("first_name", "last_name") val df_subset_joined = df1_subset.as("a").join(df2_subset.as("b"), joinCols, "full_outer") df_subset_joined.show()
这是上面代码的结果:
Dataframe of differences between 2 dataframes
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| null|
+----------+---------+-------------+-------------+
如您所见,存在带有空值的列。
As you can see there are columns with null values.
接下来我运行以下代码:
I run the following code next:
val filter_str = s"a.$col"+" != "+s"b.$col"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()
我得到了foll数据帧:
I get the foll dataframe:
Below is the dataframe of differences between DF1 and DF1 based on the comparison between:
a.loyalty_score != b.loyalty_score
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| tom | cruise| 66| 99|
| tom | cruise| 66| 34|
| eva| green| 44| 56|
| brad| pitt| 99| 78|
| george | clooney| 67| 88|
+----------+---------+-------------+-------------+
为什么我看不到在1列中有空值和实际值的行值另一个。不应该满足这个值!= null
Why dont I see the rows where there are null values in 1 column and a actual value in another. Shouldnt this satisfy value != null
如何使我的过滤器语句使空值出现在最终数据框中
How can I make my filter statement make the null values appear in the final dataframe
推荐答案
一列中没有 null
而没有 non的任何行的原因-null
另一个是比较返回 FALSE
。
The reason you don't get any rows where there is null
in one column and non-null
in the other is that the comparison returns FALSE
.
要避免这种情况,请使用空安全比较运算符 < =>
,与 not
结合使用。
To avoid this, use the null-safe comparison operator <=>
, in conjunction with not
.
val filter_str = "not(" + s"a.$col"+" <=> "+s"b.$col)"
val df_subset_filtered = df_subset_joined.filter(filter_str)
df_subset_filtered.show()
从文档中
expr1< => expr2-对于非空操作数,返回与EQUAL(=)运算符相同的结果,但如果两者均为null,则返回true一世f其中一个为空。
expr1 <=> expr2 - Returns same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of the them is null.
参数:
expr1,expr2-两个表达式必须相同type或可以强制转换为通用类型,并且必须是可在相等比较中使用的类型。不支持地图类型。对于诸如数组/结构之类的复杂类型,字段的数据类型必须是可排序的。
示例:
expr1, expr2 - the two expressions must be same type or can be casted to a common type, and must be a type that can be used in equality comparison. Map type is not supported. For complex types such array/struct, the data types of fields must be orderable. Examples:
SELECT 2< => 2;
true
SELECT 2 <=> 2; true
选择1< =>‘1’;
true
SELECT 1 <=> '1'; true
SELECT true< => NULL;
false
SELECT true <=> NULL; false
选择NULL< => NULL;
true
SELECT NULL <=> NULL; true
这篇关于在2个数据框列上进行scala-外部联接不会显示存在空值的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!