Multiple condition filter on dataframe
Question
Can anyone explain why I am getting different results for these two expressions? I am trying to filter between two dates:
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
Result: 37M
vs.
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
Result: 25M
How are they different? It seems to me that they should produce the same result.
Recommended answer
TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.
from pyspark.sql.functions import col
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
You can also use a single SQL string:
df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
In practice it makes more sense to use between:
df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
The first approach is not even remotely valid. In Python, and returns:
- The last element if all expressions are "truthy".
- The first "falsy" element otherwise.
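These two rules can be checked in plain Python, with no Spark involved. This is only a minimal sketch; the string values and variable names are arbitrary placeholders and do not come from the question:

```python
# `and` returns one of its operands, not a combined boolean condition.

# Every operand is truthy (non-empty strings), so the last one is returned.
all_truthy = "first" and "second"
print(all_truthy)  # second

# The empty string is falsy, so evaluation stops there and it is returned.
short_circuit = "" and "second"
print(repr(short_circuit))  # ''
```

The same rule applies to any pair of non-empty strings, including SQL condition strings.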
As a result, only the lower-bound condition ever reaches filter, because
"act_date <='2017-04-01'" and "act_date >='2016-10-01'"
is evaluated to (any non-empty string is truthy):
"act_date >='2016-10-01'"