How to use a PySpark when() function with an or condition


Problem description

I'm trying to use withColumn to null out bad dates in a column of a DataFrame, using a when() function to make the update. I have two conditions for "bad" dates: dates before January 1900, or dates in the future. My current code looks like this:

d = datetime.datetime.today()
df_out = df.withColumn(my_column, when(col(my_column) < '1900-01-01' | col(my_column) > '2019-12-09 17:01:37.774418', lit(None)).otherwise(col(my_column)))

I think my problem is that it doesn't like the or operator "|". From what I've seen on Google, "|" is what I should use, and I have tried "or" as well. Can anyone advise on what I'm doing wrong here?

Here is the stack trace:

df_out = df.withColumn(c, when(col(c) < '1900-01-01' | col(c) > '2019-12-09 17:01:37.774418', lit(None)).otherwise(col(c)))
  File "C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\column.py", line 115, in _
    njc = getattr(self._jc, name)(jc)
  File "C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py", line 1257, in __call__
  File "C:\spark-2.4.4-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\spark-2.4.4-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o48.or. Trace:
py4j.Py4JException: Method or([class java.lang.String]) does not exist

Recommended answer

It's just a matter of operator precedence. The error is telling you that PySpark cannot apply OR to a string: because Python's | operator binds more tightly than the comparison operators, it is trying to compute '1900-01-01' | col(c) first, and it does not know how to do that. You simply need to parenthesize each comparison:

df_out = df.withColumn(
    my_column,
    when(
        (col(my_column) < '1900-01-01') | (col(my_column) > '2019-12-09 17:01:37.774418'),
        lit(None),
    ).otherwise(col(my_column)),
)
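To see the fix end to end, here is a minimal, self-contained sketch; the sample data, the event_date column name, and the upper cutoff used below are illustrative assumptions, not taken from the original post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

spark = SparkSession.builder.appName("null-bad-dates").getOrCreate()

# Hypothetical sample data: one date before 1900, one valid, one in the future.
df = spark.createDataFrame(
    [("1899-12-31",), ("1950-06-15",), ("2099-01-01",)],
    ["event_date"],
)

# Parenthesizing each comparison matters because Python's | binds more
# tightly than < and >; without parentheses, '1900-01-01' | col("event_date")
# is evaluated first, which raises the Py4JError shown above.
df_out = df.withColumn(
    "event_date",
    when(
        (col("event_date") < "1900-01-01") | (col("event_date") > "2019-12-09"),
        lit(None),
    ).otherwise(col("event_date")),
)

df_out.show()
# +----------+
# |event_date|
# +----------+
# |      null|
# |1950-06-15|
# |      null|
# +----------+

Note that these are lexicographic string comparisons, which happen to order ISO-8601 date strings correctly.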

