pyspark check if HH:mm:ss is in a range


Problem description

I have some data that looks like this.

time
08:28:24
22:20:54 
12:59:38
21:46:07

I want to select the times between 16:00:00 and 23:59:59; this is a closed range.

What should I do with it? (The time column type is string.)

Thanks!

Answer

Your condition can be simplified to checking whether the hour part of your time column is between 16 and 23.

You can get the hour by using pyspark.sql.functions.split to tokenize the time column on the : character. Extract the token at index 0 to get the hour, then make the comparison using pyspark.sql.Column.between() (which is inclusive of the bounds).

from pyspark.sql.functions import split
df.where(split("time", ":")[0].between(16, 23)).show()
#+--------+
#|    time|
#+--------+
#|22:20:54|
#|21:46:07|
#+--------+

Note that even though split returns a string, there is an implicit conversion to int for the between comparison.
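
If you would rather not rely on that implicit conversion, a minimal sketch (my addition, assuming the same df as above) casts the hour token explicitly and gives the same result:

from pyspark.sql.functions import split
# Cast the hour token to int explicitly before the (inclusive) between comparison.
df.where(split("time", ":")[0].cast("int").between(16, 23)).show()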

Of course, this could be extended if you had more complicated filtering criteria that also involved looking at the minutes or seconds:

df.select(
    "*",
    split("time", ":")[0].cast("int").alias("hour"),
    split("time", ":")[1].cast("int").alias("minute"),
    split("time", ":")[2].cast("int").alias("second")
).show()
#+--------+----+------+------+
#|    time|hour|minute|second|
#+--------+----+------+------+
#|08:28:24|   8|    28|    24|
#|22:20:54|  22|    20|    54|
#|12:59:38|  12|    59|    38|
#|21:46:07|  21|    46|     7|
#+--------+----+------+------+
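
For example, here is a hedged sketch (not part of the original answer, assuming the same df) that keeps only times from 16:30:00 through 23:59:59 by combining the hour and minute parts:

from pyspark.sql.functions import split

hour = split("time", ":")[0].cast("int")
minute = split("time", ":")[1].cast("int")

# Keep rows from 16:30:00 up to and including 23:59:59:
# hour 16 only qualifies when the minute is at least 30.
df.where(((hour == 16) & (minute >= 30)) | hour.between(17, 23)).show()

The boolean column expressions combine with & and | just like the between expression above.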
