比较数据框中的日期 [英] Compare dates in dataframes

查看:114
本文介绍了比较数据框中的日期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Scala中有两个数据框:

I have two dataframes in Scala:

df1 =

ID  start_date_time
1   2016-10-12 11:55:23
2   2016-10-12 12:25:00
3   2016-10-12 16:20:00

df2 =

PK  start_date
1   2016-10-12
2   2016-10-14

如果满足以下条件,我需要向 df1 添加一个新列,其值为 0 失败,否则 - > 1

I need to add a new column to df1 that will have value 0 if the following condition fails, otherwise -> 1:


如果 ID == PK start_date_time 指的是同一年,月和日的 start_date

If ID == PK and start_date_time refers to the same year, month and day as start_date.

结果应该是这样的:

df1 =

ID  start_date_time      check
1   2016-10-12-11-55-23  1
2   2016-10-12-12-25-00  0
3   2016-10-12-16-20-00  0

我该怎么做?

我假设逻辑应该是这样的:

I assume that the logic should be something like this:

    df1 = df.withColumn("check", define(df("ID"),df("start_date")))

    val define = udf {(id: String,dateString:String) =>
       val formatter = new SimpleDateFormat("yyyy-MM-dd")
       val date = formatter.format(dateString)
       val checks = df2.filter(df2("PK")===ID).filter(df2("start_date_time")===date)
       if(checks.collect().length>0) "1" else "0"
    }

但是,我对如何比较日期有疑问,因为 df1 df2 具有不同格式的日期。如何更好地实现?

However, I have doubts regarding how to compare dates, because df1 and df2 have differently formatted dates. How to better implement it?

推荐答案

您可以使用 spark datetime函数 df1 df2 然后在 df1,df2 上进行左连接,这里您创建一个额外的常数列在df2上检查,以指示结果中是否有匹配:

You can use spark datetime functions to create date columns on both df1 and df2 and then do a left join on df1, df2, here you create an extra constant column check on df2 to indicate if there is a match in the result:

import org.apache.spark.sql.functions.lit

val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
                    withColumn("check", lit(1)).
                    select($"PK".as("ID"), $"date", $"check"))

df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show

+---+--------------------+-----+
| ID|     start_date_time|check|
+---+--------------------+-----+
|  1|2016-10-12 11:55:...|    1|
|  2|2016-10-12 12:25:...|    0|
|  3|2016-10-12 16:20:...|    0|
+---+--------------------+-----+

这篇关于比较数据框中的日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆