Calculating duration by subtracting two datetime columns in string format


Problem Description


I have a Spark DataFrame that consists of a series of dates:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd

rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True),
                     StructField('ANI', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

What I want to do is find the duration by subtracting StartDateTime from EndDateTime. I figured I'd try to do this using a function:

# Function to calculate time delta
def time_delta(y,x): 
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end-start)
    return delta

# create new RDD and add new column 'Duration' by applying time_delta function
df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime)) 

However, this just gives me:

>>> df2.show()
ID  EndDateTime          StartDateTime        ANI            Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null    
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null    
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null    
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null    
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null  

I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.
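(As an aside on why the nulls appear: df.withColumn expects a Column expression, but time_delta is called here as a plain Python function, so it receives Column objects rather than the row values; to run per row it would need to be registered as a UDF. The subtraction logic itself is sound — a quick standard-library check of row X01, outside Spark:)

```python
from datetime import datetime

# Parse the two string timestamps from row X01 and subtract them directly.
fmt = "%Y-%m-%dT%H:%M:%S.%f"
end = datetime.strptime("2014-02-13T12:36:14.899", fmt)
start = datetime.strptime("2014-02-13T12:31:56.876", fmt)

delta = end - start
print(delta)  # 0:04:18.023000
```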

Solution

As of Spark 1.5 you can use unix_timestamp:

from pyspark.sql import functions as F
timeFmt = "yyyy-MM-dd'T'HH:mm:ss.SSS"
timeDiff = (F.unix_timestamp('EndDateTime', format=timeFmt)
            - F.unix_timestamp('StartDateTime', format=timeFmt))
df = df.withColumn("Duration", timeDiff)

Note the Java-style (SimpleDateFormat) time format.

>>> df.show()
+---+--------------------+--------------------+--------+
| ID|         EndDateTime|       StartDateTime|Duration|
+---+--------------------+--------------------+--------+
|X01|2014-02-13T12:36:...|2014-02-13T12:31:...|     258|
|X02|2014-02-13T12:35:...|2014-02-13T12:32:...|     204|
|X03|2014-02-13T12:36:...|2014-02-13T12:32:...|     228|
|XO4|2014-02-13T12:37:...|2014-02-13T12:32:...|     269|
|XO5|2014-02-13T12:36:...|2014-02-13T12:33:...|     202|
+---+--------------------+--------------------+--------+
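(One caveat worth noting about the table above — my observation, not part of the original answer: unix_timestamp returns whole seconds, so each timestamp is truncated to the second before the subtraction. Row XO4, for instance, shows 269 even though the millisecond-exact difference is 268.579 s. A stdlib sketch of the two behaviours:)

```python
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f"
end = datetime.strptime("2014-02-13T12:37:05.460", fmt)
start = datetime.strptime("2014-02-13T12:32:36.881", fmt)

# Millisecond-exact difference:
print((end - start).total_seconds())  # 268.579
# What unix_timestamp effectively does: drop the fractional seconds first.
trunc = (end.replace(microsecond=0) - start.replace(microsecond=0)).seconds
print(trunc)  # 269
```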
