Calculating duration by subtracting two datetime columns in string format

Problem description

I have a Spark DataFrame that consists of a series of dates stored as strings:

from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
import pandas as pd

rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876','sip:4534454450'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321','sip:6413445440'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229','sip:4534437492'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881','sip:6474454453'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323','sip:8874458555')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True),
                     StructField('ANI', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

What I want to do is find the duration by subtracting StartDateTime from EndDateTime. I figured I'd try to do this with a function:

# Function to calculate time delta
def time_delta(y,x): 
    end = pd.to_datetime(y)
    start = pd.to_datetime(x)
    delta = (end-start)
    return delta

# create new RDD and add new column 'Duration' by applying time_delta function
df2 = df.withColumn('Duration', time_delta(df.EndDateTime, df.StartDateTime)) 

However this just gives me:

>>> df2.show()
ID  EndDateTime          StartDateTime        ANI            Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... sip:4534454450 null    
X02 2014-02-13T12:35:... 2014-02-13T12:32:... sip:6413445440 null    
X03 2014-02-13T12:36:... 2014-02-13T12:32:... sip:4534437492 null    
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... sip:6474454453 null    
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... sip:8874458555 null  

I'm not sure if my approach is correct or not. If not, I'd gladly accept another suggested way to achieve this.

Recommended answer

Thanks to David Griffin. Here's how to do this for future reference.

from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
from pyspark.sql.functions import udf

# Build sample data
rdd = sc.parallelize([('X01','2014-02-13T12:36:14.899','2014-02-13T12:31:56.876'),
                      ('X02','2014-02-13T12:35:37.405','2014-02-13T12:32:13.321'),
                      ('X03','2014-02-13T12:36:03.825','2014-02-13T12:32:15.229'),
                      ('XO4','2014-02-13T12:37:05.460','2014-02-13T12:32:36.881'),
                      ('XO5','2014-02-13T12:36:52.721','2014-02-13T12:33:30.323')])
schema = StructType([StructField('ID', StringType(), True),
                     StructField('EndDateTime', StringType(), True),
                     StructField('StartDateTime', StringType(), True)])
df = sqlContext.createDataFrame(rdd, schema)

# define timedelta function (obtain duration in whole seconds)
def time_delta(y,x): 
    from datetime import datetime
    end = datetime.strptime(y, '%Y-%m-%dT%H:%M:%S.%f')
    start = datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%f')
    # total_seconds() returns a float; truncate to int so the value
    # matches the IntegerType declared for the UDF below
    delta = int((end-start).total_seconds())
    return delta

# register as a UDF 
f = udf(time_delta, IntegerType())

# Apply function
df2 = df.withColumn('Duration', f(df.EndDateTime, df.StartDateTime)) 

Applying time_delta() will give you duration in seconds:

>>> df2.show()
ID  EndDateTime          StartDateTime        Duration
X01 2014-02-13T12:36:... 2014-02-13T12:31:... 258     
X02 2014-02-13T12:35:... 2014-02-13T12:32:... 204     
X03 2014-02-13T12:36:... 2014-02-13T12:32:... 228     
XO4 2014-02-13T12:37:... 2014-02-13T12:32:... 268     
XO5 2014-02-13T12:36:... 2014-02-13T12:33:... 202 
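
The parsing logic can be checked outside Spark before wrapping it in a UDF. The following is a minimal sketch in plain Python, using the same format string and two of the sample rows above; the names `FMT`, `time_delta_seconds`, and `sample_rows` are made up for this illustration and are not part of the original answer.

```python
from datetime import datetime

# Same timestamp layout as the sample data (ISO-like, with fractional seconds)
FMT = '%Y-%m-%dT%H:%M:%S.%f'

def time_delta_seconds(end_str, start_str):
    """Return the duration between two timestamp strings as float seconds."""
    end = datetime.strptime(end_str, FMT)
    start = datetime.strptime(start_str, FMT)
    return (end - start).total_seconds()

# Two rows taken from the sample data: (ID, EndDateTime, StartDateTime)
sample_rows = [
    ('X01', '2014-02-13T12:36:14.899', '2014-02-13T12:31:56.876'),
    ('X02', '2014-02-13T12:35:37.405', '2014-02-13T12:32:13.321'),
]

# Truncate to whole seconds, as the Duration column in the table does
whole_seconds = [int(time_delta_seconds(end, start))
                 for _, end, start in sample_rows]

for (row_id, _, _), secs in zip(sample_rows, whole_seconds):
    print(row_id, secs)
```

The function itself returns fractional seconds, so if sub-second precision matters, one option is to keep the float and declare the UDF's return type as DoubleType instead of truncating.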
