Filter pyspark dataframe based on time difference between two columns


Question

I have a dataframe with multiple columns, two of which are of type pyspark.sql.TimestampType. I would like to filter this dataframe to rows where the time difference between these two columns is less than one hour.

I'm currently trying to do this like so:

examples = data.filter((data.tstamp - data.date) < datetime.timedelta(hours=1))

But this fails with the following error message:

org.apache.spark.sql.AnalysisException: cannot resolve '(`tstamp` - `date`)' due to data type mismatch: '(`tstamp` - `date`)' requires (numeric or calendarinterval) type, not timestamp

What is the correct method to achieve this filter?

Answer

Your columns have different types, which makes it hard to say what their difference should mean: for timestamps it is usually seconds, while for dates it is days. You can convert both columns to Unix timestamps beforehand to get the difference in seconds:

import pyspark.sql.functions as psf

# keep only rows where the two columns are less than one hour (3600 seconds) apart
data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
)
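
As a side note (not from the original answer), the same comparison can also be written with casts instead of unix_timestamp: a timestamp cast to long gives seconds since the epoch, and the date column can be cast to a timestamp first. A minimal sketch, assuming the same data dataframe:

# hypothetical cast-based variant of the filter above
data.filter(
    psf.abs(data.tstamp.cast("long") - data.date.cast("timestamp").cast("long")) < 3600
)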

EDIT

This function will work on strings (provided they are in the correct format), on timestamps and on dates:

import datetime
data = hc.createDataFrame(sc.parallelize([[datetime.datetime(2017,1,2,1,1,1), datetime.date(2017,8,7)]]), ['tstamp', 'date'])
data.printSchema()
    root
     |-- tstamp: timestamp (nullable = true)
     |-- date: date (nullable = true)

data.select(
    psf.unix_timestamp(data.tstamp).alias('tstamp'), psf.unix_timestamp(data.date).alias("date")
).show()
    +----------+----------+
    |    tstamp|      date|
    +----------+----------+
    |1483315261|1502056800|
    +----------+----------+
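
Applying the filter from the answer to this one-row dataframe should return nothing, since the two values are roughly seven months apart, far more than 3600 seconds (a quick sanity check, not part of the original answer):

data.filter(
    psf.abs(psf.unix_timestamp(data.tstamp) - psf.unix_timestamp(data.date)) < 3600
).count()
    0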
