How to subtract a column of days from a column of dates in Pyspark?
Problem description
Given the following PySpark DataFrame:
df = sqlContext.createDataFrame([('2015-01-15', 10),
                                 ('2015-02-15', 5)],
                                ('date_col', 'days_col'))
How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].
I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']).
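A version-agnostic workaround is to hand the whole SQL fragment to expr(), which lets date_sub reference both columns. A minimal sketch, assuming the df defined above (newer PySpark releases reportedly also accept a column for the days argument directly, but expr() works on older versions too):

from pyspark.sql.functions import expr

# the SQL parser resolves both column references inside date_sub
df.withColumn('subtracted_dates', expr('date_sub(date_col, days_col)'))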
I also tried creating a UDF:
from datetime import timedelta
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def subtract_date(start_date, days_to_subtract):
    # start_date arrives as a datetime.date, days_to_subtract as an int
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))
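One caveat worth noting: for the UDF to receive a datetime.date, date_col must actually be a DateType column, and the example DataFrame above stores it as a string. Under that assumption, it would need a cast first:

# assumes date_col holds ISO-format date strings, as in the example DataFrame
df = df.withColumn('date_col', df['date_col'].cast('date'))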
This technically works, but I've read that stepping between Spark and Python can cause performance issues for large datasets. I can stick with this solution for now (no need to prematurely optimize), but my gut says there's just got to be a way to do this simple thing without using a Python UDF.
Recommended answer
I was able to solve this using selectExpr.
df.selectExpr('date_sub(date_col, days_col) as subtracted_dates')
If you want to append the column to the original DF, just add * to the expression:
df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates')
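For completeness, a minimal end-to-end sketch against the example DataFrame from the question; the show() output below is reconstructed by hand to match the expected result stated above:

df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates').show()
# +----------+--------+----------------+
# |  date_col|days_col|subtracted_dates|
# +----------+--------+----------------+
# |2015-01-15|      10|      2015-01-05|
# |2015-02-15|       5|      2015-02-10|
# +----------+--------+----------------+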