How to subtract a column of days from a column of dates in Pyspark?
Question
Given the following PySpark DataFrame:
df = sqlContext.createDataFrame([('2015-01-15', 10),
                                 ('2015-02-15', 5)],
                                ('date_col', 'days_col'))
How can the days column be subtracted from the date column? In this example, the resulting column should be ['2015-01-05', '2015-02-10'].
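As a sanity check on those expected values, the underlying arithmetic outside Spark is plain datetime subtraction (a minimal sketch using the same sample rows as the DataFrame above):

```python
from datetime import date, timedelta

# The same sample rows as the DataFrame, as plain Python values
rows = [(date(2015, 1, 15), 10), (date(2015, 2, 15), 5)]
result = [d - timedelta(days=n) for d, n in rows]
print(result)  # [datetime.date(2015, 1, 5), datetime.date(2015, 2, 10)]
```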
I looked into pyspark.sql.functions.date_sub(), but it requires a date column and a single day, i.e. date_sub(df['date_col'], 10). Ideally, I'd prefer to do date_sub(df['date_col'], df['days_col']).
I also tried creating a UDF:
from datetime import timedelta
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

def subtract_date(start_date, days_to_subtract):
    return start_date - timedelta(days=days_to_subtract)

subtract_date_udf = udf(subtract_date, DateType())
df.withColumn('subtracted_dates', subtract_date_udf(df['date_col'], df['days_col']))
This technically works, but I've read that stepping between Spark and Python can cause performance issues for large datasets. I can stick with this solution for now (no need to prematurely optimize), but my gut says there's just got to be a way to do this simple thing without using a Python UDF.
Answer
I was able to solve this using selectExpr:
df.selectExpr('date_sub(date_col, days_col) as subtracted_dates')
If you want to append the column to the original DF, just add * to the expression:
df.selectExpr('*', 'date_sub(date_col, days_col) as subtracted_dates')