激发到最近的假日的sql距离 [英] spark sql distance to nearest holiday
问题描述
在熊猫中,我有一个类似于
$ b
indices = df.dateColumn.apply(holidays.index.searchsorted)的函数
df ['nextHolidays'] = holidays.index [indices]
df ['previousHolidays'] = holidays.index [indices - 1]
它计算到最近的假期的距离并将其作为新列存储。
searchsorted
http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html 对熊猫来说是一个很好的解决方案,因为这给了我下一个假期的索引,而不会产生很高的算法复杂性。并行化熊猫适用例如这种方法比并行循环要快得多。
我如何在spark或hive中实现这个功能?
这可以使用聚合来完成,但此方法比pandas方法具有更高的复杂性。但是您可以使用UDF实现类似的性能。它不会像熊猫一样优雅,但:
假设假期的这个数据集:
节假日= ['2016-01-03','2016-09-09','2016-12-12','2016-03-03']
index = spark。 sparkContext.broadcast(排序(节假日))
2016 dataframe中的日期数据集:
from datetime import datetime,timedelta
dates_array = [(datetime(2016,1,1)+ timedelta(i))。 strftime('%Y-%m-%d')for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date = d)for d in date_array])
UDF可以使用pandas searchsorted $ c $但是需要在执行者上安装熊猫。 Insted你可以像这样使用Python的计划:
def nearest_holiday(date):
last_holiday = index.value [0 ]
用于index.value中的next_holiday:
if next_holiday> = date:
break
last_holiday = next_holiday
if last_holiday>日期:
last_holiday =无
if next_holiday< date:
next_holiday = None
return(last_holiday,next_holiday)
from pyspark.sql.types import *
return_type = StructType([StructField( 'last_holiday',StringType()),StructField('next_holiday',StringType())])
$ b从pyspark.sql.functions导入udf
nearest_holiday_udf = udf(nearest_holiday,return_type)
可以和 withColumn
一起使用:显示(5,False)
$ b
df.withColumn('holiday',nearest_holiday_udf('date')) $ b + ---------- + ----------------------- +
| date | holiday |
+ ---------- + ----------------------- +
| 2016-01-01 | [null,2016-01-03] |
| 2016-01-02 | [null,2016-01-03] |
| 2016-01-03 | [2016-01-03,2016-01-03] |
| 2016-01-04 | [2016-01-03,2016-03-03] |
| 2016-01-05 | [2016-01-03,2016-03-03] |
+ ---------- + ----------------------- +
只显示前5行
In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which calculates the distance to the nearest holiday and stores that as a new column.
searchsorted
http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html was a great solution for pandas as this gives me the index of the next holiday without a high algorithmic complexity Parallelize pandas apply e.g. this approach was a lot quicker then parallel looping.
How can I achieve this in spark or hive?
This can be done using aggregations but this method would have higher complexity than pandas method. But you can achieve similar performance using UDFs. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))
And dataset of dates of 2016 in dataframe:
from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])
The UDF can use pandas searchsorted
but would need to install pandas on executors. Insted you can use plan python like this:
def nearest_holiday(date):
last_holiday = index.value[0]
for next_holiday in index.value:
if next_holiday >= date:
break
last_holiday = next_holiday
if last_holiday > date:
last_holiday = None
if next_holiday < date:
next_holiday = None
return (last_holiday, next_holiday)
from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])
from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
And can be used with withColumn
:
df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)
+----------+-----------------------+
|date |holiday |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03] |
|2016-01-02|[null,2016-01-03] |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows
这篇关于激发到最近的假日的sql距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!