激发到最近的假日的sql距离 [英] spark sql distance to nearest holiday

查看:222
本文介绍了激发到最近的假日的sql距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在熊猫中,我有一个类似于
$ b

  indices = df.dateColumn.apply(holidays.index.searchsorted)的函数
df ['nextHolidays'] = holidays.index [indices]
df ['previousHolidays'] = holidays.index [indices - 1]

它计算到最近的假期的距离并将其作为新列存储。



searchsorted http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html 对熊猫来说是一个很好的解决方案,因为这给了我下一个假期的索引,而不会产生很高的算法复杂性。并行化熊猫适用例如这种方法比并行循环要快得多。



我如何在spark或hive中实现这个功能?

解决方案

这可以使用聚合来完成,但此方法比pandas方法具有更高的复杂性。但是您可以使用UDF实现类似的性能。它不会像熊猫一样优雅,但:

假设假期的这个数据集:

 节假日= ['2016-01-03','2016-09-09','2016-12-12','2016-03-03'] 
index = spark。 sparkContext.broadcast(排序(节假日))

2016 dataframe中的日期数据集:

  from datetime import datetime,timedelta 
dates_array = [(datetime(2016,1,1)+ timedelta(i))。 strftime('%Y-%m-%d')for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date = d)for d in date_array])

UDF可以使用pandas searchsorted

  def nearest_holiday(date):
last_holiday = index.value [0 ]
用于index.value中的next_holiday:
if next_holiday> = date:
break
last_holiday = next_holiday
if last_holiday>日期:
last_holiday =无
if next_holiday< date:
next_holiday = None
return(last_holiday,next_holiday)


from pyspark.sql.types import *
return_type = StructType([StructField( 'last_holiday',StringType()),StructField('next_holiday',StringType())])
$ b从pyspark.sql.functions导入udf
nearest_holiday_udf = udf(nearest_holiday,return_type)

可以和 withColumn 一起使用: $ b

 df.withColumn('holiday',nearest_holiday_udf('date')) $ b + ---------- + ----------------------- + 
| date | holiday |
+ ---------- + ----------------------- +
| 2016-01-01 | [null,2016-01-03] |
| 2016-01-02 | [null,2016-01-03] |
| 2016-01-03 | [2016-01-03,2016-01-03] |
| 2016-01-04 | [2016-01-03,2016-03-03] |
| 2016-01-05 | [2016-01-03,2016-03-03] |
+ ---------- + ----------------------- +
只显示前5行


In pandas I have a function similar to

indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]

which calculates the distance to the nearest holiday and stores that as a new column.

searchsorted http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html was a great solution for pandas as this gives me the index of the next holiday without a high algorithmic complexity Parallelize pandas apply e.g. this approach was a lot quicker then parallel looping.

How can I achieve this in spark or hive?

解决方案

This can be done using aggregations but this method would have higher complexity than pandas method. But you can achieve similar performance using UDFs. It won't be as elegant as pandas, but:

Assuming this dataset of holidays:

holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))

And dataset of dates of 2016 in dataframe:

from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])

The UDF can use pandas searchsorted but would need to install pandas on executors. Insted you can use plan python like this:

def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)


from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])

from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)

And can be used with withColumn:

df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)

+----------+-----------------------+
|date      |holiday                |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03]      |
|2016-01-02|[null,2016-01-03]      |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows

这篇关于激发到最近的假日的sql距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆