Parallelize pandas apply

Views: 71 Published: 2020/5/24 2:34:47 Tags: python pandas parallel-processing apply embarrassingly-parallel


Problem description

I am new to pandas and already want to parallelize a row-wise apply operation. So far I have found "Parallelize apply after pandas groupby". However, that only seems to work for grouped data frames.

My use case is different: I have a list of holidays, and for the date in the current row I want to find the number of days from that date back to the previous holiday and forward to the next holiday.

This is the function I call via apply:

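import numpy as np

def get_nearest_holiday(x, pivot):
    # x: iterable of holiday timestamps; pivot: the date of the current row
    nearestHoliday = min(x, key=lambda h: abs(h - pivot))
    difference = abs(nearestHoliday - pivot)
    # Convert the timedelta to a float number of days
    return difference / np.timedelta64(1, 'D')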
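
For reference, here is a minimal sketch of how such a function might be called row by row. The question does not show the call itself, so the DataFrame `df`, its `date` column, the `days_to_nearest` column, and the short `holidays` index (the 2016 dates that appear in the answer's output) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def get_nearest_holiday(x, pivot):
    # Linear scan over all holidays for a single date.
    nearest_holiday = min(x, key=lambda h: abs(h - pivot))
    return abs(nearest_holiday - pivot) / np.timedelta64(1, "D")


# Assumed sample data: the 2016 US federal holidays and a few dates.
holidays = pd.to_datetime(["2016-01-01", "2016-01-18", "2016-02-15",
                           "2016-05-30", "2016-07-04", "2016-09-05",
                           "2016-10-10", "2016-11-11", "2016-11-24",
                           "2016-12-26"])
df = pd.DataFrame({"date": pd.to_datetime(
    ["2016-01-03", "2016-09-09", "2016-12-12", "2016-03-03"])})

# Row-wise call: every row scans the full holiday list in Python.
df["days_to_nearest"] = df["date"].apply(
    lambda d: get_nearest_holiday(holidays, d))
```

Written this way, the work grows with (number of rows) x (number of holidays), all of it in interpreted Python, which is probably why it feels slow on a large frame.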

How can I speed it up?

Edit

I experimented a bit with Python's pools, but the code was not nice and I did not get my computed results back.

Solution

I think going down the route of trying stuff in parallel is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's just start with some dates...

import pandas as pd

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We'll use some holiday data from pandas.tseries.holiday - note that in effect we want a DatetimeIndex...

from pandas.tseries.holiday import USFederalHolidayCalendar

holiday_calendar = USFederalHolidayCalendar()
holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
               '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
               '2016-11-24', '2016-12-26',
               ...
               '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
               '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
               '2030-11-28', '2030-12-25'],
              dtype='datetime64[ns]', length=150, freq=None)

Now we find, for each of the original dates, the index of the next holiday on or after that date using searchsorted:

indices = holidays.searchsorted(dates)
# array([1, 6, 9, 3])
next_nearest = holidays[indices]
# DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'], dtype='datetime64[ns]', freq=None)
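
One detail worth checking is what happens when a date is itself a holiday. The following is a small sketch, using an assumed three-holiday index and two made-up dates, of how the `side` argument of searchsorted changes the result in that case.

```python
import pandas as pd

# Assumed toy data: the second date falls exactly on a holiday.
holidays = pd.to_datetime(["2016-01-01", "2016-01-18", "2016-02-15"])
dates = pd.to_datetime(["2016-01-10", "2016-01-18"])

# side="left" (the default): index of the first holiday >= date,
# so a date that is a holiday maps to itself.
left = holidays.searchsorted(dates, side="left")

# side="right": index of the first holiday > date,
# so a date that is a holiday maps to the following holiday.
right = holidays.searchsorted(dates, side="right")

print(left, right)
```

Which side is appropriate depends on whether a holiday should count as zero days away from itself.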

Then take the difference between the two:

next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
# array([15, 31, 14, 88])

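You'll need to be careful that the indices stay in range: searchsorted returns len(holidays) for a date after the last holiday, which raises an IndexError when used to index holidays, and indices - 1 becomes -1 for a date before the first holiday, which silently wraps around to the last holiday. For the previous holiday, do the same calculation with indices - 1. Still, this should (I hope) act as a relatively good base.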
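
The bounds handling described above could be sketched as follows. This is one possible approach under stated assumptions, not a tested solution for large data: the function name `days_to_adjacent_holidays` is hypothetical, `holidays` is assumed to be a sorted, non-empty DatetimeIndex, and NaN is used to mark dates that have no holiday on one side.

```python
import numpy as np
import pandas as pd


def days_to_adjacent_holidays(dates, holidays):
    """Return (days_since_previous, days_until_next) as float arrays.

    Hypothetical helper. NaN marks dates with no holiday on that side.
    """
    dates = pd.DatetimeIndex(dates)
    n = len(holidays)
    idx = holidays.searchsorted(dates)  # first holiday >= date

    has_next = idx < n   # False for dates after the last holiday
    has_prev = idx > 0   # False for dates on or before the first holiday

    # Clip so indexing never goes out of range or wraps around to -1;
    # the invalid positions are overwritten with NaN afterwards.
    next_h = holidays[np.clip(idx, 0, n - 1)]
    prev_h = holidays[np.clip(idx - 1, 0, n - 1)]

    until_next = np.asarray((next_h - dates).days, dtype=float)
    since_prev = np.asarray((dates - prev_h).days, dtype=float)
    until_next[~has_next] = np.nan
    since_prev[~has_prev] = np.nan
    return since_prev, until_next


# Assumed sample data: 2016 holidays only, with dates on both edges.
holidays = pd.to_datetime(["2016-01-01", "2016-01-18", "2016-02-15",
                           "2016-05-30", "2016-07-04", "2016-09-05",
                           "2016-10-10", "2016-11-11", "2016-11-24",
                           "2016-12-26"])
dates = pd.to_datetime(["2015-12-25", "2016-01-03",
                        "2016-09-09", "2016-12-30"])

since_prev, until_next = days_to_adjacent_holidays(dates, holidays)
# Distance to the nearest holiday in either direction, ignoring NaN.
nearest = np.fmin(since_prev, until_next)
```

Because searchsorted uses the default side="left" here, a date that is itself a holiday gets 0 days until the next holiday, and its "previous" holiday is the one before it.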
