基于时差的 pd.merge_asof() 不合并所有值 - Pandas [英] pd.merge_asof() based on Time-Difference not merging all values - Pandas

查看:66
本文介绍了基于时差的 pd.merge_asof() 不合并所有值 - Pandas的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个数据框,一个是新闻,另一个是股票价格.两个数据框都有一个日期"列.我想在 5 天的间隔内合并它们.

I have two dataframes, one with news and the other with stock price. Both the dataframes have a "Date" column. I want to merge them on a gap of 5 days.

假设我的新闻数据框是 df1,另一个价格数据框是 df2.

Lets say my news dataframe is df1 and the other price dataframe as df2.

我的 df1 看起来像这样:

My df1 looks like this:

News_Dates             News
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case
2018-09-30     ABC Corp. suffers a loss
2018-10-01     ABC Corp to Sell stakes
2018-12-20     We are going to comeback strong said ABC CEO
2018-12-22     Shares are down massively for ABC Corp.

我的 df2 看起来像这样:

My df2 looks like this:

  Dates             Price
2018-10-04           120
2018-12-24           131

我做的第一种合并方法是:

First method of merging I do is:

pd.merge_asof(df1_zscore.sort_values(by=['Dates']), df_n.sort_values(by=['News_Dates']), left_on=['Dates'], right_on=['News_Dates'] \
              tolerance=pd.Timedelta('5d'), direction='backward')

得到的 df 是:

  Dates       News_Dates   News                                     Price
2018-10-04    2018-10-01  ABC Corp to Sell stakes                    120
2018-12-24    2018-12-22  Shares are down massively for ABC Corp.    131

我做的第二种合并方式是:

The second way of merging I do is:

pd.merge_asof(df_n.sort_values(by=['Dates']), df1_zscore.sort_values(by=['Dates']), left_on=['News_Dates'], right_no=['Dates'] \
              tolerance=pd.Timedelta('5d'), direction='forward').dropna()

结果df为:

News_Dates            News                                                Dates      Price
2018-09-29     Huge blow to ABC Corp. as they lost the 2012 tax case    2018-10-04    120
2018-09-30     ABC Corp. suffers a loss                                 2018-10-04    120 
2018-10-01     ABC Corp to Sell stakes                                  2018-10-04    120
2018-12-22     Shares are down massively for ABC Corp.                  2018-12-24    131

两个合并结果都在单独的 dfs 中,但是两种情况下都有缺失的值,例如 10 月 4 日价格的第二种情况,9 月 29 日、9 月 30 日的新闻应该已合并.如果 12 月 24 日的价格 2 也应该合并 12 月 20 日的价格.

Both the merging results in separate dfs, however there are values in both the cases which are missing, like for second case for 4th October price, news from 29th, 30th Sept should have been merged. And in case 2 for 24th December price 20th December should also have been merged.

所以我不太能弄清楚我哪里出错了.

So I'm not quite able to figure out where am I going wrong.

附言我的目标是将价格 df 与价格日期后过去 5 天内出现的新闻 df 合并.

P.S. My objective is to merge the price df with the news df that have come in the last 5 days from the price date.

推荐答案

这是我使用 numpy 的解决方案

Here is my solution using numpy

df_n = pd.DataFrame([('2018-09-29', 'Huge blow to ABC Corp. as they lost the 2012 tax case'), ('2018-09-30', 'ABC Corp. suffers a loss'), ('2018-10-01', 'ABC Corp to Sell stakes'), ('2018-12-20', 'We are going to comeback strong said ABC CEO'), ('2018-12-22', 'Shares are down massively for ABC Corp.')], columns=('News_Dates', 'News'))
df1_zscore = pd.DataFrame([('2018-10-04', '120'), ('2018-12-24', '131')], columns=('Dates', 'Price'))

df_n["News_Dates"] = pd.to_datetime(df_n["News_Dates"])
df1_zscore["Dates"] = pd.to_datetime(df1_zscore["Dates"])

n_dates = df_n["News_Dates"].values
p_dates = df1_zscore[["Dates"]].values

## substract each pair of n_dates and p_dates and create a matrix
mat_date_compare = (p_dates - n_dates).astype('timedelta64[D]')

## get matrix of boolean for which difference is between 0 and 5 day
## to be used as index for original array
comparision =  (mat_date_compare <= pd.Timedelta("5d")) & (mat_date_compare >= pd.Timedelta("0d"))

## get cell numbers which is in range 0 to matrix size which meets the condition
ind = np.arange(len(n_dates)*len(p_dates))[comparision.ravel()]


## calculate row and column index from cell number to index the df
pd.concat([df1_zscore.iloc[ind//len(n_dates)].reset_index(drop=True), 
           df_n.iloc[ind%len(n_dates)].reset_index(drop=True)], sort=False, axis=1)

结果

Dates   Price   News_Dates  News
0   2018-10-04  120 2018-09-29  Huge blow to ABC Corp. as they lost the 2012 t...
1   2018-10-04  120 2018-09-30  ABC Corp. suffers a loss
2   2018-10-04  120 2018-10-01  ABC Corp to Sell stakes
3   2018-12-24  131 2018-12-20  We are going to comeback strong said ABC CEO
4   2018-12-24  131 2018-12-22  Shares are down massively for ABC Corp.

这篇关于基于时差的 pd.merge_asof() 不合并所有值 - Pandas的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆