Applying interpolation on DataFrame based on another DataFrame


Problem description


I have a DataFrame to which I would like to add new columns based on the value of a specific column, where the result depends on data contained in another DataFrame.

More specifically, I have

df_original = 

    Crncy  Spread  Duration
0   EUR    100     1.2
1   nan    nan     nan
2          100     3.46
3   CHF    200     2.5
4   USD    50      5.0
...

df_interpolation = 

    CRNCY  TENOR   Adj_EUR   Adj_USD
0   EUR    1       10        20    
1   EUR    2       20        30  
2   EUR    5       30        40  
3   EUR    7       40        50  
...
10  CHF    1       50        10  
11  CHF    2       60        20  
12  CHF    5       70        30  
...

and would now like to add the columns Adj_EUR and Adj_USD to df_original for each row, based on the value of Crncy and Duration using standard linear interpolation.

So, we want to use TENOR and Adj_USD/Adj_EUR from df_interpolation and Duration from df_original, for each available Crncy, to form the interpolation.

For example, in pseudo-code using the optimize package from scipy:

from scipy import optimize

# Do this for both 'Adj_EUR' and 'Adj_USD'

# For 'Adj_EUR'
for curr, df in df_original.groupby('Crncy'):

    # Tenors and adjustments available for this currency
    mask = df_interpolation['CRNCY'] == curr
    x_data = df_interpolation.loc[mask, 'TENOR'].to_numpy(dtype=float)
    y_data = df_interpolation.loc[mask, 'Adj_EUR'].to_numpy(dtype=float)

    # Linear fit
    z_linear = optimize.curve_fit(lambda t, a, b: a + b * t, x_data, y_data)[0]
    # Somehow add the values back to df_original in a new column
    # (assigning to the group df would only change a copy, so write through df_original)
    df_original.loc[df.index, 'Adj_EUR'] = z_linear[0] + z_linear[1] * df['Duration']

Yielding

    Crncy  Spread  Duration  Adj_EUR  Adj_USD
0   EUR    100     1.2       12       22
1   nan    nan     nan       0.0      0.0
...
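
A side note on the expected numbers: the value 12 for EUR at Duration 1.2 is what piecewise linear interpolation between the neighbouring tenors 1 and 2 gives, whereas a single least-squares line through all four EUR points gives a different number. The following is a minimal sketch of that difference, using only the EUR rows shown above; np.polyfit is used here as a stand-in for the curve_fit call, which is an assumption made to keep the example dependent on NumPy alone.

```python
import numpy as np

# EUR rows of df_interpolation as shown in the question
tenor = np.array([1.0, 2.0, 5.0, 7.0])
adj_eur = np.array([10.0, 20.0, 30.0, 40.0])
duration = 1.2

# Piecewise linear interpolation between the two nearest tenors
interpolated = np.interp(duration, tenor, adj_eur)

# One straight line fitted to all four points by least squares
slope, intercept = np.polyfit(tenor, adj_eur, deg=1)
fitted = intercept + slope * duration

print(f"interpolated: {interpolated:.4f}")
print(f"fitted line:  {fitted:.4f}")
```

The two results differ because the four EUR points do not lie on one straight line.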

Any clue on how to do this?

Much appreciated.

Solution

Suppose we have df1 and df2:

>>> df1
  Crncy  Spread  Duration
0   EUR     100       1.2
1   CHF     200       2.5


>>> df2
  CRNCY  TENOR  Adj_EUR  Adj_USD
0   EUR      1       10       20
1   EUR      2       20       30
2   EUR      5       30       40
3   EUR      7       40       50
4   CHF      1       50       10
5   CHF      2       60       20
6   CHF      5       70       30
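
To follow along, the two sample frames can be built by hand. This is only a sketch that reproduces the values printed above; the construction itself is not part of the original answer.

```python
import pandas as pd

# Rows to be enriched: one EUR row and one CHF row
df1 = pd.DataFrame({
    "Crncy": ["EUR", "CHF"],
    "Spread": [100, 200],
    "Duration": [1.2, 2.5],
})

# Lookup table: adjustments per currency and tenor
df2 = pd.DataFrame({
    "CRNCY": ["EUR", "EUR", "EUR", "EUR", "CHF", "CHF", "CHF"],
    "TENOR": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})

print(df1)
print(df2)
```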

Give df1 and df2 the same set of columns, and add a left flag that marks which rows came from df1:

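import numpy as np
import pandas as pd

# Placeholder columns to be filled by interpolation; left = 1 marks df1 rows
df1['Adj_EUR'] = np.nan
df1['Adj_USD'] = np.nan
df1['left'] = 1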

>>> df1
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
0   EUR     100       1.2      NaN      NaN     1
1   CHF     200       2.5      NaN      NaN     1

df2 = df2.rename(columns={'CRNCY': 'Crncy', 'TENOR': 'Duration'})
df2['Spread'] = np.nan
df2['left'] = 0

>>> df2
  Crncy  Duration  Adj_EUR  Adj_USD  Spread  left
0   EUR         1       10       20     NaN     0
1   EUR         2       20       30     NaN     0
2   EUR         5       30       40     NaN     0
3   EUR         7       40       50     NaN     0
4   CHF         1       50       10     NaN     0
5   CHF         2       60       20     NaN     0
6   CHF         5       70       30     NaN     0

Now concatenate df1 and df2 row-wise and sort by Crncy and Duration, so that each df1 row sits between its neighbouring tenors:

df3 = pd.concat([df1, df2], ignore_index=True, sort=False).sort_values(['Crncy', 'Duration'])

>>> df3
  Crncy  Spread  Duration  Adj_EUR  Adj_USD  left
6   CHF     NaN       1.0     50.0     10.0     0
7   CHF     NaN       2.0     60.0     20.0     0
1   CHF   200.0       2.5      NaN      NaN     1
8   CHF     NaN       5.0     70.0     30.0     0
2   EUR     NaN       1.0     10.0     20.0     0
0   EUR   100.0       1.2      NaN      NaN     1
3   EUR     NaN       2.0     20.0     30.0     0
4   EUR     NaN       5.0     30.0     40.0     0
5   EUR     NaN       7.0     40.0     50.0     0
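
The reason for sorting by Duration is that the missing values can then be filled from their neighbours, weighted by the distance along the Duration axis. A minimal sketch of that behaviour on the three EUR rows around Duration 1.2, comparing the default linear method of Series.interpolate with method='index' (the variable names here are illustrative only):

```python
import numpy as np
import pandas as pd

# Adj_EUR for EUR at tenors 1 and 2, with the df1 row (Duration 1.2) in between
s = pd.Series([10.0, np.nan, 20.0], index=[1.0, 1.2, 2.0])

# Default method: rows are treated as equally spaced, the index is ignored
by_position = s.interpolate(method="linear")

# method='index': the gap is filled in proportion to the index values
by_index = s.interpolate(method="index")

print(by_position.loc[1.2])  # 15.0, halfway between 10 and 20
print(by_index.loc[1.2])     # 12.0 up to floating-point rounding
```

This is why Duration is moved into the index before interpolating.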

Then, within each currency, interpolate the NaN values of the adjustment columns along Duration, keep only the rows that came from df1, and drop the helper column:

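df3 = df3.set_index('Duration')
adj_cols = ['Adj_EUR', 'Adj_USD']
# Interpolate each adjustment column against the Duration index, one currency at a time
df3[adj_cols] = df3.groupby('Crncy')[adj_cols].transform(lambda s: s.interpolate(method='index'))
df4 = df3.reset_index()
df4 = df4[['Crncy', 'Spread', 'Duration', 'Adj_EUR', 'Adj_USD', 'left']]
# Keep only the original df1 rows and drop the helper flag
df4 = df4.loc[df4['left'] == 1].drop('left', axis=1).reset_index(drop=True)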

>>> df4
  Crncy  Spread  Duration    Adj_EUR    Adj_USD
0   CHF   200.0       2.5  61.666667  21.666667
1   EUR   100.0       1.2  12.000000  22.000000
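
As an alternative to concatenating the frames, the same numbers can be obtained by calling np.interp once per currency and column. The sketch below assumes the column names used in this question; add_adjustments is a hypothetical helper name, not something from pandas or from the answer above. Note that np.interp holds the end values for durations outside the available tenor range rather than extrapolating.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Crncy": ["EUR", "CHF"], "Spread": [100, 200],
                    "Duration": [1.2, 2.5]})
df2 = pd.DataFrame({
    "CRNCY": ["EUR", "EUR", "EUR", "EUR", "CHF", "CHF", "CHF"],
    "TENOR": [1, 2, 5, 7, 1, 2, 5],
    "Adj_EUR": [10, 20, 30, 40, 50, 60, 70],
    "Adj_USD": [20, 30, 40, 50, 10, 20, 30],
})


def add_adjustments(original, table, value_cols=("Adj_EUR", "Adj_USD")):
    """Hypothetical helper: interpolate each value column at Duration, per currency."""
    result = original.copy()
    for col in value_cols:
        result[col] = np.nan  # rows with no matching currency stay NaN
    for crncy, curve in table.groupby("CRNCY"):
        curve = curve.sort_values("TENOR")  # np.interp needs increasing x values
        rows = result["Crncy"] == crncy
        x = result.loc[rows, "Duration"].to_numpy(dtype=float)
        for col in value_cols:
            result.loc[rows, col] = np.interp(
                x,
                curve["TENOR"].to_numpy(dtype=float),
                curve[col].to_numpy(dtype=float),
            )
    return result


out = add_adjustments(df1, df2)
print(out)
```

Unlike the concat approach, this keeps the rows of df1 in their original order and leaves df1 itself unmodified.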

Hope this helps.
