How to efficiently merge two dataframes with a tolerance in Python


Problem description

Goal:

I want to merge two dataframes, df1 and df2, with a tolerance, and do so efficiently in Python. df1 has shape (l, 2) and df2 has shape (p, 13), with l < m < p. My target dataframe df3, with shape (m, 13), is supposed to contain all matches within the tolerance, not only the closest match.

I want to merge Col0 of df1 with Col2 of df2 with a tolerance "tolerance".

Example:

df1:

Index, Col0, Col1
0, 1008.5155, n01

df2:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ... 
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...

df3:

Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k13, n01, k15, k16, ...
1, 1008.5155, 0.1093, 1008.6248, k123, n01, k25, k26, ...

To illustrate, Col1 of df3 holds the absolute difference between the matched values of df1 and df2, so it has to be smaller than the tolerance.

My current solution takes a lot of time and requires a lot of memory.

# Create empty list to collect matches
df3_list = []
df3_array = np.asarray(df3_list)

# loops to find matches. Fills array with matches
df3_row = np.asarray([0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0])

for n in range(len(df1)):
    for k in range(len(df2)):
        if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance:
            df3_row[0] = df1.iloc[n,0]
            df3_row[1] = abs(df1.iloc[n,0]-df2.iloc[k,2])
            df3_row[2] = df2.iloc[k,2]
            df3_row[3] = df2.iloc[k,3]
            df3_row[4] = df1.iloc[n,1]
            df3_row[5] = df2.iloc[k,5]
                       .
                       .
                       .

            df3_array = np.append(df3_array, df3_row)

# convert list into dataframe
df3 = pd.DataFrame(df3_array.T.reshape(-1,13), columns = header)

I have also tried to get both indices in one go with

[[n, k]  for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]

However, it only gives me an empty list, so I must be doing something wrong.
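The list comprehension above comes back empty because zip pairs the two ranges in lockstep and stops at the shorter one, so it only tests (0, 0), (1, 1), ... rather than every combination of n and k. A minimal sketch of the difference, using itertools.product to visit all pairs; the plain lists and the tolerance value are stand-ins made up for illustration:

```python
from itertools import product

col0 = [1008.5155, 510.0]                 # stand-in for df1's Col0
col2 = [510.0103, 1007.6176, 1008.6248]   # stand-in for df2's Col2
tolerance = 1.0                           # assumed value, not given in the question

# zip walks both ranges together: only (0, 0) and (1, 1) are tested
zip_pairs = [[n, k] for n, k in zip(range(len(col0)), range(len(col2)))
             if abs(col0[n] - col2[k]) < tolerance]

# product yields the full Cartesian product: every (n, k) is tested
all_pairs = [[n, k] for n, k in product(range(len(col0)), range(len(col2)))
             if abs(col0[n] - col2[k]) < tolerance]
```

This only fixes the pairing. It is still a Python-level double loop, so it will not be faster than the nested for loops.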

For the respective arrays, I have also tried to use

np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]

However, np.isclose + np.nonzero only gave me indices of df2, and many more of them than my loop-intensive approach found. Without the corresponding indices of df1, I am kind of lost. I think this last approach is the most promising, yet I seem unable to merge the data sets because the values are not exact matches and because the closest match is not always the correct solution. Any ideas on how to overcome this problem?
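Indexing the result of np.nonzero with [-1] keeps only the index array of the last axis, which would explain why only one set of indices came back. A minimal sketch of a pairwise comparison that keeps both index arrays, assuming the two key columns are available as 1-D float arrays (a, b and the tolerance value are stand-ins for illustration):

```python
import numpy as np

a = np.array([1008.5155, 510.0])                # stand-in for df1's Col0
b = np.array([510.0103, 1007.6176, 1008.6248])  # stand-in for df2's Col2
tolerance = 1.0                                 # assumed value

# Broadcasting (len(a), 1) against (1, len(b)) gives a (len(a), len(b)) mask:
# close[i, j] is True when a[i] and b[j] differ by at most tolerance.
# rtol=0 keeps the comparison purely absolute.
close = np.isclose(a[:, None], b[None, :], rtol=0, atol=tolerance)

# np.nonzero returns one index array per axis, so both sides are recovered
idx1, idx2 = np.nonzero(close)
```

The boolean mask holds len(a) * len(b) entries, so for large frames it would have to be built in chunks to keep memory in check.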

Solution

You need to divide this problem into parts:

  1. Find the corresponding close indices
  2. Join the DataFrames on those indices
  3. Do your extra calculations

Find the indices

Using np.isclose, this is a very simple generator function which, for each row of df1, yields a DataFrame containing the index of that row and the indices of the df2 rows that are close to it.

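import numpy as np
import pandas as pd

def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    # For each row of df1, yield the (idx1, idx2) pairs whose values lie within tolerance
    for index, value in df1[df1_col].items():
        # rtol=0 makes the test purely absolute; the default rtol=1e-05 would widen it slightly
        indices = df2.index[np.isclose(df2[df2_col].values, value, rtol=0, atol=tolerance)]
        s = pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
        yield s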

Then we can easily concatenate these to get a helper DataFrame containing the matching index pairs.

df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)

To test this I added a 2nd record to df1

df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''

   idx1  idx2
0     0     1
1     0     2
2     1     0
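find_close scans all of df2 once per row of df1. For large inputs, one possible alternative is to sort the df2 column once and locate each tolerance window with np.searchsorted. The sketch below is not part of the original answer; find_close_sorted is a hypothetical name, and it assumes both key columns are numeric and free of NaN:

```python
import numpy as np
import pandas as pd

def find_close_sorted(df1, df1_col, df2, df2_col, tolerance=1):
    """Return (idx1, idx2) index-label pairs whose values lie within tolerance."""
    v1 = df1[df1_col].to_numpy(dtype=float)
    v2 = df2[df2_col].to_numpy(dtype=float)
    order = np.argsort(v2, kind="stable")      # df2 positions in ascending value order
    v2_sorted = v2[order]
    # For every df1 value, the half-open window [lo, hi) of matches in v2_sorted
    lo = np.searchsorted(v2_sorted, v1 - tolerance, side="left")
    hi = np.searchsorted(v2_sorted, v1 + tolerance, side="right")
    counts = hi - lo
    # Repeat each df1 position once per match it has
    pos1 = np.repeat(np.arange(len(v1)), counts)
    # Expand every window into its individual sorted positions
    first = np.repeat(np.cumsum(counts) - counts, counts)
    offsets = np.arange(counts.sum()) - first
    pos2 = order[np.repeat(lo, counts) + offsets]
    return pd.DataFrame({"idx1": df1.index.to_numpy()[pos1],
                         "idx2": df2.index.to_numpy()[pos2]})
```

Under those assumptions it could stand in for the pd.concat(find_close(...)) call, for example df_idx = find_close_sorted(df1, 'Col0', df2, 'Col2'). Within one df1 row the pairs come out ordered by df2 value rather than by df2 index, and values sitting exactly on the tolerance boundary may be classified differently from np.isclose because of floating-point rounding.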

Join the DataFrames

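Use pd.merge to pull in the rows of df1 and df2 for each index pair, then join the two results on the shared helper index: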

df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)

   Col0_x     Col1_x  Col0_y  Col1_y  Col2       Col3  Col4  Col5  Col6  ...
0  1008.5155  n01     0       0       1007.6176  k13   0     k15   k16   ...
1  1008.5155  n01     0       0       1008.6248  k123  0     k25   k26   ...
2  510.0      n03     0       0       510.0103   k03   0     k05   k06   ...

Do the extra calculations

You'll need to rename a few columns and assign the difference between the two matched columns, but that should be trivial.
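A minimal sketch of that last step. The column mapping is an assumption read off the question's df3 example (df1's Col0 stays in Col0, the absolute difference goes to Col1, and df1's Col1 takes the place of df2's Col4), and the small df_merged built here is only a stand-in for the merged frame shown above:

```python
import pandas as pd

# Stand-in for df_merged, limited to the columns printed above
df_merged = pd.DataFrame({
    "Col0_x": [1008.5155, 1008.5155, 510.0],
    "Col1_x": ["n01", "n01", "n03"],
    "Col0_y": [0, 0, 0],
    "Col1_y": [0, 0, 0],
    "Col2": [1007.6176, 1008.6248, 510.0103],
    "Col3": ["k13", "k123", "k03"],
    "Col4": [0, 0, 0],
    "Col5": ["k15", "k25", "k05"],
})

df3 = (
    df_merged
    # df2's Col0, Col1 and Col4 are overwritten in the assumed target layout
    .drop(columns=["Col0_y", "Col1_y", "Col4"])
    .rename(columns={"Col0_x": "Col0", "Col1_x": "Col4"})
    # Absolute difference between the two matched key values
    .assign(Col1=lambda d: (d["Col0"] - d["Col2"]).abs())
)

# Restore the column order used in the question's df3 example
df3 = df3[["Col0", "Col1", "Col2", "Col3", "Col4", "Col5"]]
```

With the real df_merged, the remaining df2 columns (Col6 onwards) would be appended to that final column list.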
