How to merge efficiently two dataframes with a tolerance in python
Goal:
I want to merge two dataframes df1 and df2 with a tolerance in an efficient way using python. df1 has shape (l, 2) and df2 has shape (p, 13), with l < m < p. My target dataframe df3, with shape (m, 13), should contain all matches within the tolerance, not only the closest match.
I want to merge Col0 of df1 with Col2 of df2 with a tolerance "tolerance".
Example:
df1:
Index, Col0, Col1
0, 1008.5155, n01
df2:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 0, 0, 510.0103, k03, 0, k05, k06, ...
1, 0, 0, 1007.6176, k13, 0, k15, k16, ...
2, 0, 0, 1008.6248, k123, 0, k25, k26, ...
df3:
Index, Col0, Col1, Col2, Col3, Col4, Col5, Col6, ...
0, 1008.5155, 0.8979, 1007.6176, k03, n01, k05, k06, ...
1, 1008.5155, 0.1093, 1008.6248, k13, n01, k15, k16, ...
To visualize: Col1 of df3 gives the difference between the respective values of df1 and df2. Hence, it has to be smaller than the tolerance.
My current solution takes a lot of time and requires a lot of memory.
# Create empty array to collect matches
df3_list = []
df3_array = np.asarray(df3_list)
# Nested loops to find matches; each match fills df3_row
df3_row = np.asarray([0.0, 0.0, 0.0, 0.0, 0.0, 0, 0, 0, 0, 0, 0, 0, 0])
for n in range(len(df1)):
    for k in range(len(df2)):
        if abs(df1.iloc[n, 0] - df2.iloc[k, 2]) < tolerance:
            df3_row[0] = df1.iloc[n, 0]
            df3_row[1] = abs(df1.iloc[n, 0] - df2.iloc[k, 2])
            df3_row[2] = df2.iloc[k, 2]
            df3_row[3] = df2.iloc[k, 3]
            df3_row[4] = df1.iloc[n, 1]
            df3_row[5] = df2.iloc[k, 5]
            ...
            df3_array = np.append(df3_array, df3_row)
# Convert the flat array into a dataframe
df3 = pd.DataFrame(df3_array.T.reshape(-1, 13), columns=header)
I have also tried to get both indices in one go with
[[n, k] for n, k in zip(range(len(df1)), range(len(df2))) if abs(df1.iloc[n,0]-df2.iloc[k,2]) < tolerance]
However, it only gives me an empty list, so I am doing something wrong.
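As an aside, the comprehension above comes back empty because zip walks both ranges in lockstep and only tests the pairs (0, 0), (1, 1), …; nesting the two loops tests every combination. A minimal sketch, with made-up frames mirroring the example data:

```python
import pandas as pd

tolerance = 1.0

# Hypothetical small frames mirroring the example data
df1 = pd.DataFrame({'Col0': [1008.5155], 'Col1': ['n01']})
df2 = pd.DataFrame({'Col0': [0, 0, 0], 'Col1': [0, 0, 0],
                    'Col2': [510.0103, 1007.6176, 1008.6248]})

# Nested comprehension: tests the full cross product of row indices
pairs = [[n, k]
         for n in range(len(df1))
         for k in range(len(df2))
         if abs(df1.iloc[n, 0] - df2.iloc[k, 2]) < tolerance]
# pairs -> [[0, 1], [0, 2]]
```

This still has the same quadratic cost as the explicit loops, so it only fixes the correctness, not the speed.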
For the respective arrays, I have also tried to use
np.nonzero(np.isclose(df2_array[:, 2], df1_array[:,:,None], atol=tolerance))[-1]
However, np.isclose + np.nonzero only got me indices of df2 and also many more than with my loop-intensive approach. Without the corresponding indices of df1, I am kind of lost. I think this last approach is the most promising, yet I seem unable to merge the data set because the values are no exact match and because the closest match is not always the correct solution. Any ideas how to overcome this problem?
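One way to get both index sets out of the broadcasting approach is to broadcast df1's values as a column vector against df2's values as a row vector and keep both arrays that np.nonzero returns, instead of only the last one. A sketch, using hypothetical 1-D arrays standing in for df1['Col0'] and df2['Col2']:

```python
import numpy as np

tolerance = 1.0

# Hypothetical value arrays standing in for df1['Col0'] and df2['Col2']
a = np.array([1008.5155, 510.0])
b = np.array([510.0103, 1007.6176, 1008.6248])

# Broadcasting (l, 1) against (1, p) yields an (l, p) boolean matrix;
# np.nonzero on it returns BOTH index arrays at once
i, j = np.nonzero(np.isclose(a[:, None], b[None, :], atol=tolerance))
# i indexes rows of df1, j indexes rows of df2
pairs = list(zip(i.tolist(), j.tolist()))
# pairs -> [(0, 1), (0, 2), (1, 0)]
```

Note that np.isclose also applies a relative tolerance (rtol, default 1e-05) on top of atol, which for values around 1000 adds roughly 0.01 to the effective tolerance.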
You need to divide this problem into parts:
- Find the corresponding close indices
- Join the DataFrames on those indices
- Do your extra calculations
Find the indices
Using np.isclose, this is a very simple generator function which, for each row of df1, yields a DataFrame containing the indices of df1 and df2 that are close.
def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        s = pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
        yield s
Then we can easily concatenate these to get a helper DataFrame containing the matching index pairs.
df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)
To test this, I added a second record to df1:
df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
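In case it helps to reproduce, that string can be parsed straight into a DataFrame; this step isn't shown above, so the exact read options are an assumption:

```python
import io
import pandas as pd

df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''

# skipinitialspace drops the blank after each comma;
# the 'Index' column becomes the DataFrame index
df1 = pd.read_csv(io.StringIO(df1_str), skipinitialspace=True, index_col='Index')
```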
   idx1  idx2
0     0     1
1     0     2
2     1     0
Join the DataFrames
using pd.merge
df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)
      Col0_x Col1_x  Col0_y  Col1_y       Col2  Col3 Col4 Col5 Col6 ...
0  1008.5155    n01       0       0  1007.6176   k13    0  k15  k16 ...
1  1008.5155    n01       0       0  1008.6248  k123    0  k25  k26 ...
2      510.0    n03       0       0   510.0103   k03    0  k05  k06 ...
Do the extra calculations
You'll need to rename a few columns and assign the difference between them, but that should be trivial.
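A sketch of that last step, assuming the suffixed column names produced by the merge above (the exact target layout is up to you):

```python
import pandas as pd

# Hypothetical slice of df_merged from the previous step
df_merged = pd.DataFrame({
    'Col0_x': [1008.5155, 1008.5155],
    'Col1_x': ['n01', 'n01'],
    'Col2': [1007.6176, 1008.6248],
})

# Rename the suffixed columns and assign the absolute difference,
# which lands in Col1 as in the example df3
df3 = (df_merged
       .rename(columns={'Col0_x': 'Col0', 'Col1_x': 'Col4'})
       .assign(Col1=lambda d: (d['Col0'] - d['Col2']).abs()))
# df3['Col1'] is approximately [0.8979, 0.1093]
```

A final check that every row of df3 satisfies `df3['Col1'] < tolerance` is cheap insurance against the rtol effect of np.isclose mentioned above.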