Merge/Join 2 DataFrames by complex criteria
Problem description
I have 2 large datasets (70K to 110K items each). I want to correlate/compare both and find which items from set2 can be found in set1 based on some conditions/criteria.
My current strategy is to sort both lists by common fields and then run nested for loops, perform conditional if tests, and aggregate a predefined dict with the items that were found and those that did not match.
Example:
    import pandas as pd

    list1 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
             {'a': 31, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
             {'a': 70, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
             {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},]

    list2 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
             {'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
             {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
             {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
             {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
             {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
             {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
             {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},
             {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
             {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}]

    re_dict = dict([('found', []), ('alien', [])])

    for L2 in list2:
        for L1 in list1:
            # the upper bound compares L2['a'] with itself, so only the lower bound filters
            if (L1['a']-5 <= L2['a'] <= L2['a']+10) and L2['c'][-1:] in L1['c'][-1:]:
                if (65 <= L2['e'] <= 75):
                    L2.update({'e': 'some value'})
                re_dict['found'].append(L2)
                list1.remove(L1)
                break  # break out from the inner loop
        else:  # if the inner loop traversed entire list, there were no matches
            re_dict['alien'].append(L2)
The above yields the desired results:
    re_dict
    {'alien': [{'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
               {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
               {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
               {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
               {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
               {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}],
     'found': [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 'some value'},
               {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
               {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
               {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 'some value'}]}
So it does the job, but it is obviously not very efficient and seems like an ideal job for pandas.
I think it would be ideal if I could merge/join two DataFrames, but I can't figure out how to merge on a complex criterion. Also, my datasets are not equal in size.
Example:
    df1 = pd.DataFrame(list1)
    df2 = pd.DataFrame(list2)
    pd.merge(df1,df2,on='d',how='outer')

       a_x  b_x  c_x    d  e_x  a_y  b_y  c_y  e_y
    0   56   38   11   10   65   56   38   11   65
    1   31   12   26   99   71   41   12   26   71
    2   70   49   40  227    1   80   49   40    1
    3    3   85   32   46   70    3   85   32   70
    4  NaN  NaN  NaN   84  NaN  145  108  123    3
    5  NaN  NaN  NaN    7  NaN  113  144  183   12
    6  NaN  NaN  NaN  106  NaN  144   60   46  148
    7  NaN  NaN  NaN   95  NaN   57   87   51  187
    8  NaN  NaN  NaN   15  NaN  107   95   81   25
    9  NaN  NaN  NaN   28  NaN  138   97   38  171
It merges only when, say, the d column is exactly equal in both df1 and df2. What I would prefer is to be able to define, let's say, a range: if df2['d']-5 <= df1['d'] <= df2['d']+5, it's still OK, meaning that these rows in both dataframes are candidates to be merged. Only if the test fails should the df1 columns be filled with NaN (like in the above example).
This way, in several steps, I could mimic my nested for loops, and hopefully that would be quicker?
Any suggestion/hint/example would be greatly appreciated.
Thanks
Solution
pandas currently lacks direct support for "nearby" queries, though I have a pull request up to add some basic functionality (not enough for your use-case).
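For readers on later pandas releases: there is now a built-in nearest-key join, pd.merge_asof, which may cover the simple single-column case. The following is only a sketch under assumptions: the frames are small made-up samples modelled on the question's data, "d" has been converted to integers (the question stores it as strings), and the tolerance of 5 mirrors the window from the question.

```python
import pandas as pd

# Made-up sample frames; "d" is numeric here because merge_asof
# needs a numeric (or datetime) key.
df1 = pd.DataFrame({"a": [56, 31, 70, 3], "d": [10, 99, 227, 46]})
df2 = pd.DataFrame({"a": [56, 145, 113, 41], "d": [10, 84, 7, 99]})

# Both inputs must be sorted by their key column.
left = df2.sort_values("d").reset_index(drop=True)
right = df1.sort_values("d").rename(columns={"d": "d_df1"})

merged = pd.merge_asof(
    left,
    right,
    left_on="d",
    right_on="d_df1",
    direction="nearest",        # closest key on either side
    tolerance=5,                # only accept keys within 5 of each other
    suffixes=("_df2", "_df1"),  # applied to the overlapping column "a"
)
print(merged)
```

Rows of df2 without a df1 key inside the tolerance keep NaN in the df1 columns. This only handles one sorted numeric key, so it does not replace a general multi-column approach.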
Fortunately, the scientific Python ecosystem gives you the tools you need to do this yourself.
The efficient way to join on nearby locations is to use a tree data structure, as described nicely in the scikit-learn documentation. Both SciPy and scikit-learn have suitable KDTree implementations.
It's not easy (or efficient) to use entirely ad-hoc rules, but you can do nearest neighbor lookups efficiently as long as you have a well defined distance metric. I believe scikit-learn's KDTree even lets you define your own distance metric, but we'll stick to normal Euclidean distance to continue your example:
    from scipy.spatial import cKDTree as KDTree
    import pandas as pd

    # for each row in df2, we want to join the nearest row in df1
    # based on the column "d"
    join_cols = ['d']
    tree = KDTree(df1[join_cols])
    distance, indices = tree.query(df2[join_cols])
    df1_near_2 = df1.take(indices).reset_index(drop=True)

    left = df1_near_2.rename(columns=lambda l: 'x_' + l)
    right = df2.rename(columns=lambda l: 'y_' + l)
    merged = pd.concat([left, right], axis=1)
This results in:
       x_a  x_b  x_c  x_d  x_e  y_a  y_b  y_c  y_d  y_e
    0   56   38   11   10   65   56   38   11   10   65
    1   31   12   26   99   71  145  108  123   84    3
    2   56   38   11   10   65  113  144  183    7   12
    3   31   12   26   99   71  144   60   46  106  148
    4   31   12   26   99   71   57   87   51   95  187
    5   31   12   26   99   71   41   12   26   99   71
    6   70   49   40  227    1   80   49   40  227    1
    7    3   85   32   46   70    3   85   32   46   70
    8   56   38   11   10   65  107   95   81   15   25
    9   56   38   11   10   65  138   97   38   28  171
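Note that the nearest-neighbour join always returns a partner row, however far away it is. To get closer to the window from the question (df1 columns left as NaN when nothing is near enough), one option is the distance_upper_bound argument of cKDTree.query. This is a rough sketch, not a drop-in replacement: the sample frames are made up, "d" is assumed to be numeric, and max_dist is a hypothetical tolerance.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Made-up sample frames with a numeric "d" column.
df1 = pd.DataFrame({"a": [56, 31, 70, 3], "d": [10, 99, 227, 46]})
df2 = pd.DataFrame({"a": [56, 145, 113, 41], "d": [10, 84, 7, 99]})

join_cols = ["d"]
max_dist = 5.5  # hypothetical tolerance, set a little above 5

tree = cKDTree(df1[join_cols].to_numpy(dtype=float))
distance, indices = tree.query(
    df2[join_cols].to_numpy(dtype=float), distance_upper_bound=max_dist
)

# Queries with no neighbour in range come back with an infinite distance
# and an index equal to the number of points in the tree.
found = np.isfinite(distance)

# Use row 0 as a placeholder for the misses, then blank those rows with NaN.
safe_idx = np.where(found, indices, 0)
left = df1.iloc[safe_idx].reset_index(drop=True).astype(float)
left.loc[~found] = np.nan

merged = pd.concat([left.add_prefix("x_"), df2.add_prefix("y_")], axis=1)
print(merged)
```

In this sample, the df2 row with d = 84 has no df1 row within range, so its x_ columns are NaN while the other rows are paired with their nearest df1 row.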
If you want to merge based on nearness for multiple columns, it's as simple as setting join_cols = ['d', 'e', 'f'].
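One thing to watch with several columns is that plain Euclidean distance lets the column with the largest scale dominate. A possible workaround, sketched here under assumptions, is to divide each column by its own tolerance and query with the max-norm (p=np.inf), so that a match means every column is within its tolerance. The frames, the columns "d" and "e", and the tolerances in tol are all made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# Made-up sample frames with numeric join columns.
df1 = pd.DataFrame({"d": [10, 99, 227, 46], "e": [65, 71, 1, 70]})
df2 = pd.DataFrame({"d": [12, 99, 225, 46], "e": [66, 90, 2, 70]})

join_cols = ["d", "e"]
# Hypothetical per-column tolerances: d within 5, e within 3.
tol = np.array([5.0, 3.0])

# After scaling, "every column within its tolerance" is the same as
# "max-norm (Chebyshev) distance of at most 1".
pts1 = df1[join_cols].to_numpy(dtype=float) / tol
pts2 = df2[join_cols].to_numpy(dtype=float) / tol

tree = cKDTree(pts1)
# p=np.inf selects the max-norm; the bound sits just above 1.
distance, indices = tree.query(pts2, p=np.inf, distance_upper_bound=1.0 + 1e-9)

matched = np.isfinite(distance)
# Position of the matching df1 row, or -1 where nothing is in range.
df2_matches = df2.assign(df1_row=np.where(matched, indices, -1))
print(df2_matches)
```

Here the second df2 row has a matching d but an e value far outside its tolerance, so it is reported as unmatched even though one column agrees exactly.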