Merge/Join 2 DataFrames by complex criteria


Question


I have 2 large datasets (roughly 70K to 110K items each). I want to correlate/compare both and find which items from set2 can be found in set1 based on some conditions/criteria.

My current strategy is to sort both lists by common fields, then run nested for loops, perform conditional tests, and aggregate a predefined dict with the items that were found and those that did not match.

Example:

import pandas as pd

list1 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 31, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 70, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},]
list2 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
         {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
         {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
         {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
         {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},
         {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
         {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}]

re_dict = dict([('found', []), ('alien', [])])

for L2 in list2:
    for L1 in list1:
        if (L1['a']-5 <= L2['a'] <= L2['a']+10) and L2['c'][-1:] in L1['c'][-1:]:
            if (65 <= L2['e'] <= 75):
                L2.update({'e': 'some value'})
            re_dict['found'].append(L2)
            list1.remove(L1)
            break # break out from the inner loop
    else: # if the inner loop traversed entire list, there were no matches
        re_dict['alien'].append(L2)

Above yields desired results:

re_dict
{'alien': [{'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
  {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
  {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
  {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
  {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
  {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}],
 'found': [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 'some value'},
  {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
  {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
  {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 'some value'}]}

So it does the job, but it is obviously not very efficient and seems like an ideal job for pandas.

I think it would be ideal if I could merge/join the two DataFrames, but I can't figure out how to merge on a complex criterion. Also, my datasets are not equal in size.

Example:

df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)

pd.merge(df1,df2,on='d',how='outer')
   a_x  b_x  c_x    d  e_x  a_y  b_y  c_y  e_y
0   56   38   11   10   65   56   38   11   65
1   31   12   26   99   71   41   12   26   71
2   70   49   40  227    1   80   49   40    1
3    3   85   32   46   70    3   85   32   70
4  NaN  NaN  NaN   84  NaN  145  108  123    3
5  NaN  NaN  NaN    7  NaN  113  144  183   12
6  NaN  NaN  NaN  106  NaN  144   60   46  148
7  NaN  NaN  NaN   95  NaN   57   87   51  187
8  NaN  NaN  NaN   15  NaN  107   95   81   25
9  NaN  NaN  NaN   28  NaN  138   97   38  171

It merges only when, say, the d column is exactly equal in both df1 and df2. What I would prefer is to be able to define a range: if df2['d']-5 <= df1['d'] <= df2['d']+5, it is still OK, meaning these lines in both dataframes are candidates to be merged; only if the test fails are the df1 columns filled with NaN (as in the example above).
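As an aside, newer pandas versions offer pd.merge_asof, which does exactly this kind of tolerance-based nearest-key merge. A minimal sketch with made-up toy frames (the column names e1/e2 and the data are illustrative, not from the question):

```python
import pandas as pd

# toy frames keyed on an integer column "d" (values are illustrative)
df1 = pd.DataFrame({'d': [10, 46, 99, 227], 'e1': [65, 70, 71, 1]})
df2 = pd.DataFrame({'d': [7, 10, 84, 106], 'e2': [12, 65, 3, 148]})

# match each df2 row to the nearest df1 key within +/- 5;
# both frames must be sorted on the key column
merged = pd.merge_asof(df2.sort_values('d'), df1.sort_values('d'),
                       on='d', direction='nearest', tolerance=5)
# d=7 and d=10 pick up e1=65 from the df1 row with d=10;
# d=84 and d=106 have no df1 key within 5, so e1 is NaN there
```

This handles only a single sortable key, so it does not cover the full multi-condition test from the loop, but for a pure "within a range on one column" merge it avoids trees entirely.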

This way, in several steps, I could mimic my nested for loops, and hopefully that would be quicker?

Any suggestion/hint/example would be greatly appreciated.

Thanks

Solution

pandas currently lacks direct support for "nearby" queries, though I have a pull request up to add some basic functionality (not enough for your use-case).

Fortunately, the scientific Python ecosystem gives you the tools you need to do this yourself.

The efficient way to join on nearby locations is to use a tree data structure, as described nicely in the scikit-learn documentation. Both SciPy and scikit-learn have suitable KDTree implementations.

It's not easy (or efficient) to use entirely ad-hoc rules, but you can do nearest-neighbor lookups efficiently as long as you have a well-defined distance metric. I believe scikit-learn's KDTree even lets you define your own distance metric, but we'll stick to normal Euclidean distance to continue your example:

from scipy.spatial import cKDTree as KDTree
import pandas as pd

# for each row in df2, we want to join the nearest row in df1
# based on the column "d"
join_cols = ['d']
# column "d" holds strings in the example data, so cast to float
# before building the tree -- cKDTree needs numeric input
tree = KDTree(df1[join_cols].astype(float))
distance, indices = tree.query(df2[join_cols].astype(float))
df1_near_2 = df1.take(indices).reset_index(drop=True)

left = df1_near_2.rename(columns=lambda l: 'x_' + l)
right = df2.rename(columns=lambda l: 'y_' + l)
merged = pd.concat([left, right], axis=1)

This results in:

   x_a x_b x_c  x_d  x_e  y_a  y_b  y_c  y_d  y_e
0   56  38  11   10   65   56   38   11   10   65
1   31  12  26   99   71  145  108  123   84    3
2   56  38  11   10   65  113  144  183    7   12
3   31  12  26   99   71  144   60   46  106  148
4   31  12  26   99   71   57   87   51   95  187
5   31  12  26   99   71   41   12   26   99   71
6   70  49  40  227    1   80   49   40  227    1
7    3  85  32   46   70    3   85   32   46   70
8   56  38  11   10   65  107   95   81   15   25
9   56  38  11   10   65  138   97   38   28  171

If you want to merge based on nearness for multiple columns, it's as simple as setting join_cols = ['d', 'e', 'f'].
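One more detail worth noting: tree.query also returns the distance to each nearest neighbor, so you can emulate the found/alien split from the original loop by rejecting matches beyond a cutoff. A minimal sketch with illustrative data (the cutoff value 5.0 is an assumption, not from the question):

```python
import pandas as pd
from scipy.spatial import cKDTree as KDTree

df1 = pd.DataFrame({'d': [10.0, 46.0, 99.0, 227.0]})
df2 = pd.DataFrame({'d': [10.0, 84.0, 7.0, 106.0, 95.0]})

tree = KDTree(df1[['d']])
distance, indices = tree.query(df2[['d']])

# df2 rows whose nearest df1 row lies beyond the cutoff have no
# acceptable match -- roughly the "alien" bucket from the question
cutoff = 5.0
found = df2[distance <= cutoff]
alien = df2[distance > cutoff]
```

The Euclidean cutoff is cruder than the per-column if tests in the loop, but it gives the same found-versus-unmatched partition in one vectorized pass.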

