What's a more efficient way to merge rows from DataFrames row-by-row with conditions?


Question


I'm joining two tables with data from two systems. A simple Pandas merge between the two dfs won't honor more complex rules (unless I'm using it wrong and don't understand the process merge implements--very possible).

I've cobbled together a toy solution that lets me unpack two df's with itertuples, validate matches based on values, and then repack into one dataframe:

df1:            df2:
   A   X           B   Y
0  1  10        0  2  10
1  5  15        1  4  15
                2  6  15


import pandas as pd

data1 = [(1, 10), (5, 15)]
data2 = [(2, 10), (4, 15), (6, 15)]

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])
i = -1

for rowA in df1.itertuples(index=False):
    i += 1
    for rowB in df2.itertuples(index=False):
        A, X = rowA
        B, Y = rowB
        if (B > A) & (X == Y):
            df3[i] = list(rowA + rowB)  # was "rowb" (NameError); note later matches overwrite earlier ones

print(df3.transpose())


   A   X  B   Y
0  1  10  2  10
1  5  15  6  15

My naive approach is inefficient

The nested for() loop is inefficient because I'm iterating over data2/df2 for each entry of data1. Once I get a good match with data2/df2, the row should be removed.
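One way to sketch that elimination step (this is an editor's sketch, not the poster's code): keep df2's rows in a mutable pool and delete a row once it is claimed, so later scans are shorter and no row can be matched twice. Note this takes the *first* qualifying match and breaks, whereas the nested loop above keeps overwriting with the last one.

```python
import pandas as pd

df1 = pd.DataFrame([(1, 10), (5, 15)], columns=['A', 'X'])
df2 = pd.DataFrame([(2, 10), (4, 15), (6, 15)], columns=['B', 'Y'])

# Pool of candidate rows from df2; a claimed row is removed so it
# can never be paired twice and later scans get shorter.
candidates = list(df2.itertuples(index=False))

matched = []
for A, X in df1.itertuples(index=False):
    for j, cand in enumerate(candidates):
        B, Y = cand
        if B > A and X == Y:
            matched.append((A, X, B, Y))
            del candidates[j]   # process of elimination
            break               # stop scanning after the first hit

df3 = pd.DataFrame(matched, columns=['A', 'X', 'B', 'Y'])
print(df3)
```

On the toy data this reproduces the two-row result shown above, and one unmatched df2 row is left in the pool.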

//UPDATE (show the origin of my question)

An example of the type of data I'm working with merges two independent systems which do not share any keys or other serialized IDs. Since I can't make an exact match, I must rely on logical/arithmetic operations and the process of elimination.

In the following example a simple pandas.merge fails on Line 3, because Time1 < Time2.

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 1
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:16:00,  20.00, 1
[...]
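A plain equi-join shows why an ordinary merge can't express this (a sketch on cut-down frames modeled on the table above; the exact values are assumptions): every 20.00 on the left pairs with every 20.00 on the right, with no way to state the time rule.

```python
import pandas as pd

df1 = pd.DataFrame({'Time1': pd.to_datetime(['2017-02-19 08:03:00',
                                             '2017-02-19 08:40:00',
                                             '2017-02-19 10:08:00']),
                    'Total1': [15.0, 20.0, 20.0]})
df2 = pd.DataFrame({'Time2': pd.to_datetime(['2017-02-19 08:02:00',
                                             '2017-02-19 10:06:00']),
                    'Total2': [15.0, 20.0]})

# Equi-join on the totals alone: both 20.00 rows on the left claim
# the single 10:06 row on the right, producing a cartesian blow-up.
m = pd.merge(df1, df2, left_on='Total1', right_on='Total2')
print(m)
```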

What should happen is something like this:

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... NaN,                  NaN,   NaN
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 0
[...]


// UPDATE2

I've worked through several permutations of the merge_asof() and join() approaches recommended in the answers. Each method was also sorted as directed by the docs. Assuming I've implemented each correctly, the following percentages are True matches of the rule ((time1 >= time2) & (Total1 == Total2)) out of 53 records in my test set, using each of three methods:

| type                      | 'date' | 'total' | both   |
|---------------------------|--------|---------|--------|
| merge_asof sort (time)    | .7924  | .9245   | .7169  |
| merge_asof (time, total)  | .7735  | .6981   | .6226  |
| itertuples (time, total)  | .8301  | .8301   | .8301  |
| join ind (time)           | na     | na      | na     |

The join requires a shared key, right? The on clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns given, the passed DataFrame must have a MultiIndex."

I tried join with a multi-index of (time,total) and just (time). The problem is, the join clobbers whatever you join on. There's nothing left to perform the error analysis on because those indexes are merged into one.

My naive itertuples solution (above) produced only perfect matches, but the solution still needs a collector for missed matches.
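That collector can be bolted onto the pool-of-candidates loop (an editor's sketch with made-up values, not the poster's code): when no candidate qualifies, emit the left row with NaN placeholders instead of dropping it, mirroring the NaN row in the desired output above.

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame([(1, 10), (5, 15), (9, 20)], columns=['A', 'X'])
df2 = pd.DataFrame([(2, 10), (6, 15)], columns=['B', 'Y'])

candidates = list(df2.itertuples(index=False))
rows = []
for A, X in df1.itertuples(index=False):
    # First candidate satisfying the rule, or None if there is no match.
    hit = next((c for c in candidates if c.B > A and c.Y == X), None)
    if hit is not None:
        candidates.remove(hit)               # eliminate the claimed row
        rows.append((A, X, hit.B, hit.Y))
    else:
        rows.append((A, X, np.nan, np.nan))  # collect the miss as NaN

df3 = pd.DataFrame(rows, columns=['A', 'X', 'B', 'Y'])
print(df3)
```

The third left row has no qualifying partner, so it survives as (9, 20, NaN, NaN) rather than silently disappearing.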

Solution

If I'm understanding your logic correctly, this should do it:

import pandas as pd

time1 = pd.to_datetime(['2/19/17 8:03:00', '2/19/17 8:28:00', '2/19/17 8:40:00', '2/19/17 10:08:00'])
time2 = pd.to_datetime(['2/19/17 8:02:00', '2/19/17 8:27:00', '2/19/17 10:06:00', '2/19/17 10:16:00'])

df1 = pd.DataFrame({'Time1': time1, 'Total1': [15.00, 33.00, 20.00, 20.00]})
df2 = pd.DataFrame({'Time2': time2, 'Total2': [15.00, 33.00, 20.00, 20.00], 'error': [0, 0, 1, 1]})

# Match each Time1 to the nearest earlier-or-equal Time2, then null out
# any row whose match was already claimed by the row above it.
df3 = pd.merge_asof(df1, df2, left_on='Time1', right_on='Time2')
df3.loc[df3['Time2'].duplicated(), ['Time2', 'Total2', 'error']] = None

Output:

                Time1  Total1               Time2  Total2  error
0 2017-02-19 08:03:00    15.0 2017-02-19 08:02:00    15.0    0.0
1 2017-02-19 08:28:00    33.0 2017-02-19 08:27:00    33.0    0.0
2 2017-02-19 08:40:00    20.0                 NaT     NaN    NaN
3 2017-02-19 10:08:00    20.0 2017-02-19 10:06:00    20.0    1.0
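An alternative worth knowing (not part of the original answer): merge_asof accepts a tolerance parameter, which rejects far-away backward matches up front instead of nulling duplicates afterwards. The 5-minute window below is an assumption; tune it to the real clock skew between the two systems.

```python
import pandas as pd

time1 = pd.to_datetime(['2/19/17 8:03:00', '2/19/17 8:28:00',
                        '2/19/17 8:40:00', '2/19/17 10:08:00'])
time2 = pd.to_datetime(['2/19/17 8:02:00', '2/19/17 8:27:00',
                        '2/19/17 10:06:00', '2/19/17 10:16:00'])
df1 = pd.DataFrame({'Time1': time1, 'Total1': [15.0, 33.0, 20.0, 20.0]})
df2 = pd.DataFrame({'Time2': time2, 'Total2': [15.0, 33.0, 20.0, 20.0],
                    'error': [0, 0, 1, 1]})

# tolerance drops any backward match more than 5 minutes away, so the
# 08:40 row's only candidate (08:27, 13 minutes earlier) becomes NaT
# directly -- no duplicated() cleanup pass needed.
df4 = pd.merge_asof(df1, df2, left_on='Time1', right_on='Time2',
                    tolerance=pd.Timedelta('5min'))
print(df4)
```

On this data it yields the same four rows as the accepted approach, with the third row unmatched.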
