What's a more efficient way to merge rows from DataFrames row-by-row with conditions?
I'm joining two tables with data from two systems. A simple pandas merge between the two DataFrames won't honor more complex rules (unless I'm using it wrong and don't understand the process merge implements, which is very possible).
I've cobbled together a toy solution that lets me unpack two DataFrames with itertuples, validate matches based on values, and then repack them into one DataFrame:
df1:          df2:
   A   X         B   Y
0  1  10      0  2  10
1  5  15      1  4  15
              2  6  15
import pandas as pd

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])

i = -1
for rowA in df1.itertuples(index=False):
    i += 1
    for rowB in df2.itertuples(index=False):
        A, X = rowA
        B, Y = rowB
        if (B > A) & (X == Y):
            df3[i] = list(rowA + rowB)   # was "rowb" (NameError)
        else:
            continue

print(df3.transpose())
   A   X  B   Y
0  1  10  2  10
1  5  15  6  15
My naive approach is inefficient. The nested for loop is inefficient because I iterate over all of data2/df2 for every entry of data1. Once a row of data2/df2 has produced a good match, it should be removed from further consideration.
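One incremental fix (a sketch on the toy data above, not a fully vectorized answer) is to break out of the inner loop on the first hit and drop the matched row from df2 so it can't be consumed twice:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 5], 'X': [10, 15]})
df2 = pd.DataFrame({'B': [2, 4, 6], 'Y': [10, 15, 15]})

rows = []
for rowA in df1.itertuples(index=False):
    for rowB in df2.itertuples():          # index=True: rowB.Index is df2's label
        if (rowB.B > rowA.A) and (rowA.X == rowB.Y):
            rows.append((rowA.A, rowA.X, rowB.B, rowB.Y))
            df2 = df2.drop(rowB.Index)     # consume the match so it can't be reused
            break                          # stop scanning once matched

df3 = pd.DataFrame(rows, columns=['A', 'X', 'B', 'Y'])
```

This still scans df2 per row of df1 in the worst case (O(n*m)), but it at least shrinks the search space as matches are found.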
//UPDATE (show the origin of my question)
An example of the type of data I'm working with merges two independent systems which do not share any keys or other serialized IDs. Since I can't make an exact match, I must rely on logical/arithmetic operations and the process of elimination.
In the following example a simple pandas.merge fails on line 3, because Time1 < Time2.
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 1
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:16:00, 20.00, 1
[...]
What should happen is something like this:
Time1, Total1 ... Time2, Total2, error
1, 2017-02-19 08:03:00, 15.00 ... 2017-02-19 08:02:00, 15.00, 0
2, 2017-02-19 08:28:00, 33.00 ... 2017-02-19 08:27:00, 33.00, 0
3, 2017-02-19 08:40:00, 20.00 ... NaN, NaN, NaN
4, 2017-02-19 10:08:00, 20.00 ... 2017-02-19 10:06:00, 20.00, 0
[...]
// UPDATE2
I've worked through several permutations of merge_asof() and join() as recommended in the answers. Each method was also sorted as directed by the docs. Assuming I've implemented each correctly, the following are the percentages of True matches of the rule ((time1 >= time2) & (Total1 == Total2)), out of 53 records in my test set, using each of three methods:
| type | 'date' | 'total' | both |
|-----------------------|----------|-----------|--------|
| merg_asof sort (time) | .7924 | .9245 | .7169 |
| merg_asof (time,total)| .7735 | .6981 | .6226 |
| intertup (time,total) | .8301 | .8301 | .8301 |
| join ind (time) | na | na | na |
A join requires a shared key, right? The on clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiples columns given, the passed DataFrame must have a MultiIndex."
I tried join with a MultiIndex of (time, total) and with just (time). The problem is that join clobbers whatever you join on: there's nothing left to perform the error analysis on, because those indexes are merged into one.
My naive itertuples solution (above) produced only perfect matches, but it still needs a collector for missed matches.
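For that collector, one hedged option (a sketch on the toy A/X/B/Y data, with an extra unmatched row added for illustration) is to generate exact-match candidates with an ordinary merge, filter them by the arithmetic rule, and then select the df1 rows whose index never appears among the survivors:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 5, 9], 'X': [10, 15, 20]})   # row 2 has no partner
df2 = pd.DataFrame({'B': [2, 4, 6], 'Y': [10, 15, 15]})

# exact-match candidates on X == Y, keeping df1's index as a column
cand = df1.reset_index().merge(df2, left_on='X', right_on='Y')
# enforce the arithmetic half of the rule
good = cand[cand['B'] > cand['A']]
# collector: df1 rows that never produced a valid pair
missed = df1.loc[~df1.index.isin(good['index'])]
```

If one df1 row can satisfy the rule against several df2 rows, good may hold multiple candidates per index; good.drop_duplicates('index') would keep only the first.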
If I'm understanding your logic correctly, this should do it:
import pandas as pd

time1 = pd.to_datetime(['2/19/17 8:03:00', '2/19/17 8:28:00',
                        '2/19/17 8:40:00', '2/19/17 10:08:00'])
time2 = pd.to_datetime(['2/19/17 8:02:00', '2/19/17 8:27:00',
                        '2/19/17 10:06:00', '2/19/17 10:16:00'])
df1 = pd.DataFrame({'Time1': time1, 'Total1': [15.00, 33.00, 20.00, 20.00]})
df2 = pd.DataFrame({'Time2': time2, 'Total2': [15.00, 33.00, 20.00, 20.00],
                    'error': [0, 0, 1, 1]})

# Match each Time1 to the nearest earlier Time2, then blank out
# any right-hand row that was matched more than once.
df3 = pd.merge_asof(df1, df2, left_on='Time1', right_on='Time2')
df3.loc[df3['Time2'].duplicated(), ['Time2', 'Total2', 'error']] = None
Output:
Time1 Total1 Time2 Total2 error
0 2017-02-19 08:03:00 15.0 2017-02-19 08:02:00 15.0 0.0
1 2017-02-19 08:28:00 33.0 2017-02-19 08:27:00 33.0 0.0
2 2017-02-19 08:40:00 20.0 NaT NaN NaN
3 2017-02-19 10:08:00 20.0 2017-02-19 10:06:00 20.0 1.0
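As an alternative to the duplicated() trick, the full rule ((Time1 >= Time2) & (Total1 == Total2)) can be enforced directly: asof-merge on time alone, then blank out rows whose totals disagree. This is a sketch under the assumption that the second system carries no error column of its own:

```python
import pandas as pd

time1 = pd.to_datetime(['2/19/17 8:03:00', '2/19/17 8:28:00',
                        '2/19/17 8:40:00', '2/19/17 10:08:00'])
time2 = pd.to_datetime(['2/19/17 8:02:00', '2/19/17 8:27:00',
                        '2/19/17 10:06:00', '2/19/17 10:16:00'])
df1 = pd.DataFrame({'Time1': time1, 'Total1': [15.0, 33.0, 20.0, 20.0]})
df2 = pd.DataFrame({'Time2': time2, 'Total2': [15.0, 33.0, 20.0, 20.0]})

# merge_asof pairs each Time1 with the nearest earlier Time2 (Time1 >= Time2)
df3 = pd.merge_asof(df1, df2, left_on='Time1', right_on='Time2')

# enforce the second half of the rule: totals must agree
bad = df3['Total1'] != df3['Total2']
df3.loc[bad, ['Time2', 'Total2']] = None
```

Here row 2 (Time1 = 08:40) picks up the 08:27 record with Total2 = 33.0, fails the total check, and ends up NaT/NaN, which is exactly the desired "missed match" behavior.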