How to flag rows of a dataframe by the values of another dataframe in the most efficient way in python/pandas?

Problem description


I have a dataframe A (~500k records) with two columns, from_timestamp and to_timestamp.

I have a dataframe B (~5M records) with some values and a column named actual_timestamp.

I want to flag every row in B whose actual_timestamp falls between the from_timestamp and to_timestamp of any row in A (bounds inclusive).

I want something like the following, but much more efficient:

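B['corrupted_flag'] = False  # start with every row unflagged
for index, row in A.iterrows():
    # rows of B whose timestamp lies inside this [from, to] interval
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .loc replaces .ix, which was removed in pandas 1.0
    B.loc[cond1 & cond2, 'corrupted_flag'] = True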

What is the fastest/most efficient way to do this in python/pandas?

Update: Sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False

Solution

You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check if each timestamp from DataFrame B is in the tree:

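from intervaltree import IntervalTree

# Intervals are half-open [begin, end), so pad the upper bound to keep to_timestamp inside.
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
# overlaps(x) is True when at least one interval in the tree contains the point x.
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))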

Note that you need to pad A['to_timestamp'] slightly, as the upper bound of an interval is not included as part of the interval in the intervaltree package (although the lower bound is).

This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The performance boost got bigger the more rows I added.

I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.
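If you would rather avoid the extra dependency and the padding of upper bounds, one possible alternative (not the intervaltree method above, and not benchmarked here) is to sort the intervals by start and use NumPy's searchsorted. The following is a minimal sketch, assuming inclusive bounds on both sides as in the sample data; the helper name flag_in_intervals is made up for illustration.

```python
import numpy as np
import pandas as pd


def flag_in_intervals(values, starts, ends):
    """Hypothetical helper: True where a value lies in any closed [start, end]."""
    values = np.asarray(values)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    if starts.size == 0:
        return np.zeros(values.shape, dtype=bool)

    # Sort intervals by start and keep the running maximum of their ends.
    order = np.argsort(starts, kind="stable")
    starts = starts[order]
    running_max_end = np.maximum.accumulate(ends[order])

    # For each value, count the intervals whose start is <= the value.
    n_started = np.searchsorted(starts, values, side="right")

    # A value is covered if at least one interval has started and the
    # largest end among the started intervals is >= the value.
    idx = np.maximum(n_started - 1, 0)
    return (n_started > 0) & (running_max_end[idx] >= values)


# Sample data from the question.
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({
    "data": list("abcdefgh"),
    "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12],
})

B["corrupted_flag"] = flag_in_intervals(
    B["actual_timestamp"], A["from_timestamp"], A["to_timestamp"]
)
print(B)
```

On the sample data this reproduces the expected corrupted_flag column. Because the comparison uses >= and <= directly, no padding of to_timestamp is needed, and overlapping intervals such as (6, 9) and (8, 10) are handled by the running maximum.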
