How to efficiently flag rows of a dataframe based on values from another dataframe in python/pandas?
I've got a dataframe "A" (~500k records). It contains two columns: "fromTimestamp" and "toTimestamp".
I've got a dataframe "B" (~5M records). It has some values and a column named "actualTimestamp".
I want to flag every row in dataframe "B" whose "actualTimestamp" value lies between the "fromTimestamp" and "toTimestamp" values of any row in "A" (bounds included).
I want something like this, but much more efficient:
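B['corrupted_flag'] = False  # start with nothing flagged
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .ix was removed in pandas 1.0; .loc is the label-based replacement
    B.loc[cond1 & cond2, 'corrupted_flag'] = True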
What is the fastest/most efficient way to do this in python/pandas?
Update: Sample data
dataframe A (input):
from_timestamp to_timestamp
3 4
6 9
8 10
dataframe B (input):
data actual_timestamp
a 2
b 3
c 4
d 5
e 8
f 10
g 11
h 12
dataframe B (expected output):
data actual_timestamp corrupted_flag
a 2 False
b 3 True
c 4 True
d 5 False
e 8 True
f 10 True
g 11 False
h 12 False
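To make the sample reproducible, here is a minimal sketch that builds the two frames from the tables above and runs the row-by-row baseline from the question. It is only meant as a correctness reference for faster approaches; using `Series.between` (inclusive on both ends by default) in place of the two explicit comparisons is my own shorthand, not part of the original post.

```python
import pandas as pd

# Sample inputs copied from the tables above.
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({
    "data": list("abcdefgh"),
    "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12],
})

# Baseline: one pass over B per interval in A (slow, but easy to verify).
B["corrupted_flag"] = False
for _, row in A.iterrows():
    in_range = B["actual_timestamp"].between(row["from_timestamp"], row["to_timestamp"])
    B.loc[in_range, "corrupted_flag"] = True

print(B)
```

On the sample data this flags rows b, c, e and f, matching the expected output.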
You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check whether each timestamp from DataFrame B is in the tree:
from intervaltree import IntervalTree
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))
Note that you need to pad A['to_timestamp'] slightly, because the intervaltree package treats the upper bound of an interval as exclusive (the lower bound is inclusive).
This method improved performance by a factor of a little more than 14 on some sample data I generated (A = 10k rows, B = 100k rows). The speedup grew as I added more rows.
I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.
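If adding a dependency is not an option, the same closed-interval check can probably be done with NumPy alone. The sketch below is a different technique from the interval tree above, not the answer's method: it sorts the intervals by start, takes a running maximum of the end points, and uses `np.searchsorted` to find, for each timestamp, the last interval that starts at or before it. The helper name `flag_in_intervals` is hypothetical, and I have only reasoned about it for numeric timestamps like those in the sample data.

```python
import numpy as np
import pandas as pd


def flag_in_intervals(timestamps, starts, ends):
    """Return a boolean array: True where a timestamp lies in any closed [start, end]."""
    timestamps = np.asarray(timestamps)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    flags = np.zeros(len(timestamps), dtype=bool)
    if len(starts) == 0:
        return flags

    # Sort intervals by start; the running max of the ends tells us the
    # furthest point covered by any interval starting at or before position i.
    order = np.argsort(starts, kind="stable")
    starts = starts[order]
    covered_until = np.maximum.accumulate(ends[order])

    # Index of the last interval whose start is <= the timestamp (-1 if none).
    idx = np.searchsorted(starts, timestamps, side="right") - 1
    has_start = idx >= 0
    flags[has_start] = timestamps[has_start] <= covered_until[idx[has_start]]
    return flags


# Sample data from the question.
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({
    "data": list("abcdefgh"),
    "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12],
})

B["corrupted_flag"] = flag_in_intervals(
    B["actual_timestamp"].to_numpy(),
    A["from_timestamp"].to_numpy(),
    A["to_timestamp"].to_numpy(),
)
print(B)
```

Because both bounds are compared with `<=`, no padding of the upper bound is needed here. The cost is one sort of A plus one binary search per row of B, so it is worth benchmarking against the interval tree on data of your real size before choosing.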