How to flag rows of a dataframe by the values of another dataframe in the most efficient way in python/pandas?

Views: 123 Posted: 2017/3/26 1:47:07 Tags: python performance select pandas dataframe


Problem description

I have a dataframe A (~500k records). It contains two columns: from_timestamp and to_timestamp.

I have a dataframe B (~5M records). It has some values and a column named actual_timestamp.

I want to flag every row in dataframe B whose actual_timestamp lies between the from_timestamp and to_timestamp of any row in A (both bounds included).

I want something like this, but much more efficient:

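# Start with every row unflagged, then mark rows that fall inside any interval.
B['corrupted_flag'] = False
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .ix was removed in pandas 1.0; .loc accepts the boolean mask directly
    B.loc[cond1 & cond2, 'corrupted_flag'] = True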

What is the fastest/most efficient way to do this in python/pandas?

Update: Sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False
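
The expected output above can be reproduced with a small reference check. This is only a sketch in plain Python, with the sample data copied into lists; the names `intervals`, `actual_timestamps` and `is_corrupted` are made up for illustration. It makes the intended semantics explicit: both bounds are inclusive, and a timestamp is flagged if it falls inside at least one interval.

```python
# Sample data from the question, as plain Python lists.
intervals = [(3, 4), (6, 9), (8, 10)]  # (from_timestamp, to_timestamp) rows of A
actual_timestamps = [2, 3, 4, 5, 8, 10, 11, 12]  # actual_timestamp column of B


def is_corrupted(ts, intervals):
    """Return True if ts lies inside any interval, both bounds inclusive."""
    return any(start <= ts <= end for start, end in intervals)


flags = [is_corrupted(ts, intervals) for ts in actual_timestamps]
print(flags)
# [False, True, True, False, True, True, False, False]
```

This brute-force check compares every timestamp against every interval, so it is only useful for validating a faster approach on small inputs.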

Solution

You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check if each timestamp from DataFrame B is in the tree:

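from intervaltree import IntervalTree

# Build the tree from A's intervals; the upper bound is padded because
# intervaltree excludes the upper bound of each interval.
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
# Flag each timestamp in B that falls inside at least one interval.
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))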

Note that you need to pad A['to_timestamp'] slightly, as the upper bound of an interval is not included as part of the interval in the intervaltree package (although the lower bound is).
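
To see why the padding matters, here is a plain-Python model of a half-open interval. It is not the intervaltree API, just a hypothetical helper (`half_open_contains`) that mimics the "lower bound included, upper bound excluded" rule described above.

```python
def half_open_contains(begin, end, point):
    """Model of a half-open interval [begin, end): begin included, end excluded."""
    return begin <= point < end


# Interval (3, 4) from the sample data, queried at its own bounds.
lower_hit = half_open_contains(3, 4, 3)            # lower bound is included
upper_unpadded = half_open_contains(3, 4, 4)       # upper bound is missed
upper_padded = half_open_contains(3, 4 + 0.1, 4)   # padding brings it back in

# Side effect of padding: values just above the real upper bound match too.
just_above = half_open_contains(3, 4 + 0.1, 4.05)

print(lower_hit, upper_unpadded, upper_padded, just_above)
# True False True True
```

The last case suggests keeping the padding smaller than the resolution of the timestamps, so that no real value can fall between the true upper bound and the padded one. With the integer sample data, 0.1 is safe.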

This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The performance boost got bigger the more rows I added.

I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.
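
If choosing a padding value is awkward, one possible alternative is to merge the overlapping intervals once and then binary-search each timestamp. To be clear, this is a different technique from the intervaltree approach in this answer, sketched here with only the standard library; `merge_intervals` and `make_flagger` are hypothetical names. It treats both bounds as inclusive, so no padding is needed, and it works for any comparable type, including `datetime` objects.

```python
from bisect import bisect_right
from datetime import datetime


def merge_intervals(intervals):
    """Merge overlapping closed intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def make_flagger(intervals):
    """Return a function that tests membership in any interval (bounds inclusive)."""
    merged = merge_intervals(intervals)
    starts = [s for s, _ in merged]
    ends = [e for _, e in merged]

    def is_flagged(ts):
        # Index of the last merged interval that starts at or before ts.
        i = bisect_right(starts, ts) - 1
        return i >= 0 and ts <= ends[i]

    return is_flagged


# Integer sample data from the question.
is_flagged = make_flagger([(3, 4), (6, 9), (8, 10)])
int_flags = [is_flagged(ts) for ts in [2, 3, 4, 5, 8, 10, 11, 12]]
print(int_flags)
# [False, True, True, False, True, True, False, False]

# The same code with datetime bounds (made-up values for illustration).
is_flagged_dt = make_flagger([(datetime(2017, 3, 26, 1, 0), datetime(2017, 3, 26, 2, 0))])
print(is_flagged_dt(datetime(2017, 3, 26, 2, 0)))     # True, upper bound included
print(is_flagged_dt(datetime(2017, 3, 26, 2, 0, 1)))  # False, one second past it
```

In pandas this could presumably be applied as `B['actual_timestamp'].map(is_flagged)`; it has not been benchmarked against the intervaltree version.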
