Pandas: select DF rows based on another DF


Problem description

I've got two dataframes (very long, with hundreds or thousands of rows each). One of them, called df1, contains a timeseries, in intervals of 10 minutes. For example:


               date          value
2016-11-24 00:00:00    1759.199951
2016-11-24 00:10:00     992.400024
2016-11-24 00:20:00    1404.800049
2016-11-24 00:30:00      45.799999
2016-11-24 00:40:00      24.299999
2016-11-24 00:50:00     159.899994
2016-11-24 01:00:00      82.499999
2016-11-24 01:10:00      37.400003
2016-11-24 01:20:00     159.899994
....

And the other one, df2, contains datetime intervals:


              start_date             end_date
0    2016-11-23 23:55:32  2016-11-24 00:14:03
1    2016-11-24 01:03:18  2016-11-24 01:07:12
2    2016-11-24 01:11:32  2016-11-24 02:00:00 
...

I need to select all the rows in df1 that "fall" into an interval in df2.

With these examples, the result dataframe should be:


               date          value
2016-11-24 00:00:00    1759.199951   # Fits in row 0 of df2
2016-11-24 00:10:00     992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00      82.499999   # Fits in row 1 of df2
2016-11-24 01:10:00      37.400003   # Fits in row 2 of df2
2016-11-24 01:20:00     159.899994   # Fits in row 2 of df2
....
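
For anyone who wants to reproduce the question, the sample rows above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the listings in the question, and it assumes both date columns are meant to be true datetime columns rather than strings.

```python
import pandas as pd

# Sample rows copied from the question; the real frames are much longer.
df1 = pd.DataFrame({
    'date': pd.date_range('2016-11-24 00:00:00', periods=9, freq='10min'),
    'value': [1759.199951, 992.400024, 1404.800049, 45.799999, 24.299999,
              159.899994, 82.499999, 37.400003, 159.899994],
})

# Assumption: start_date and end_date are parsed to datetime64 columns.
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32',
                                  '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date': pd.to_datetime(['2016-11-24 00:14:03',
                                '2016-11-24 01:07:12',
                                '2016-11-24 02:00:00']),
})
```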

Recommended answer

Using np.searchsorted:

Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.

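import numpy as np
import pandas as pd

# Ensure df2 is sorted (skip if it's already known to be). Resetting the index
# keeps the row labels aligned with the positions returned by searchsorted.
df2 = df2.sort_values(by=['start_date', 'end_date']).reset_index(drop=True)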

# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
    ]

# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
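
To see what the lookup step above is doing, here is a small isolated sketch. The `starts` values are the start dates from the question; the `dates` values are picked for illustration only. With `side='right'`, subtracting 1 gives the position of the last interval whose start is at or before each timestamp.

```python
import numpy as np
import pandas as pd

# Sorted interval starts (taken from df2 in the question).
starts = pd.Series(pd.to_datetime(['2016-11-23 23:55:32',
                                   '2016-11-24 01:03:18',
                                   '2016-11-24 01:11:32']))

# Illustrative timestamps to look up.
dates = pd.Series(pd.to_datetime(['2016-11-23 23:00:00',
                                  '2016-11-24 00:00:00',
                                  '2016-11-24 01:10:00',
                                  '2016-11-24 01:20:00']))

# Position of the last start <= each date; -1 means no interval has started yet.
pos = np.searchsorted(starts.values, dates.values, side='right') - 1
print(pos.tolist())  # [-1, 0, 1, 2]
```

A position of -1 is not a label in df2's index, so `reindex` returns a row of missing values for it and the `<=` comparisons against that row come out False. Note that 01:10:00 maps to interval 1, which has already ended by then; that row is only kept because the end of its 10-minute span maps to a different interval, which is what the third condition checks.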

This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
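
One way to sidestep that caveat is to collapse overlapping or nested intervals into disjoint ones before running the lookup. The helper below is a sketch that is not part of the original answer; `merge_overlapping` is a hypothetical name, and it assumes closed intervals, so intervals that merely touch are merged as well.

```python
import pandas as pd

def merge_overlapping(df2):
    """Collapse overlapping or nested [start_date, end_date] rows into disjoint rows."""
    d = df2.sort_values(['start_date', 'end_date']).reset_index(drop=True)
    # Latest end seen among all earlier rows (NaT for the first row).
    prev_max_end = d['end_date'].cummax().shift()
    # A new group starts when a row begins after every earlier row has ended.
    group = (d['start_date'] > prev_max_end).cumsum()
    return (d.groupby(group)
             .agg(start_date=('start_date', 'min'), end_date=('end_date', 'max'))
             .reset_index(drop=True))

# Illustrative data: a nested row, an overlapping row, and a separate row.
raw = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-24 00:00', '2016-11-24 00:30',
                                  '2016-11-24 00:50', '2016-11-24 02:00']),
    'end_date': pd.to_datetime(['2016-11-24 01:00', '2016-11-24 00:45',
                                '2016-11-24 01:30', '2016-11-24 03:00']),
})
merged = merge_overlapping(raw)
```

Merging does not change which rows of df1 are selected, because a timestamp falls into some interval of df2 exactly when it falls into the merged union.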

Using intervaltree:

Not quite a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against your intervals in df1 to find the ones that overlap.

The intervaltree package on PyPI seems to have good performance and an easy-to-use syntax.

from intervaltree import IntervalTree

# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.

Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why it's fine for me to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
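
The endpoint behaviour described above can be illustrated without intervaltree itself. In this sketch, `pd.Interval` with `closed='left'` is used purely as a stand-in for a half-open interval; it is not the intervaltree API, and the timestamps are chosen for illustration.

```python
import pandas as pd

start = pd.Timestamp('2016-11-24 01:03:18')
end = pd.Timestamp('2016-11-24 01:07:12')
one_ns = pd.Timedelta(1, 'ns')

# Half-open interval [start, end): the end point itself is excluded.
half_open = pd.Interval(start, end, closed='left')
# Padding the end by one nanosecond brings the real end point back in.
padded = pd.Interval(start, end + one_ns, closed='left')

end_in_half_open = end in half_open
end_in_padded = end in padded

# A 10-minute query span [00:50, 01:00) does not reach an interval starting at 01:00.
query = pd.Interval(pd.Timestamp('2016-11-24 00:50'),
                    pd.Timestamp('2016-11-24 01:00'), closed='left')
neighbour = pd.Interval(pd.Timestamp('2016-11-24 01:00'),
                        pd.Timestamp('2016-11-24 01:05'), closed='left')
touches = query.overlaps(neighbour)

print(end_in_half_open, end_in_padded, touches)  # False True False
```

This is why a full 10 minutes can be added on the query side: under half-open semantics the span stops just short of the next 10-minute mark, so it behaves like the 9m 59s span used in the searchsorted version.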

The resulting output of either method:

                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994
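
On small inputs, a brute-force comparison is a handy way to sanity-check either method. The sketch below is not from the original answer; `brute_force_overlap` is a hypothetical helper that compares every df1 span against every df2 interval via NumPy broadcasting, so its memory use grows with len(df1) * len(df2) and it is only meant for validation.

```python
import numpy as np
import pandas as pd

def brute_force_overlap(df1, df2, span=pd.Timedelta(minutes=9, seconds=59)):
    """Keep rows of df1 whose [date, date + span] overlaps any closed interval in df2."""
    start = df1['date'].values[:, None]          # shape (n, 1)
    end = (df1['date'] + span).values[:, None]   # shape (n, 1)
    # Two closed intervals overlap when each one starts before the other ends.
    hits = (start <= df2['end_date'].values) & (end >= df2['start_date'].values)
    return df1[hits.any(axis=1)]

# Sample data from the question.
df1 = pd.DataFrame({
    'date': pd.date_range('2016-11-24', periods=9, freq='10min'),
    'value': [1759.199951, 992.400024, 1404.800049, 45.799999, 24.299999,
              159.899994, 82.499999, 37.400003, 159.899994],
})
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date': pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12',
                                '2016-11-24 02:00:00']),
})

result = brute_force_overlap(df1, df2)
print(result.index.tolist())  # [0, 1, 6, 7, 8]
```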

Timings

Using the following setup to produce larger sample data:

# Sample df1.
n1 = 55000
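# ('10min' and '18h22min' replace the older '10T' and '18H22T' aliases, which recent pandas versions deprecate.)
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10min', periods=n1), 'value': np.random.random(n1)})

# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18h22min', periods=n2)})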

# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2

This produces the following df1 and df2:

df1
                  date     value
0     2016-11-24 00:00:00  0.444939
1     2016-11-24 00:10:00  0.407554
2     2016-11-24 00:20:00  0.460148
3     2016-11-24 00:30:00  0.465239
4     2016-11-24 00:40:00  0.462691
...
54995 2017-12-10 21:50:00  0.754123
54996 2017-12-10 22:00:00  0.401820
54997 2017-12-10 22:10:00  0.146284
54998 2017-12-10 22:20:00  0.394759
54999 2017-12-10 22:30:00  0.907233

df2
              start_date            end_date
0   2016-11-24 00:00:19 2016-11-24 00:41:24
1   2016-11-24 18:22:44 2016-11-24 18:36:44
2   2016-11-25 12:44:44 2016-11-25 13:03:13
3   2016-11-26 07:07:05 2016-11-26 07:49:29
4   2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53

And using the following functions for timing purposes:

def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
        ]

    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)

def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')

    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values

    return df1[mask]

def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values

    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])

    # DF FILTERING (between() is inclusive on both ends by default)
    date_end = df3['date'] + pd.Timedelta(minutes=9, seconds=59)
    mask = df3['start_date'].between(df3['date'], date_end) | df3['date'].between(df3['start_date'], df3['end_date'])
    return df3[mask].set_index('date')[['value', 'row']]

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)

I get the following timings:

%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops, best of 3: 9.55 ms per loop

%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops, best of 3: 13.5 ms per loop

%timeit ptrj(df1.copy(), df2.copy())
100 loops, best of 3: 18.5 ms per loop

%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop, best of 3: 4.02 s per loop

%timeit parfait(df1.copy(), df2.copy())
1 loop, best of 3: 8.96 s per loop
