R foverlaps equivalent in Python

Question

I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.

What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same - table_A occurs on a per second basis and can have multiple entries at each interval, while table_B will have one entry every 0 - 10 minutes at irregular intervals.

This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others

The following code provides my desired output in R:

# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]

# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)

# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]


> head(table_A)
   time                         x       y     dummy
1: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27 
2: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
3: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
4: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
5: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32
6: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32


> head(table_B)
                x       y   start_time              end_time
1:              6183    1   2016-07-11 12:00:45     2016-07-11 12:00:56 
2:              6183    1   2016-07-11 12:01:20     2016-07-11 12:01:20   
3:              6183    1   2016-07-11 12:01:40     2016-07-11 12:03:26  
4:              6183    1   2016-07-11 12:04:20     2016-07-11 12:04:40  
5:              6183    1   2016-07-11 12:04:55     2016-07-11 12:04:57  
6:              6183    1   2016-07-11 12:05:40     2016-07-11 12:05:51  

So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.

One thing that may not be apparent from the example data is that multiple x and y values can occur at times that fall within the same start_time to end_time interval.

> head(joined)
  y      x      start_time              end_time                time 
1 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
2 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
3 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
4 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
5 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
6 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:55 

Recommended answer

Consider a straightforward merge followed by a subset using pandas.Series.between(). The merge pairs every row of table_A with every row of table_B that shares the same x and y, and the subset then keeps only the rows whose time falls within the start_time to end_time interval.

df = pd.merge(table_A, table_B, on=['x', 'y'])
# pandas 1.3+ expects a string for inclusive; older versions used inclusive=True
df = df[df['time'].between(df['start_time'], df['end_time'], inclusive='both')]
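To make the merge-then-filter idea concrete, here is a minimal self-contained sketch. The two small tables are made-up sample data (not the question's real data), `interval_join` is a hypothetical helper name, and the string form of `inclusive` assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
table_A = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-11 12:00:46",
        "2016-07-11 12:00:55",
        "2016-07-11 12:01:00",  # falls in no interval of table_B
        "2016-07-11 12:01:45",
    ]),
    "x": [4077, 4077, 4077, 4077],
    "y": [1, 1, 1, 1],
})

table_B = pd.DataFrame({
    "x": [4077, 4077],
    "y": [1, 1],
    "start_time": pd.to_datetime(["2016-07-11 12:00:45", "2016-07-11 12:01:40"]),
    "end_time": pd.to_datetime(["2016-07-11 12:00:56", "2016-07-11 12:03:26"]),
})


def interval_join(a, b):
    """Join rows of `a` to rows of `b` on x and y where time lies in [start_time, end_time]."""
    # Step 1: pair every row of `a` with every row of `b` sharing the same x and y.
    merged = a.merge(b, on=["x", "y"])
    # Step 2: keep only the pairs whose time is inside the interval (both ends included).
    mask = merged["time"].between(
        merged["start_time"], merged["end_time"], inclusive="both"
    )
    return merged[mask].reset_index(drop=True)


joined = interval_join(table_A, table_B)
print(joined)
```

Note that the intermediate merge holds every (x, y) pairing before filtering, so on large tables it may need considerably more memory than the final result.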


However, one important point is that your dates should be cast to a datetime type. Your post currently shows string dates, which affects the .between() comparison above. The code below assumes US dates with the month first (MM/DD/YYYY). You can either convert the types while reading the file:

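from datetime import datetime

import pandas as pd

# pd.datetime is no longer available in recent pandas; use the standard library's datetime.
# Note: pandas 2.0+ deprecates date_parser in favor of the date_format argument.
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S')

table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)

table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)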

Or after reading the file:

table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')

table_B['start_time'], table_B['end_time'] = (pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S')
                                              for ser in [table_B['start_time'], table_B['end_time']])
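As a rough illustration of why the conversion matters, the sketch below compares the same made-up row as strings and as datetimes. The values and the `FMT` constant are assumptions chosen for the example (MM/DD/YYYY, as assumed above); with month-first strings, the comparison is lexicographic and can give the wrong answer across a year boundary.

```python
import pandas as pd

FMT = "%m/%d/%Y %H:%M:%S"  # assumed US-style format, month first

# Hypothetical row: the time clearly lies inside the interval chronologically.
raw = pd.DataFrame({
    "time": ["01/01/2016 00:00:05"],
    "start_time": ["12/31/2015 23:59:59"],
    "end_time": ["01/01/2016 00:00:10"],
})

# Compared as strings, "01/..." sorts before "12/...", so the row is rejected.
as_strings = raw["time"].between(raw["start_time"], raw["end_time"])

# After parsing each column to datetime, the comparison is chronological.
parsed = raw.apply(lambda col: pd.to_datetime(col, format=FMT))
as_datetimes = parsed["time"].between(parsed["start_time"], parsed["end_time"])

print("string comparison:  ", bool(as_strings.iloc[0]))
print("datetime comparison:", bool(as_datetimes.iloc[0]))
```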
