R foverlaps 在 Python 中等效 [英] R foverlaps equivalent in Python

查看:29
本文介绍了R foverlaps 在 Python 中等效的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试用 Python 重写一些 R 代码,但无法通过某一特定代码位.我发现 R 中的 foverlaps 函数在执行基于时间的连接时非常有用,但没有发现任何在 Python3 中也能正常工作的函数.

I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.

我正在做的是连接两个数据表,其中一个表中的 time 介于另一个表中的 start_timeend_time 之间.两个表的周期不一样 - table_A 每秒发生一次,每个间隔可以有多个条目,而 table_B 每 0 一个条目 -不定期间隔 10 分钟.

What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same - table_A occurs on a per second basis and can have multiple entries at each interval, while table_B will have one entry every 0 - 10 minutes at irregular intervals.

这个问题和我想问的很相似:合并一个值介于另外两个值之间的熊猫数据框

This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others

以下代码提供了我想要的 R 输出:

The following code provides my desired output in R:

# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]

# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)

# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]


> head(table_A)
   time                         x       y     dummy
1: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27 
2: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
3: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
4: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
5: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32
6: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32


> head(table_B)
                x       y   start_time              end_time
1:              6183    1   2016-07-11 12:00:45     2016-07-11 12:00:56 
2:              6183    1   2016-07-11 12:01:20     2016-07-11 12:01:20   
3:              6183    1   2016-07-11 12:01:40     2016-07-11 12:03:26  
4:              6183    1   2016-07-11 12:04:20     2016-07-11 12:04:40  
5:              6183    1   2016-07-11 12:04:55     2016-07-11 12:04:57  
6:              6183    1   2016-07-11 12:05:40     2016-07-11 12:05:51  

因此,table_A 中时间介于 start_time 和 end_time 之间的任何行都将与 table_B 中的相应行连接,给出如下输出.我在 Python 中尝试了很多不同的东西,但还没有找到解决方案.

So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.

从示例数据中可能不明显的一件事是多个 x 和 y 值出现在同一 start_timeend_timetimes 内代码>s.

One thing that may not be apparent from the example data is that multiple x and y values occur at times within the same start_time and end_times.

> head(joined)
  y      x      start_time              end_time                time 
1 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
2 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
3 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
4 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
5 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
6 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:55 

推荐答案

考虑使用 pandas.Series.between().合并连接连接列的所有组合,子集保留与时间间隔对齐的行.

Consider a straightforward merge with subset using pandas.Series.between(). Merge joins all combinations of the join columns and the subset keeps rows that align to time intervals.

df = pd.merge(table_A, table_B, on=['x', 'y'])                   
df = df[df['time'].between(df['start_time'], df['end_time'], inclusive=True)]

<小时>

但是,一个重要的项目是您的日期应该被转换为日期时间类型.目前,您的帖子会显示影响 .between() 上方的字符串日期.下面假设美国日期以月份为首的 MM/DD/YYYY.您可以在文件读入期间转换类型:


However, one important item is your dates should be casted as datetime type. Currently, your post shows string dates which affects above .between(). Below assumes US dates with month first as MM/DD/YYYY. Either you can convert types during file read in:

dateparse = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M:%S')

table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)

table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)

或读入后:

table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')

table_B['start_time'], table_B['end_time']=(pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S') 
                                    for ser in [table_B['start_time'], table_B['end_time']])

这篇关于R foverlaps 在 Python 中等效的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆