R foverlaps equivalent in Python
Question
I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python 3.
What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same: table_A has entries on a per-second basis and can have multiple entries at each timestamp, while table_B has one entry every 0 to 10 minutes at irregular intervals.
This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others
The following code provides my desired output in R:
# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]
# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)
# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]
> head(table_A)
time x y dummy
1: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
2: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
3: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
4: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
5: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
6: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
> head(table_B)
x y start_time end_time
1: 6183 1 2016-07-11 12:00:45 2016-07-11 12:00:56
2: 6183 1 2016-07-11 12:01:20 2016-07-11 12:01:20
3: 6183 1 2016-07-11 12:01:40 2016-07-11 12:03:26
4: 6183 1 2016-07-11 12:04:20 2016-07-11 12:04:40
5: 6183 1 2016-07-11 12:04:55 2016-07-11 12:04:57
6: 6183 1 2016-07-11 12:05:40 2016-07-11 12:05:51
So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as the one below. I've tried many different things in Python, but haven't found the solution yet.
One thing that may not be apparent from the example data is that multiple x and y values occur at times within the same start_time and end_time.
> head(joined)
y x start_time end_time time
1 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
2 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
3 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
4 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
5 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
6 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:55
Answer
Consider a straightforward merge followed by a subset using pandas.Series.between(). The merge joins all combinations of rows that share the join columns, and the subset keeps only the rows whose time falls within the interval.
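import pandas as pd
df = pd.merge(table_A, table_B, on=['x', 'y'])
# between() includes both endpoints by default; the boolean inclusive=True was deprecated in pandas 1.3
df = df[df['time'].between(df['start_time'], df['end_time'])]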
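To make the two steps concrete, here is a minimal sketch on made-up data shaped like the tables in the question. The sample rows and the names `candidates`, `mask`, and `joined` are assumptions for illustration only; the technique is the same merge-then-filter described above.

```python
import pandas as pd

# Hypothetical sample data shaped like the tables in the question.
table_A = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-07-11 12:00:46",
        "2016-07-11 12:00:46",  # duplicate timestamps are allowed
        "2016-07-11 12:00:58",  # falls in the gap between two intervals
        "2016-07-11 12:01:20",  # equals both bounds of a zero-length interval
    ]),
    "x": [4077, 4077, 4077, 4077],
    "y": [1, 1, 1, 1],
})
table_B = pd.DataFrame({
    "x": [4077, 4077, 6183],
    "y": [1, 1, 1],
    "start_time": pd.to_datetime([
        "2016-07-11 12:00:45", "2016-07-11 12:01:20", "2016-07-11 12:00:45",
    ]),
    "end_time": pd.to_datetime([
        "2016-07-11 12:00:56", "2016-07-11 12:01:20", "2016-07-11 12:00:56",
    ]),
})

# Step 1: the inner merge pairs every table_A row with every table_B row
# that shares the same (x, y), regardless of time.
candidates = pd.merge(table_A, table_B, on=["x", "y"])

# Step 2: keep only the pairs where time lies in [start_time, end_time].
# between() is inclusive on both ends by default.
mask = candidates["time"].between(candidates["start_time"], candidates["end_time"])
joined = candidates[mask].reset_index(drop=True)

print(len(candidates), len(joined))
print(joined)
```

Rows with no matching interval drop out, which mirrors `nomatch=0L` in the R call. Note that the intermediate frame holds every (x, y) pairing before filtering, so it can grow large when table_B has many intervals per key.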
However, one important point is that your dates should be cast to a datetime type. Currently, your post shows string dates, which affects the .between() call above. The code below assumes US dates with the month first (MM/DD/YYYY). You can either convert the types while reading the file:
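from datetime import datetime  # pd.datetime is deprecated and absent from recent pandas releases
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S')
# date_parser is deprecated as of pandas 2.0, where date_format='%m/%d/%Y %H:%M:%S' replaces it
table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)
table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)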
Or after reading:
table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')
table_B['start_time'], table_B['end_time']=(pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S') \
for ser in [table_B['start_time'], table_B['end_time']])
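As a small illustration of why the conversion matters, the sketch below compares the same interval test on strings and on parsed datetimes. The single-row frame `raw` and its values are made up for the example, and the MM/DD/YYYY layout is the assumption stated above.

```python
import pandas as pd

# Hypothetical timestamps in the assumed MM/DD/YYYY layout.
raw = pd.DataFrame({
    "time": ["01/01/2016 00:00:05"],
    "start_time": ["12/31/2015 23:59:00"],
    "end_time": ["01/01/2016 00:01:00"],
})

# As strings the comparison is lexicographic: "01/..." sorts before "12/...",
# so the row is wrongly reported as falling outside the interval.
as_text = raw["time"].between(raw["start_time"], raw["end_time"])

# After parsing, the comparison is chronological and the row matches.
fmt = "%m/%d/%Y %H:%M:%S"
parsed = raw.apply(lambda col: pd.to_datetime(col, format=fmt))
as_dates = parsed["time"].between(parsed["start_time"], parsed["end_time"])

print(as_text.tolist(), as_dates.tolist())  # [False] [True]
```

Checking `table_A.dtypes` and `table_B.dtypes` for `datetime64[ns]` before the join is a quick way to confirm the conversion took effect.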