R foverlaps 在 Python 中等效 [英] R foverlaps equivalent in Python
问题描述
我正在尝试用 Python 重写一些 R 代码,但无法通过某一特定代码位.我发现 R 中的 foverlaps
函数在执行基于时间的连接时非常有用,但没有发现任何在 Python3 中也能正常工作的函数.
I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps
function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.
我正在做的是连接两个数据表,其中一个表中的 time
介于另一个表中的 start_time
和 end_time
之间.两个表的周期不一样 - table_A
每秒发生一次,每个间隔可以有多个条目,而 table_B
每 0 一个条目 -不定期间隔 10 分钟.
What I am doing is joining two data tables where the time
in one table falls between the start_time
and end_time
in another table. The periodicity of the two tables is not the same - table_A
occurs on a per second basis and can have multiple entries at each interval, while table_B
will have one entry every 0 - 10 minutes at irregular intervals.
这个问题和我想问的很相似:合并一个值介于另外两个值之间的熊猫数据框一个>
This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others
以下代码提供了我想要的 R 输出:
The following code provides my desired output in R:
# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]
# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)
# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]
> head(table_A)
time x y dummy
1: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
2: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
3: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
4: 2016-07-11 11:52:27 4077 1 2016-07-11 11:52:27
5: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
6: 2016-07-11 11:52:32 4077 1 2016-07-11 11:52:32
> head(table_B)
x y start_time end_time
1: 6183 1 2016-07-11 12:00:45 2016-07-11 12:00:56
2: 6183 1 2016-07-11 12:01:20 2016-07-11 12:01:20
3: 6183 1 2016-07-11 12:01:40 2016-07-11 12:03:26
4: 6183 1 2016-07-11 12:04:20 2016-07-11 12:04:40
5: 6183 1 2016-07-11 12:04:55 2016-07-11 12:04:57
6: 6183 1 2016-07-11 12:05:40 2016-07-11 12:05:51
因此,table_A 中时间介于 start_time 和 end_time 之间的任何行都将与 table_B 中的相应行连接,给出如下输出.我在 Python 中尝试了很多不同的东西,但还没有找到解决方案.
So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.
从示例数据中可能不明显的一件事是多个 x 和 y 值出现在同一 start_time
和 end_time
times 内代码>s.
One thing that may not be apparent from the example data is that multiple x and y values occur at time
s within the same start_time
and end_time
s.
> head(joined)
y x start_time end_time time
1 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
2 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
3 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
4 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
5 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:46
6 1 4077 2016-07-11 12:00:45 2016-07-11 12:00:56 2016-07-11 12:00:55
推荐答案
考虑使用 pandas.Series.between()
.合并连接连接列的所有组合,子集保留与时间间隔对齐的行.
Consider a straightforward merge with subset using pandas.Series.between()
. Merge joins all combinations of the join columns and the subset keeps rows that align to time intervals.
df = pd.merge(table_A, table_B, on=['x', 'y'])
df = df[df['time'].between(df['start_time'], df['end_time'], inclusive=True)]
<小时>
但是,一个重要的项目是您的日期应该被转换为日期时间类型.目前,您的帖子会显示影响 .between()
上方的字符串日期.下面假设美国日期以月份为首的 MM/DD/YYYY
.您可以在文件读入期间转换类型:
However, one important item is your dates should be casted as datetime type. Currently, your post shows string dates which affects above .between()
. Below assumes US dates with month first as MM/DD/YYYY
. Either you can convert types during file read in:
dateparse = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M:%S')
table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)
table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)
或读入后:
table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')
table_B['start_time'], table_B['end_time']=(pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S')
for ser in [table_B['start_time'], table_B['end_time']])
这篇关于R foverlaps 在 Python 中等效的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!