R: Subset data frame based on multiple values for multiple variables
Question
I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start times, and event end times that match a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or an independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start time, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records, and combined the two subsets into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem: some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get these warnings:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
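The warnings above come from R's vector recycling: `==` compares two vectors position by position and repeats the shorter one to fill the longer one. A minimal sketch of the difference between that and a membership test, using made-up toy vectors (`num` and `ids` are illustrative names, not the asker's data):

```r
# Toy vectors, made up for illustration (not the asker's data)
num <- c(507646, 680845, 680845, 599091, 507646)  # length 5, like a df1 column
ids <- c(507646, 680845, 680845)                  # length 3, like records$ID

# `==` compares element by element and recycles `ids` to length 5, so
# num[4] is compared with ids[1] and num[5] with ids[2]. Without
# suppressWarnings(), R prints "longer object length is not a multiple
# of shorter object length" because 5 is not a multiple of 3.
elementwise <- suppressWarnings(num == ids)

# `%in%` asks, for each element of `num`, whether it occurs anywhere in `ids`
membership <- num %in% ids

print(elementwise)
print(membership)
```

Note that the last element (507646) is a genuine match that the position-wise comparison misses, which is why a subset built with `==` against a multi-row records can silently drop rows.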
Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
Recommended answer
If you're asking what I think you are, the filter() function from the dplyr package combined with the %in% operator (which is built on match()) does what you're looking for.
> library(dplyr)
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4
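One thing to watch in the output above: `A %in% df2$A` and `B %in% df2$B` are tested independently, so rows such as (1, 4) and (2, 3) are kept even though those pairs do not occur together in df2. If the ID, date, and times have to match row by row, a join may be closer to what the question needs. Below is a rough sketch in base R, assuming the question's column names (NUM, Date, Time in df1; ID, Date, Start, End in df2) and zero-padded "HH:MM" time strings; the data values are invented for illustration.

```r
# Hypothetical data using the question's column names; values are made up
df1 <- data.frame(
  NUM  = c(507646, 507646, 507646, 680845, 680845, 599091),
  Date = rep("04-02-2009", 6),
  Time = c("04:00", "07:00", "18:00", "06:00", "12:30", "10:00"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  ID    = c(507646, 680845),
  Date  = c("04-02-2009", "04-02-2009"),
  Start = c("05:00", "05:37"),
  End   = c("09:00", "11:29"),
  stringsAsFactors = FALSE
)

# Join row-wise on the (ID, Date) pair, so each df1 row is paired only
# with the df2 events that share both its ID and its date. Rows of df1
# with no matching event (NUM 599091 here) are dropped by the inner join.
joined <- merge(df1, df2, by.x = c("NUM", "Date"), by.y = c("ID", "Date"))

# Keep observations before the event start or after the event end.
# Zero-padded "HH:MM" strings sort in clock order, so plain string
# comparison is enough under that assumption.
final <- joined[joined$Time < joined$Start | joined$Time > joined$End, ]

print(final)
```

Because the join handles every matching date and ID at once, no FOR loop over match_dates is needed. One caveat: if the same ID has several events on one date, each df1 row is paired with every one of those events, so "outside the window" would then need to be defined per event or across all events before filtering.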