R: Subset data frame based on multiple values for multiple variables

Question

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:

I started just by matching the dates between the two data sets as follows:

match_dates <- as.character(intersect(df1$Date, df2$Date))

Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:

records <- df2[which(df2$Date == match_dates[1]), ]

The date, ID, start time, and end time in records are then:

[1] "01-04-2009" "599091"     "12:00"      "17:21" 

Finally, I subset df1 for the records before and after the event, based on the date, ID, and times in records, and combined them into a new data frame called final, which holds the df1 data I ultimately need.

before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)

Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:

records <- df2[which(df2$Date == match_dates[25]), ]

> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"

When I try to subset df1 based on this, I get warnings:

before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
  longer object length is not a multiple of shorter object length
2: In Date == records$Date :
  longer object length is not a multiple of shorter object length
3: In Time < records$Start :
  longer object length is not a multiple of shorter object length

Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!

Recommended answer

If you're asking what I think you are, the filter() function from the dplyr package, combined with the %in% operator (which is built on match()), does what you're looking for.

> library(dplyr)
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
   A B
1  1 1
2  1 2
3  1 3
4  1 4
5  2 1
6  2 2
7  2 3
8  2 4
9  3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
  A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
  A B
1 1 3
2 1 4
3 2 3
4 2 4
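
One caveat about the output above: the two %in% conditions are checked independently, so rows such as (1, 4) and (2, 3) are kept even though those pairs never appear together in df2. If the goal is to match whole rows of df2, a join is probably closer to what the question needs. Below is a minimal sketch using base R merge(). The obs and events data frames are made-up stand-ins for the question's df1 and df2 (the column names NUM, ID, Date, Time, Start, and End come from the question; the values are hypothetical), and it assumes times are zero-padded 24-hour "HH:MM" strings so that plain string comparison orders them correctly.

```r
# Part 1: the toy data from the answer above.
df1 <- data.frame(A = c(rep(1, 4), rep(2, 4), rep(3, 4)), B = rep(1:4, 3))
df2 <- data.frame(A = c(1, 2), B = c(3, 4))

# merge() performs an inner join by default, so only rows whose
# (A, B) pair occurs in both data frames are kept.
pairs <- merge(df1, df2, by = c("A", "B"))
print(pairs)

# Part 2: hypothetical stand-ins for the question's df1 (obs) and df2 (events).
obs <- data.frame(
  NUM  = c(599091, 599091, 599091, 680845, 680845, 507646),
  Date = c("01-04-2009", "01-04-2009", "01-04-2009",
           "04-02-2009", "04-02-2009", "04-02-2009"),
  Time = c("11:30", "13:00", "18:00", "05:00", "12:30", "10:00"),
  stringsAsFactors = FALSE
)
events <- data.frame(
  ID    = c(599091, 680845, 680845),
  Date  = c("01-04-2009", "04-02-2009", "04-02-2009"),
  Start = c("12:00", "05:37", "11:59"),
  End   = c("17:21", "11:29", "16:47"),
  stringsAsFactors = FALSE
)

# Pair every observation with each event that shares its ID and date.
# Observations with no matching event (NUM 507646 here) drop out.
joined <- merge(obs, events, by.x = c("NUM", "Date"), by.y = c("ID", "Date"))

# Keep observations recorded before the event starts or after it ends.
# Assumption: "HH:MM" strings are zero-padded, so string comparison is valid.
final <- joined[joined$Time < joined$Start | joined$Time > joined$End, ]
print(final)
```

Because the join pairs every record with each event sharing its ID and date, no loop over match_dates is needed, and the length-mismatch warnings from == do not arise. Note that a record is kept once for every event it falls outside of, so an ID with several events on one date can yield duplicate rows; whether that is wanted depends on the analysis.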
