R: subset a data frame based on conditions from another data frame

Question

Here is a problem I am trying to solve. Say I have two data frames like the following:

observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
    time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
    measurement = rnorm(100,5,7))

sampletimes <- data.frame(location = letters[1:20], 
    id = rep(1:4,5),
    time1 = rep(c(2,7,12,17,22), each=4), 
    time2 = rep(c(4,9,14,19,24), each=4))

They both contain a column named id, which links the data frames. I want the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to attach the appropriate location to each measurement.

I have successfully done this by converting sampletimes to a wide format (i.e. all the time1 and time2 information in one row per id), merging the two data frames by the id variable, using conditional statements to keep only the rows where time falls inside at least one of the time intervals in the row, and then assigning location to the appropriate measurement.

However, I have around 2 million rows in observations, and doing this takes a really long time. I'm looking for a better way that lets me keep the data in long format. The example dataset is very simple, but in reality my data contains a variable number of intervals and locations per id.

For our example, the data frame I would hope to get back is as follows:

id time measurement      location
1    3  10.5163892             a
2    3   5.5774119             b
3    3  10.5057060             c
4    3  14.1563179             d
1    8   2.2653761             e
2    8  -1.0905546             f
3    8  12.7434161             g
4    8  17.6129261             h
1   13  10.9234673             i
2   13   1.6974481             j
3   13  -0.3664951             k
4   13  13.8792198             l
1   18   6.5038847             m
2   18   1.2032935             n
3   18  15.0889469             o
4   18   0.8934357             p
1   23   3.6864527             q
2   23   0.2404074             r
3   23  11.6028766             s
4   23  20.7466908             t


Recommended answer

Not efficient, but it does the job:

 subset(merge(observations,sampletimes), time > time1 & time < time2)
        id time measurement location time1 time2
    11   1    3    3.180321        a     2     4
    47   1    8    6.040612        e     7     9
    83   1   13   -5.999317        i    12    14
    99   1   18    2.689414        m    17    19
    125  1   23   12.514722        q    22    24
    137  2    8    4.420679        f     7     9
    141  2    3   11.492446        b     2     4
    218  2   13    6.672506        j    12    14
    234  2   18   12.290339        n    17    19
    250  2   23   12.610828        r    22    24
    251  3    3    8.570984        c     2     4
    267  3    8   -7.112291        g     7     9
    283  3   13    6.287598        k    12    14
    360  3   23   11.941846        s    22    24
    364  3   18   -4.199001        o    17    19
    376  4    3    7.133370        d     2     4
    402  4    8   13.477790        h     7     9
    418  4   13    3.967293        l    12    14
    454  4   18   12.845535        p    17    19
    490  4   23   -1.016839        t    22    24
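
To see why the one-liner works: merge() joins on the column names the two data frames share (here only id), so every observation is paired with every interval for its id, and subset() then keeps the pairs whose time lies strictly inside the interval. The following is a minimal sketch on toy data; the names obs_small and sam_small and all values in them are made up for illustration and are not the question's data.

```r
# Toy data: hypothetical values, for illustration only
obs_small <- data.frame(id = c(1, 1, 1, 2, 2),
                        time = c(1, 3, 8, 3, 5),
                        measurement = c(10, 20, 30, 40, 50))
sam_small <- data.frame(location = c("a", "e", "b"),
                        id = c(1, 1, 2),
                        time1 = c(2, 7, 2),
                        time2 = c(4, 9, 4))

# merge() joins on the shared column "id": each observation is paired
# with every interval recorded for the same id
paired <- merge(obs_small, sam_small)

# Keep only pairs whose time lies strictly between time1 and time2,
# and drop the interval bounds from the result
res <- subset(paired, time > time1 & time < time2,
              select = c(id, time, measurement, location))
print(res)
```

If the interval endpoints should count as matches, the condition would be time >= time1 & time <= time2 instead. Note that the intermediate paired frame grows with the number of intervals per id, which is where the inefficiency comes from.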

Edit

Since you have a few million rows, you should try a data.table solution:

library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
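
The merge above still materialises every observation/interval pair per id before filtering. As a possible alternative (not part of the original answer), a data.table non-equi join can apply the interval condition during the join itself. The sketch below assumes a recent data.table release that supports non-equi joins and nomatch = NULL; obs_dt, sam_dt and their values are hypothetical toy data.

```r
library(data.table)

# Toy data: hypothetical values, for illustration only
obs_dt <- data.table(id = c(1, 1, 1, 2, 2),
                     time = c(1, 3, 8, 3, 5),
                     measurement = c(10, 20, 30, 40, 50))
sam_dt <- data.table(location = c("a", "e", "b"),
                     id = c(1, 1, 2),
                     time1 = c(2, 7, 2),
                     time2 = c(4, 9, 4))

# Non-equi join: for each interval in sam_dt, find the rows of obs_dt
# with the same id and time1 < time < time2.
# nomatch = NULL drops intervals that match no observation.
# x.time returns the original observation time (in a non-equi join the
# join column itself would otherwise show the interval bound).
res_dt <- obs_dt[sam_dt,
                 on = .(id, time > time1, time < time2),
                 nomatch = NULL,
                 .(id, time = x.time, measurement, location)]
print(res_dt)
```

Whether this is actually faster on the real 2-million-row data would need to be measured; the sketch only shows the shape of the call.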
