Matching based on different independent tables using data.table in R

Problem description

I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way?

I have a mock example, with some mock conditions here to illustrate my question:

main_data <- data.frame( pnum = c(1,2,3,4,5,6,7,8,9,10),
                         age = c(24,35,43,34,55,24,36,43,34,54),
                         gender = c("f","m","f","f","m","f","m","f","f","m"))

data_1 <- data.frame( pnum = c(1,4,5,8,9),
                      value_data_1 = c(1, 2, 1, 1, 1),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

data_2 <- data.frame( pnum = c(1,5,7,8,9),
                      value_data_2 = c(1, 2, 1, 1, 2),
                      date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01")))

I would like to create a new variable in my main_data table called "matching" that flags the rows which match between data_1 and data_2 under multiple conditions:

  • First, the value of data_1$value_data_1 must equal 1.
  • Second, the value of data_2$value_data_2 must also equal 1.
  • Third, pnum and date must match between data_1 and data_2.

When all these conditions are met, I would expect the new output of main_data to look like this:

> main_data
   pnum age gender matching
1     1  24      f        1
2     2  35      m        0
3     3  43      f        0
4     4  34      f        0
5     5  55      m        0
6     6  24      f        0
7     7  36      m        0
8     8  43      f        1
9     9  34      f        0
10   10  54      m        0

Until now, I have programmed each condition separately and created new placeholder tables in between, but this is not very memory efficient. Is there an efficient way to chain all the conditions using the data.table package specifically?

Recommended answer

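You can use Reduce(merge, list(...)). Note that the code below drops the date column and merges on pnum alone; that reproduces the expected output for this sample data, but it does not itself check that the dates agree: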

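library(data.table)

# Convert the data.frames to data.tables by reference (no copy)
setDT(main_data); setDT(data_1); setDT(data_2)

# Left-join both lookup tables onto main_data by pnum (date is dropped first),
# then derive the flag and remove the helper columns in one chained step
res <- Reduce(function(x, y) {
  merge(x, y, by = "pnum", all.x = TRUE)
}, list(main_data, data_1[, -"date"], data_2[, -"date"]))[, `:=`(
  # 1 only when both values equal 1; NA (no row in a lookup table) counts as 0
  matching = 1L - (value_data_1 != 1 | value_data_2 != 1 | is.na(value_data_1) | is.na(value_data_2)),
  value_data_1 = NULL,
  value_data_2 = NULL
)]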

Output

> res[]
    pnum age gender matching
 1:    1  24      f        1
 2:    2  35      m        0
 3:    3  43      f        0
 4:    4  34      f        0
 5:    5  55      m        0
 6:    6  24      f        0
 7:    7  36      m        0
 8:    8  43      f        1
 9:    9  34      f        0
10:   10  54      m        0
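
If the third condition (pnum and date agreeing between data_1 and data_2) needs to be enforced explicitly, one possible variation is to filter each lookup table first and then join on both columns. This is only a sketch and not part of the answer above: the intermediate name hits is made up for illustration, and nomatch = NULL assumes a reasonably recent data.table release (older releases use nomatch = 0L for the same inner-join behaviour).

```r
library(data.table)

# Same mock data as in the question, built directly as data.tables
main_data <- data.table(
  pnum   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  age    = c(24, 35, 43, 34, 55, 24, 36, 43, 34, 54),
  gender = c("f", "m", "f", "f", "m", "f", "m", "f", "f", "m")
)
data_1 <- data.table(
  pnum = c(1, 4, 5, 8, 9),
  value_data_1 = c(1, 2, 1, 1, 1),
  date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01"))
)
data_2 <- data.table(
  pnum = c(1, 5, 7, 8, 9),
  value_data_2 = c(1, 2, 1, 1, 2),
  date = as.Date(c("2019-01-01", "2018-07-01", "2018-01-01", "2016-07-01", "2016-07-01"))
)

# Conditions 1 and 2: keep only rows whose value equals 1.
# Condition 3: inner join on pnum AND date, so unmatched rows are dropped.
# "hits" is a hypothetical helper name, not something from the original answer.
hits <- data_1[value_data_1 == 1][data_2[value_data_2 == 1],
                                  on = .(pnum, date), nomatch = NULL]

# Flag main_data by reference; no merged copy of main_data is created
main_data[, matching := as.integer(pnum %in% hits$pnum)]

print(main_data)
```

Filtering before joining keeps the intermediate table small (only the rows that satisfy all three conditions), and := adds the column to main_data in place, which speaks to the memory concern raised in the question.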
