dplyr lookup table / pattern matching


Question

I was looking for a smarter, or "tidier", way to use a lookup table in the tidyverse, but could not find a satisfying solution.

I have a dataset and a lookup table:

library(dplyr)  # provides tibble(), filter() and %>%

# Sample data
data <- data.frame(patients = 1:5,
                   treatment = letters[1:5],
                   hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
                   response = rnorm(5))

# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))

... where each row in the lookup table is an exact pattern by which I want to filter the first tibble (data).

The desired result is:

# A tibble: 2 x 4
  patients treatment hospital response
     <dbl> <chr>     <chr>       <dbl>
1     1.00 a         yyy       -0.275 
2     5.00 e         uuu       -0.0967

The easiest solution I came up with is something like this:

as_tibble(data) %>% 
  filter(paste(hospital, patients) %in% paste(lookup$hospital, lookup$patients))

However, this must be something that a lot of people do regularly - is there a cleaner and more convenient way to do this (e.g. when the lookup table has more than two columns)?

Recommended answer

Since the default behavior of dplyr::inner_join() is to match on the columns common to the two tibbles passed to it, and the lookup table consists of only the two key columns, the shortest code is as follows:

library(dplyr)

# Sample data
data <- tibble(patients = 1:5,
               treatment = letters[1:5],
               hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
               response = rnorm(5))

# Lookup table
lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1,5))

data %>% inner_join(.,lookup)

... and the output:

> data %>% inner_join(.,lookup)
Joining, by = c("patients", "hospital")
# A tibble: 2 x 4
  patients treatment hospital response
     <dbl> <chr>     <chr>       <dbl>
1     1.00 a         yyy        -1.44 
2     5.00 e         uuu        -0.313
>

Because the desired output can be produced by joining the tibbles on their key columns, the paste() code in the question is unnecessary.
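
If the lookup table ever shares non-key columns with the data, or if you simply want to avoid the "Joining, by = ..." message, the key columns can be named through the by argument of inner_join(). The following is a minimal sketch, not part of the original answer; the fixed response values are an assumption that replaces rnorm() so the result is reproducible.

```r
library(dplyr)

# Fixed response values (an assumption) stand in for rnorm(5).
data <- tibble(patients = 1:5,
               treatment = letters[1:5],
               hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
               response = c(0.1, 0.2, 0.3, 0.4, 0.5))

lookup <- tibble(hospital = c("yyy", "uuu"), patients = c(1, 5))

# Name the key columns explicitly instead of relying on the default
# of joining on every column the two tibbles have in common.
matched <- inner_join(data, lookup, by = c("hospital", "patients"))
print(matched)
```

The same call extends to any number of key columns: list each one in by.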

Also note that inner_join() is the right type of join here because the desired output is the rows that match across both tibbles, and the lookup table has no duplicate rows. If the lookup table contained duplicate rows, inner_join() would return a matching data row once per duplicate, so semi_join() would be the appropriate function, per the comments on the question.
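
To illustrate the difference, here is a small sketch using a hypothetical lookup table, lookup_dup, that repeats one row. Neither lookup_dup nor the fixed response values come from the original question; they are assumptions for the example.

```r
library(dplyr)

# Fixed response values (an assumption) stand in for rnorm(5).
data <- tibble(patients = 1:5,
               treatment = letters[1:5],
               hospital = c("yyy", "yyy", "zzz", "www", "uuu"),
               response = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Hypothetical lookup table in which the ("yyy", 1) row appears twice.
lookup_dup <- tibble(hospital = c("yyy", "yyy", "uuu"),
                     patients = c(1, 1, 5))

# inner_join() returns one output row per matching pair of rows,
# so patient 1 shows up twice.
inner_rows <- inner_join(data, lookup_dup, by = c("hospital", "patients"))

# semi_join() only filters data: each row is kept at most once,
# and no columns from lookup_dup are added.
semi_rows <- semi_join(data, lookup_dup, by = c("hospital", "patients"))

nrow(inner_rows)  # 3
nrow(semi_rows)   # 2
```

Because semi_join() acts as a pure filter, it is also the closer equivalent of the filter() and paste() approach in the question.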
