dplyr left_join 小于、大于条件 [英] dplyr left_join by less than, greater than condition

查看:34
本文介绍了dplyr left_join 小于、大于条件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题与Efficiently merging关于非平凡标准的两个数据框检查日期是否在 r 中的两个日期之间.我在这里发布的一个请求是否存在该功能:GitHub 问题

我希望使用 dplyr::left_join() 连接两个数据帧.我用来连接的条件是小于、大于,即 <=>.dplyr::left_join() 支持这个功能吗?或者键只在它们之间使用 = 运算符.这很容易从 SQL 运行(假设我在数据库中有数据框)

这是一个 MWE:我有两个数据集,一个公司一年 (fdata),而第二个是每五年发生一次的调查数据.所以对于fdata中两个调查年之间的所有年份,我加入相应的调查年数据.

id <- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,5,5,5,5,8,8,8,8,13,13,13)fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,1998、1999、2000、2001、2002、2003、1998、1999、2000、2001、1998,1999,2000,2001,1998,1999,2000)byear <- c(1990,1995,2000,2005)年 <- c(1995,2000,2005,2010)val <- c(3,1,5,6)sdata <- tbl_df(data.frame(byear, eyear, val))fdata <- tbl_df(data.frame(id, fyear))test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

我明白了

<块引用>

错误:无法加入列 'TRUE' x 'TRUE':索引越界

除非left_join 可以处理这种情况,但我的语法缺少某些东西?

解决方案

使用 filter.(但请注意,此答案不会产生正确的LEFT JOIN;但 MWE 使用 INNER JOIN 给出了正确的结果.)

dplyr 包不满意如果被要求合并两个表而没有要合并的东西,所以在下面,我为此目的在两个表中创建了一个虚拟变量,然后过滤,然后删除虚拟:

fdata %>%变异(虚拟=真)%>%left_join(sdata %>% mutate(dummy=TRUE)) %>%filter(fyear >= byear, fyear < eyear) %>%选择(-虚拟)

请注意,如果您在 PostgreSQL 中执行此操作(例如),查询优化器会看穿 dummy 变量,如下两个查询解释所证明:

>f数据%>%+ 变异(虚拟=真)%>%+ left_join(sdata %>% mutate(dummy=TRUE)) %>%+ filter(fyear >= byear, fyear < eyear) %>%+ 选择(-虚拟)%>%+ 解释()加入者:哑巴"<SQL>SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"FROM "fdata") AS "zzz136"左加入(SELECT "byear", "eyear", "val", TRUE AS "dummy"FROM "sdata") AS "zzz137"USING ("dummy")) AS "zzz138"WHERE "fyear" >= "byear" AND "fyear" <一年"<计划>嵌套循环(成本=0.00..50886.88 行=322722 宽度=40)加入过滤器:((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))->fdata 上的 Seq 扫描(成本 = 0.00..28.50 行 = 1850 宽度 = 16)->实现(成本=0.00..33.55 行=1570 宽度=24)->sdata 上的 Seq 扫描(成本 = 0.00..25.70 行 = 1570 宽度 = 24)

使用 SQL 更简洁地执行完全相同的结果:

>tbl(pg, sql("+ 选择 *+ FROM fdata+ 左加入数据+ ON fyear >= byear AND fyear <年")) %>%+ 解释()<SQL>选择id"、fyear"、byear"、eyear"、val"从 (选择 *来自 fdata左连接数据ON fyear >= byear AND fyear <年)作为zzz140"<计划>嵌套循环左连接(成本=0.00..50886.88 行=322722 宽度=40)加入过滤器:((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))->fdata 上的 Seq 扫描(成本 = 0.00..28.50 行 = 1850 宽度 = 16)->实现(成本=0.00..33.55 行=1570 宽度=24)->sdata 上的 Seq 扫描(成本 = 0.00..25.70 行 = 1570 宽度 = 24)

This question is somewhat related to issues Efficiently merging two data frames on a non-trivial criteria and Checking if date is between two dates in r. And the one I have posted here requesting if the feature exist: GitHub issue

I am looking to join two dataframes using dplyr::left_join(). The condition I use to join is less-than, greater-than i.e, <= and >. Does dplyr::left_join() support this feature? or do the keys only take = operator between them. This is straightforward to run from SQL (assuming I have the dataframe in the database)

Here is a MWE: I have two datasets one firm-year (fdata), while second is sort of survey data that happens once every five years. So for all years in the fdata that are in between two survey years, I join the corresponding survey year data.

id <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,
        5,5,5,5,
        8,8,8,8,
        13,13,13)

fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

byear <- c(1990,1995,2000,2005)
eyear <- c(1995,2000,2005,2010)
val <- c(3,1,5,6)

sdata <- tbl_df(data.frame(byear, eyear, val))

fdata <- tbl_df(data.frame(id, fyear))

test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))

I get

Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 

Unless if left_join can handle the condition, but my syntax is missing something?

解决方案

Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)

The dplyr package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

fdata %>% 
    mutate(dummy=TRUE) %>%
    left_join(sdata %>% mutate(dummy=TRUE)) %>%
    filter(fyear >= byear, fyear < eyear) %>%
    select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

> fdata %>% 
+     mutate(dummy=TRUE) %>%
+     left_join(sdata %>% mutate(dummy=TRUE)) %>%
+     filter(fyear >= byear, fyear < eyear) %>%
+     select(-dummy) %>%
+     explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN 

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"


<PLAN>
Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactly the same result:

> tbl(pg, sql("
+     SELECT *
+     FROM fdata 
+     LEFT JOIN sdata 
+     ON fyear >= byear AND fyear < eyear")) %>%
+     explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
    SELECT *
    FROM fdata 
    LEFT JOIN sdata 
    ON fyear >= byear AND fyear < eyear) AS "zzz140"


<PLAN>
Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
  Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
  ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
  ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
        ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)

这篇关于dplyr left_join 小于、大于条件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆