dplyr left_join by less than, greater than condition
Question
This question is somewhat related to the questions "Efficiently merging two data frames on a non-trivial criteria" and "Checking if date is between two dates in r". I have also posted a request asking whether this feature exists: GitHub issue
I am looking to join two data frames using dplyr::left_join(). The join conditions are inequalities, i.e. <= and >. Does dplyr::left_join() support this, or do the keys only accept the = operator between them? This is straightforward to do in SQL (assuming I have the data frames in a database).

Here is an MWE. I have two datasets: one is firm-year data (fdata), and the second is survey data collected once every five years. For every year in fdata that falls between two survey years, I want to join the corresponding survey-year data.

    library(dplyr)

    # Firm-year data: one row per firm (id) and fiscal year (fyear)
    id <- c(1,1,1,1,
            2,2,2,2,2,2,
            3,3,3,3,3,3,
            5,5,5,5,
            8,8,8,8,
            13,13,13)
    fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
               1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
               1998,1999,2000,2001,1998,1999,2000)

    # Survey data: each value applies to the interval [byear, eyear)
    byear <- c(1990,1995,2000,2005)
    eyear <- c(1995,2000,2005,2010)
    val <- c(3,1,5,6)

    sdata <- tbl_df(data.frame(byear, eyear, val))
    fdata <- tbl_df(data.frame(id, fyear))

    # The attempted join, which fails
    test1 <- left_join(fdata, sdata, by = c("fyear" >= "byear","fyear" < "eyear"))
I get
Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds
Or can left_join handle this condition, and my syntax is just missing something?

Answer

Use a filter. (But note that this answer does not produce a correct LEFT JOIN: the filter drops any row of fdata that has no matching survey interval. The MWE happens to give the right result with what is effectively an INNER JOIN instead.)

The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

    fdata %>%
        mutate(dummy=TRUE) %>%
        left_join(sdata %>% mutate(dummy=TRUE)) %>%
        filter(fyear >= byear, fyear < eyear) %>%
        select(-dummy)
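The caveat about LEFT JOIN versus INNER JOIN can be made concrete with a small sketch. It is not part of the original answer: it uses a reduced version of the MWE plus one hypothetical firm-year (id 21, year 2012) that lies outside every survey interval, and it assumes that id and fyear together identify a row of fdata. Joining the filtered matches back onto fdata restores the unmatched row with NA values.

```r
library(dplyr)

# Reduced MWE data plus one hypothetical row (id 21, fyear 2012)
# that falls outside every survey interval.
fdata <- tibble(id = c(1, 1, 2, 21), fyear = c(1998, 2001, 2003, 2012))
sdata <- tibble(
  byear = c(1990, 1995, 2000, 2005),
  eyear = c(1995, 2000, 2005, 2010),
  val   = c(3, 1, 5, 6)
)

# Step 1: dummy-key join plus filter, as in the answer.
# This behaves like an inner join: the 2012 row is dropped.
# (Recent dplyr versions may warn about a many-to-many relationship here.)
matched <- fdata %>%
  mutate(dummy = TRUE) %>%
  left_join(sdata %>% mutate(dummy = TRUE), by = "dummy") %>%
  filter(fyear >= byear, fyear < eyear) %>%
  select(-dummy)

# Step 2: join the matches back onto fdata so that rows without a
# matching interval are kept, with NA in byear, eyear and val.
result <- left_join(fdata, matched, by = c("id", "fyear"))
```

Here matched has three rows while result has four, the extra one being the 2012 row with missing survey columns. The intermediate cross product grows with nrow(fdata) * nrow(sdata), so this is only practical for modest table sizes.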
And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable, as evidenced by the following two query explanations:

    > fdata %>%
    +     mutate(dummy=TRUE) %>%
    +     left_join(sdata %>% mutate(dummy=TRUE)) %>%
    +     filter(fyear >= byear, fyear < eyear) %>%
    +     select(-dummy) %>%
    +     explain()
    Joining by: "dummy"
    <SQL>
    SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
    FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
    FROM "fdata") AS "zzz136"
    LEFT JOIN
    (SELECT "byear", "eyear", "val", TRUE AS "dummy"
    FROM "sdata") AS "zzz137"
    USING ("dummy")) AS "zzz138"
    WHERE "fyear" >= "byear" AND "fyear" < "eyear"

    <PLAN>
    Nested Loop  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)
Doing it more cleanly with SQL gives exactly the same result:

    > tbl(pg, sql("
    +     SELECT *
    +     FROM fdata
    +     LEFT JOIN sdata
    +     ON fyear >= byear AND fyear < eyear")) %>%
    +     explain()
    <SQL>
    SELECT "id", "fyear", "byear", "eyear", "val"
    FROM (
        SELECT *
        FROM fdata
        LEFT JOIN sdata
        ON fyear >= byear AND fyear < eyear) AS "zzz140"

    <PLAN>
    Nested Loop Left Join  (cost=0.00..50886.88 rows=322722 width=40)
      Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
      ->  Seq Scan on fdata  (cost=0.00..28.50 rows=1850 width=16)
      ->  Materialize  (cost=0.00..33.55 rows=1570 width=24)
            ->  Seq Scan on sdata  (cost=0.00..25.70 rows=1570 width=24)
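For readers on a newer dplyr, here is a sketch of the join the question was originally asking for. It is not from the answer above and assumes dplyr 1.1.0 or later, where join_by() accepts inequality conditions. The small fdata used here is a hypothetical subset of the MWE. The sketch also shows why the attempt in the question fails: c("fyear" >= "byear", "fyear" < "eyear") is evaluated as ordinary string comparisons, so left_join() receives a logical vector rather than column names.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which provides join_by()

sdata <- tibble(
  byear = c(1990, 1995, 2000, 2005),
  eyear = c(1995, 2000, 2005, 2010),
  val   = c(3, 1, 5, 6)
)
# Hypothetical subset of the firm-year data from the question.
fdata <- tibble(id = c(1, 1, 2, 2), fyear = c(1998, 2001, 1999, 2003))

# The `by` argument from the question is evaluated before left_join()
# sees it: comparing two strings yields a logical value, not a join spec.
by_arg <- c("fyear" >= "byear", "fyear" < "eyear")
is.logical(by_arg)

# join_by() captures the conditions unevaluated; the left-hand side of
# each condition refers to fdata and the right-hand side to sdata.
test1 <- left_join(fdata, sdata, by = join_by(fyear >= byear, fyear < eyear))
```

Unlike the dummy-and-filter approach, this is a genuine left join: a firm-year with no matching interval would be kept, with NA in the survey columns.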