Fuzzy join two data frames using data.table


Problem description

I have been using fuzzyjoin to join two data frames, but due to memory issues the join fails with "cannot allocate memory of…". So I am trying to join the data using data.table instead. A sample of the data is below.

df1 looks like:

        ID     f_date               ACCNUM    flmNUM start_date   end_date
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20

df2 looks like:

     ID       date fyear     at     lt
1 50341 1998-12-31  1998 104382  94973
2 50341 1999-12-31  1999 190692 175385
3 50341 2000-12-31  2000 179519 163347
4 50341 2001-12-31  2001 203638 186030
5 50341 2002-12-31  2002 190453 173620
6 50341 2003-12-31  2003 200235 181955

I will focus on ID = 50341. If df2$date falls within the period between df1$start_date and df1$end_date, the rows should be joined. Here df2$date = 2002-12-31 lies between the df1 start date 2002-09-07 and end date 2003-08-30, so this row is joined.

I run the following code and get the corresponding output:

df1$f_date <- as.Date(df1$f_date)
df2$date <- as.Date(df2$date)

df1$start_date <- df1$f_date + 183
df1$end_date <- df1$f_date + 540

library(fuzzyjoin)
final_data <- fuzzy_left_join(
  df1, df2,
  by = c(
    "ID" = "ID",
    "start_date" = "date",
    "end_date" = "date"
  ),
  match_fun = list(`==`, `<`, `>=`)
)

final_data

Output:

      ID.x     f_date               ACCNUM    flmNUM start_date   end_date    ID.y       date fyear         at         lt
1    50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30   50341 2002-12-31  2002 190453.000 173620.000
2  1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 1067983 2010-12-31  2010 372229.000 209295.000
3   804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05  804753 2004-12-31  2004    982.265    383.614
4  1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 1090727 2013-12-31  2013  36212.000  29724.000
5  1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 1467858 2010-12-31  2010 138898.000 101739.000
6   858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24      NA       <NA>    NA         NA         NA
7     2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17    2488 2016-12-31  2016   3321.000   2905.000
8  1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03      NA       <NA>    NA         NA         NA
9  1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 1467858 2017-12-31  2017 212482.000 176282.000
10   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20   14693 2016-04-30  2015   4183.000   2621.000

Here we can see that ID = 50341 is joined correctly.

When I try the data.table way, I get this output:

Code:

library(data.table)
dt_final_data <- setDT(df2)[df1, on = .(ID, date > start_date, date <= end_date)]

Output:

         ID       date fyear         at         lt     date.1     f_date               ACCNUM    flmNUM
 1:   50341 2002-09-07  2002 190453.000 173620.000 2003-08-30 2002-03-08 0001104659-02-000656   2571187
 2: 1067983 2010-05-27  2010 372229.000 209295.000 2011-05-19 2009-11-25 0001047469-09-010426  91207220
 3:  804753 2004-11-13  2004    982.265    383.614 2005-11-05 2004-05-14 0001193125-04-088404   4805453
 4: 1090727 2013-11-21  2013  36212.000  29724.000 2014-11-13 2013-05-22 0000712515-13-000022  13865105
 5: 1467858 2010-08-28  2010 138898.000 101739.000 2011-08-20 2010-02-26 0001193125-10-043035  10640035
 6:  858877 2019-08-02    NA         NA         NA 2020-07-24 2019-01-31 0001166691-19-000005  19556540
 7:    2488 2016-08-25  2016   3321.000   2905.000 2017-08-17 2016-02-24 0001193125-16-476010 161452982
 8: 1478242 2004-09-11    NA         NA         NA 2005-09-03 2004-03-12 0001193125-04-039482   4664082
 9: 1467858 2017-08-18  2017 212482.000 176282.000 2018-08-10 2017-02-16 0001555280-17-000044  17618235
10:   14693 2016-04-28  2015   4183.000   2621.000 2017-04-20 2015-10-28 0001193125-15-356351 151180619
dt_final_data

Here start_date from df1 has become date, and end_date from df1 has become date.1. My original date column from df2 has therefore disappeared, and it is one of the more important dates for checking whether the merge worked as it should.

Two questions:

How can I keep all the date columns, as in the fuzzyjoin example? The way data.table has changed the names makes it a little confusing when I am checking the join.

Is the code/logic correct? I have looked at this joined data a number of times and it "appears" correct.

Data 1:

df1 <- 
    structure(list(ID = c(50341L, 1067983L, 804753L, 1090727L, 1467858L, 
858877L, 2488L, 1478242L, 1467858L, 14693L), f_date = structure(c(11754, 
14573, 12552, 15847, 14666, 17927, 16855, 12489, 17213, 16736
), class = "Date"), ACCNUM = c("0001104659-02-000656", "0001047469-09-010426", 
"0001193125-04-088404", "0000712515-13-000022", "0001193125-10-043035", 
"0001166691-19-000005", "0001193125-16-476010", "0001193125-04-039482", 
"0001555280-17-000044", "0001193125-15-356351"), flmNUM = c(2571187L, 
91207220L, 4805453L, 13865105L, 10640035L, 19556540L, 161452982L, 
4664082L, 17618235L, 151180619L), 
start_date = structure(c(11937, 14756, 12735, 16030, 14849, 18110, 17038, 
                         12672, 17396, 16919), class = "Date"), 
end_date = structure(c(12294, 15113, 13092, 16387, 15206, 18467, 17395, 13029,
                       17753, 17276), class = "Date")
), row.names = c(NA, -10L), class = "data.frame")

Data 2:

df2 <-
    structure(list(ID = c(2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 
2488L, 2488L, 2488L, 2488L, 2488L, 2488L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 
1067983L, 1067983L, 1067983L, 1067983L, 1067983L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 14693L, 
14693L, 14693L, 14693L, 50341L, 50341L, 50341L, 50341L, 50341L, 
50341L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 1467858L, 
1467858L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 1090727L, 
1090727L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 804753L, 
804753L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 1478242L, 
1478242L, 1478242L, 1478242L, 1478242L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 858877L, 
858877L, 858877L, 858877L, 858877L), date = structure(c(10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 10346, 10711, 11077, 11442, 
11807, 12172, 12538, 12903, 13268, 13633, 13999, 14364, 14729, 
15094, 15460, 15825, 16190, 16555, 16921, 17286, 17651, 10591, 
10956, 11322, 11687, 12052, 12417, 10591, 10956, 11322, 11687, 
12052, 12417, 12783, 13148, 13513, 13878, 14244, 14609, 14974, 
15339, 15705, 16070, 16435, 16800, 17166, 17531, 17896, 10591, 
10956, 11322, 11687, 12052, 12417, 12783, 13148, 13513, 13878, 
14244, 14609, 14974, 15339, 15705, 16070, 16435, 16800, 17166, 
17531, 17896, 10591, 10956, 11322, 11687, 12052, 12417, 12783, 
13148, 13513, 13878, 14244, 14609, 14974, 15339, 15705, 16070, 
16435, 16800, 17166, 17531, 17896, 14609, 14974, 15339, 15705, 
16070, 16435, 16800, 17166, 17531, 17896, 10438, 10803, 11169, 
11534, 11899, 12264, 12630, 12995, 13360, 13725, 14091, 14456, 
14821, 15186, 15552, 15917, 16282, 16647, 17013, 17378, 17743
), class = "Date"), fyear = c(1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 1997L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 
2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 
2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 
2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 1998L, 1999L, 
2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 
2017L, 2018L, 1998L, 1999L, 2000L, 2001L, 2002L, 2003L, 2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L), at = c(4252.968, 4377.698, 
5767.735, 5647.242, 5619.181, 7094.345, 7844.21, 7287.779, 13147, 
11550, 7675, 9078, 4964, 4954, 4000, 4337, 3767, 3109, 3321, 
3540, 4556, 122237, 131416, 135792, 162752, 169544, 180559, 188874, 
198325, 248437, 273160, 267399, 297119, 372229, 392647, 427452, 
484931, 526186, 552257, 620854, 702095, 707794, 1494, 1735, 1802, 
1939, 2016, 2264, 2376, 2624, 2728, 3551, 3405, 3475, 3383, 3712, 
3477, 3626, 4103, 4193, 4183, 4625, 4976, 104382, 190692, 179519, 
203638, 190453, 200235, 257389, 274730, 303100, 323969, 370782, 
448507, 479921, 476078, 186192, 148883, 91047, 136295, 138898, 
144603, 149422, 166344, 177677, 194520, 221690, 212482, 227339, 
17067, 23043, 21662, 24636, 26357, 28909, 33026, 35222, 33210, 
39042, 31879, 31883, 33597, 34701, 38863, 36212, 35471, 38311, 
40377, 45403, 50016, 436.485, 660.891, 616.411, 712.302, 779.279, 
859.34, 982.265, 1303.629, 1491.39, 1689.956, 1880.988, 2148.567, 
2422.79, 3000.358, 3704.468, 4098.364, 4530.565, 5561.984, 5629.963, 
6469.311, 6708.636, NA, NA, 2322.917, 2499.153, 3066.797, 3305.832, 
3926.316, 21208, 22742, 22549, 8916.705, 14725, 32870, 35238, 
37795, 37107, 35594, 33883, 43315, 53340, 58734, 68128, 81130, 
87095, 91759, 101191, 105134, 113481, 121652, 129818, 108784), 
    lt = c(2247.919, 2398.425, 2596.068, 2092.187, 3151.916, 
    3938.395, 3993.516, 3700.954, 7072, 8295, 7588, 7354, 3951, 
    3364, 3462, 3793, 3580, 3521, 2905, 2929, 3290, 63190, 72232, 
    72799, 103453, 104116, 102218, 102216, 106025, 137756, 149759, 
    153820, 161334, 209295, 223686, 235864, 260446, 283159, 293630, 
    334495, 350141, 355294, 677, 818, 754, 752, 705, 1424, 1291, 
    1314, 1165, 1978, 1680, 1659, 1488, 1652, 1408, 1998, 2071, 
    2288, 2621, 3255, 3660, 94973, 175385, 163347, 186030, 173620, 
    181955, 241738, 253490, 272218, 303516, 363134, 422932, 452164, 
    460442, 190443, 184363, 176387, 107340, 101739, 105612, 112422, 
    123170, 141653, 154197, 177615, 176282, 184562, 9894, 10569, 
    11927, 14388, 13902, 14057, 16642, 18338, 17728, 26859, 25099, 
    24187, 25550, 27593, 34130, 29724, 33313, 35820, 39948, 44373, 
    46979, 165.342, 281.954, 272.694, 317.463, 338.035, 363.494, 
    383.614, 541.81, 571.972, 556.242, 568.693, 567.769, 517.373, 
    689.557, 870.818, 930.7, 964.597, 1691.6, 1702.016, 1683.963, 
    1780.247, NA, NA, 3292.513, 3858.197, 3734.282, 4009.844, 
    4261.997, 12348, 14384, 15595, 1766.98, 3003, 6328, 8096, 
    9124, 9068, 9678, 10699, 19397, 21850, 24332, 29451, 36845, 
    39836, 40458, 42063, 48473, 53774, 58067, 63681, 65580)), row.names = c(NA, 
-163L), class = "data.frame")

Recommended answer

To clarify terminology:

The data.table approach to your problem does not require a fuzzy join (at least not in the sense of inexact matching). Instead, you want to join on data.table columns using the inequality operators >=, >, <= and/or <. In data.table terminology these are called "non-equi joins".
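As a minimal sketch of what a non-equi join looks like, the toy example below joins dated observations to one date window per id. The tables `events` and `windows` and the column name `obs_date` are hypothetical and exist only for illustration; the join syntax is the same `on = .(...)` form used in the question.

```r
library(data.table)

# Hypothetical toy data: 'events' holds dated observations,
# 'windows' holds one date interval per id.
events <- data.table(
  id    = c(1L, 1L, 2L),
  date  = as.Date(c("2020-01-15", "2020-06-15", "2020-03-01")),
  value = c(10, 20, 30)
)
windows <- data.table(
  id         = c(1L, 2L),
  start_date = as.Date(c("2020-01-01", "2020-04-01")),
  end_date   = as.Date(c("2020-03-31", "2020-12-31"))
)

# Equality on id, inequalities on the dates:
# keep events with start_date < date <= end_date.
# x.date asks for the date column from the x side (events).
res <- events[windows,
              .(id, start_date, end_date, obs_date = x.date, value),
              on = .(id, date > start_date, date <= end_date)]
print(res)
```

Each row of `windows` appears in the result; a window with no matching event (id 2 here) gets NA in the columns taken from `events`, which mirrors the left-join behaviour of the fuzzyjoin call in the question.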

You titled your question "fuzzyjoin two data frames using data.table", understandably, because you used library(fuzzyjoin) in your first working attempt. (No problem, just clarifying for readers.)

You were very close to a working data.table solution with:

dt_final_data <- setDT(df2)[df1, 
                            on = .(ID, date > start_date, date <= end_date)]

To make it work as you want, simply add a data.table j expression that selects the columns you want, in the order you want them, and prefix the problem column with x. (this tells data.table to return the column from the x side of the dt_x[dt_i, ] join). For example, as below, referring to the column as x.date:

dt_final_data <- setDT(df2)[df1, 
                            .(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt), 
                            on = .(ID, date > start_date, date <= end_date)]

This now gives you the output you are after:

dt_final_data
         ID     f_date               ACCNUM    flmNUM start_date   end_date     x.date fyear         at         lt
 1:   50341 2002-03-08 0001104659-02-000656   2571187 2002-09-07 2003-08-30 2002-12-31  2002 190453.000 173620.000
 2: 1067983 2009-11-25 0001047469-09-010426  91207220 2010-05-27 2011-05-19 2010-12-31  2010 372229.000 209295.000
 3:  804753 2004-05-14 0001193125-04-088404   4805453 2004-11-13 2005-11-05 2004-12-31  2004    982.265    383.614
 4: 1090727 2013-05-22 0000712515-13-000022  13865105 2013-11-21 2014-11-13 2013-12-31  2013  36212.000  29724.000
 5: 1467858 2010-02-26 0001193125-10-043035  10640035 2010-08-28 2011-08-20 2010-12-31  2010 138898.000 101739.000
 6:  858877 2019-01-31 0001166691-19-000005  19556540 2019-08-02 2020-07-24       <NA>    NA         NA         NA
 7:    2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31  2016   3321.000   2905.000
 8: 1478242 2004-03-12 0001193125-04-039482   4664082 2004-09-11 2005-09-03       <NA>    NA         NA         NA
 9: 1467858 2017-02-16 0001555280-17-000044  17618235 2017-08-18 2018-08-10 2017-12-31  2017 212482.000 176282.000
10:   14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30  2015   4183.000   2621.000

As shown above, your result for ID = 50341 now has x.date = 2002-12-31. In other words, the result column x.date now comes from df2$date.

You can of course rename the x.date column in your j expression:

setDT(df2)[ df1, 
            .(ID, 
              f_date, 
              ACCNUM, 
              flmNUM, 
              start_date, 
              end_date, 
              my_result_date_name = x.date, 
              fyear, 
              at, 
              lt), 
            on = .(ID, date > start_date, date <= end_date)]
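A different workaround, which is not part of the answer above, is to join on a copy of the date column so that the original column is left untouched. The sketch below uses hypothetical toy tables `x` and `i` and a hypothetical helper column `join_date`; treat it as one possible approach rather than the recommended one.

```r
library(data.table)

# Hypothetical toy data standing in for df2 (x) and df1 (i).
x <- data.table(
  id    = c(1L, 1L),
  date  = as.Date(c("2020-02-01", "2020-08-01")),
  value = c(10, 20)
)
i <- data.table(
  id         = 1L,
  start_date = as.Date("2020-01-01"),
  end_date   = as.Date("2020-03-31")
)

# Duplicate the date column and use the copy in the join conditions.
# The copy is overwritten by the join, the original 'date' is preserved.
x[, join_date := date]
res <- x[i, on = .(id, join_date > start_date, join_date <= end_date)]
print(res)
```

The trade-off is an extra column in the larger table, which may matter when memory is already tight, as it was in the question.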

Why does data.table (currently) rename columns in non-equi joins and return data from a different column?

This explanation from @ScottRitchie sums it up quite nicely:

When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned, and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().
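The behaviour described in the quote can be reproduced on a tiny example. The tables `x` and `i` below are hypothetical, and the comments describe what the data.table versions discussed here (up to 1.12.2) return, matching the output shown in the question; a later release may behave differently.

```r
library(data.table)

# Hypothetical one-row tables, enough to show the renaming.
x <- data.table(id = 1L, date = as.Date("2020-02-01"), value = 10)
i <- data.table(
  id         = 1L,
  start_date = as.Date("2020-01-01"),
  end_date   = as.Date("2020-03-31")
)

# No j expression: data.table decides which join columns to return.
res <- x[i, on = .(id, date > start_date, date <= end_date)]

# The column named 'date' carries the values of i$start_date, and
# 'date.1' carries i$end_date; x's own date (2020-02-01) is not returned.
print(names(res))
print(res)
```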

The above makes sense if you keep in mind that data.table had no non-equi joins before version 1.9.8.

Up to and including the current 1.12.2 release of data.table, this (and several overlapping issues) has been the source of a lot of discussion on the data.table GitHub issue tracker. For example, "possible inconsistency in non-equi join, returning join columns" (#3437) and "SQL-like column return for non-equi and rolling joins" (#2706) are just two of many.

However, watch this GitHub issue: continuing from the above discussions, the keen analytical minds of the data.table team are working to make this less confusing in some (hopefully not too distant) future version: "Both columns for rolling and non-equi joins" (#3093).
