由 data.table 中的第二个变量分组的滚动连接 [英] Rolling join grouped by a second variable in data.table
问题描述
您好,我想使用 data.table
包在 R 中执行滚动连接.加入日期"列时有多个匹配项,因此我想在字段"列的 data.table
中使用 by
参数来保持不同的数据从被连接在一起的领域.
示例数据
d1<-structure(list(Field = c(6", W62", 6", 6", 12S", 19-1",6"、6"、12S"、7"、6"、12S"、W62"、6"、12S"、W62"、12S"、6"、6"、7"、12S"、12S"、W62"、7"、12S"、6"、12S"、7"、12S"、7"、6"、7"、12S"、7"、6"、6"、6"、6"、12S"、7"、7"、6"、6"、12S"、7"、12S"、12S"、12S"、19-1"、"6"), 日期 = 结构 (c(16994, 17240, 17240, 17401, 17048, 17417,17387、17394、17382、17414、17029、17403、17045、17359、17179、17281、17152、16972、16987、17042、17282、17415、17281、17266、17179、17190、17057、17380、17280、17178、17178、17343、17373、17043、17190、17343、17253、16981、17079、17043、17270、17366、16981、17357、17366、17415、17079、17190、17385、17008),类=日期"),NlbsAcre = c(NA, 18874.6557383659, 2477.08251404958, NA,NA, 19658.0054165823, NA, NA, 12621.0827111083, NA, NA, 16764.41968227,16764.9745173044, 北美, 7671.24950330348, 21341.6444661863,5197.26333885612, NA, NA, NA, 39560.8958554292, 18162.4040880297,22578.1487456647、15842.9161753361、3613.95523726973、2601.07083566694、17766.9873538952, NA, 44728.1837479613, 2279.60909695434,2014.7720270382, 北美, 14847.7006686211, 北美, 3082.31758038481,NA, 2427.53558465175, NA, 23641.2999848709, NA, NA, NA, NA,5928.31591997149, 北美, 22162.2028819815, 18972.2228621189,6534.4257935542, 12630.9231775315, NA)), .Names = c("字段","日期", "NlbsAcre"), 类 = c("data.table", "data.frame"), row.names = c(NA,-50L), .internal.selfref = <指针: 0x0000000006540788>)d2<-结构(列表(字段 = c(6",W62",7",12S",19-1",12S",6"、6"、19-1"、19-1"、6"、7"、W62"、19-1"、12S"、7"、19-1"、7"、12S"、12S"、12S"、7"、6"、7"、6"、7"、W62"、19-1"、6"、6"、12S"、12S"、6"、6"、12S"、6"、12S"、19-1"、6"、W62"、W62"、6"、7"、7"、6"、19-1"、W62"、"6", "12S", "7"), 日期 = 结构 (c(16993, 17140, 17208, 17443,17063、16948、17415、16926、17316、16922、16981、17043、17219、17252、17392、17244、17179、17017、17042、17031、17013、17104、17273、16954、17364、16993、17168、17028、17208、16966、17241、16945、17038、17169、17379、17183、17238、17054、17244、16952、17044、17359、17219、17303、17007、17151、16926、17178、17382、17364), class = "Date"), TotN = c(79.244802845739, 94.193700050628,21.075505564932, 692.152760834712, 224.689064446728, 172.576578578436,47.406177406404、102.53239575903、818.80997295717、476.174916307807、125.828033450364, 58.270026966444, 75.465909993456, 435.049246131543,337.913876678769, 31.714327953234, 305.353940577156, 72.621457768224,393.815453005314, 428.540114240892, 318.97091713563, 73.888113736431,79.0380747113805, 147.493527174027, 65.5311189906495, 59.269732271703,119.390398108236, 110.706003557451, 21.96790939404, 149.060445984684,128.143343232486, 208.621943093862, 75.770138571561, 47.496596179338,132.723654607278, 43.92222198012, 145.150910469252, 215.88105225024,21.393670871196, 72.969536052, 86.335878117078, 103.524169592979,19.920230115264, 44.968722966108, 62.244487239885, 338.593490463303,96.7285416279375、45.537296152302、422.630318314444、58.5336350807685)), .Names = c("Field", "Date", "TotN"), class = c("data.table","data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000006540788>)
我的尝试
这是我对日期"列执行滚动连接的尝试,将 by
分组为字段"列.显然,我可以按字段"拆分数据并单独处理,但我想避免使用该选项.
d1[d2, roll = "nearest", on = .(Date), by = .(Field)]`[.data.table`(d1, d2, roll = "nearest", on = .(Date), by = .(Field)) 中的错误:提供了 'by' 或 'keyby' 但不提供 j
你快到了.
您可以同时加入多个列.因此,除了日期"之外,您还可以在 on
子句中包含字段".但请注意?data.table
中roll
参数的说明:
滚动连接适用于 last 连接列
因此,对于要用于滚动连接的日期",请将其指定为 on
中的 last 变量:
库(data.table)d1[d2, roll = "最近的", on = .(Field, Date)]
为了更好的验证,结果可以排序
d1[d2, roll = "nearest", on = .(Field, Date)][order(Field, Date)]
<块引用>
字段日期 NlbsAcre TotN1: 12S 2016-05-24 不适用 208.621942: 12S 2016-05-27 不适用 172.576583: 12S 2016-07-31 不适用 318.970924: 12S 2016-08-18 不适用 428.540115: 12S 2016-08-29 不适用 393.815456:12S 2017-03-13 44728.184 145.150917:12S 2017-03-16 44728.184 128.143348:12S 2017-08-01 12621.083 132.723659:12S 2017-08-04 12621.083 422.6303210:12S 2017-08-14 12621.083 337.9138811:12S 2017-10-04 22162.203 692.1527612: 19-1 2016-05-01 12630.923 476.1749213: 19-1 2016-08-15 12630.923 110.7060014: 19-1 2016-09-10 12630.923 215.8810515: 19-1 2016-09-19 12630.923 224.6890616: 19-1 2016-12-16 12630.923 338.5934917: 19-1 2017-01-13 12630.923 305.3539418:19-1 2017-03-27 12630.923 435.0492519:19-1 2017-05-30 12630.923 818.8099720:6 2016-05-05 不适用 102.5324021:6 2016-06-14 不适用 149.0604522:6 2016-06-29 不适用 125.8280323:6 2016-06-29 不适用 125.8280324:6 2016-07-11 不适用 79.2448025:6 2016-07-25 不适用 62.2444926:6 2016-08-25 不适用 75.7701427:6 2017-01-03 2014.772 47.4966028:6 2017-01-12 2014.772 45.5373029:6 2017-01-17 2014.772 43.9222230:6 2017-02-11 3082.318 21.9679131:6 2017-03-19 2477.083 21.3936732:6 2017-04-17 2427.536 79.0380733:6 2017-07-12 不适用 103.5241734:6 2017-07-17 不适用 65.5311235:6 2017-09-06 不适用 47.4061836:7 2016-06-02 不适用 147.4935337:7 2016-07-11 不适用 59.2697338:7 2016-08-04 不适用 72.6214639:7 2016-08-30 不适用 58.2700340:7 2016-08-30 不适用 58.2700341:7 2016-10-30 不适用 73.8881142:7 2017-02-11 2279.609 21.0755143:7 2017-02-22 2279.609 19.9202344:7 2017-03-19 15842.916 31.7143345:7 2017-05-17 不适用 44.9687246:7 2017-07-17 不适用 58.5336447:W62 2016-05-05 16764.975 96.7285448:W62 2016-05-31 16764.975 72.9695449:W62 2016-08-31 16764.975 86.3358850:W62 2016-12-05 16764.975 94.1937051:W62 2017-01-02 18874.656 119.3904052:W62 2017-02-22 18874.656 75.46591田间日期 NlbsAcre TotN
Hello I would like to perform a rolling join in R using the data.table
package. There are multiple matches when joining on the "Date" column so I would like to use the by
argument in data.table
on the "Field" column to keep data from different fields from being joined together.
Example Data
d1<-structure(list(Field = c("6", "W62", "6", "6", "12S", "19-1",
"6", "6", "12S", "7", "6", "12S", "W62", "6", "12S", "W62", "12S",
"6", "6", "7", "12S", "12S", "W62", "7", "12S", "6", "12S", "7",
"12S", "7", "6", "7", "12S", "7", "6", "6", "6", "6", "12S",
"7", "7", "6", "6", "12S", "7", "12S", "12S", "12S", "19-1",
"6"), Date = structure(c(16994, 17240, 17240, 17401, 17048, 17417,
17387, 17394, 17382, 17414, 17029, 17403, 17045, 17359, 17179,
17281, 17152, 16972, 16987, 17042, 17282, 17415, 17281, 17266,
17179, 17190, 17057, 17380, 17280, 17178, 17178, 17343, 17373,
17043, 17190, 17343, 17253, 16981, 17079, 17043, 17270, 17366,
16981, 17357, 17366, 17415, 17079, 17190, 17385, 17008), class = "Date"),
NlbsAcre = c(NA, 18874.6557383659, 2477.08251404958, NA,
NA, 19658.0054165823, NA, NA, 12621.0827111083, NA, NA, 16764.41968227,
16764.9745173044, NA, 7671.24950330348, 21341.6444661863,
5197.26333885612, NA, NA, NA, 39560.8958554292, 18162.4040880297,
22578.1487456647, 15842.9161753361, 3613.95523726973, 2601.07083566694,
17766.9873538952, NA, 44728.1837479613, 2279.60909695434,
2014.7720270382, NA, 14847.7006686211, NA, 3082.31758038481,
NA, 2427.53558465175, NA, 23641.2999848709, NA, NA, NA, NA,
5928.31591997149, NA, 22162.2028819815, 18972.2228621189,
6534.4257935542, 12630.9231775315, NA)), .Names = c("Field",
"Date", "NlbsAcre"), class = c("data.table", "data.frame"), row.names = c(NA,
-50L), .internal.selfref = <pointer: 0x0000000006540788>)
d2<-structure(list(Field = c("6", "W62", "7", "12S", "19-1", "12S",
"6", "6", "19-1", "19-1", "6", "7", "W62", "19-1", "12S", "7",
"19-1", "7", "12S", "12S", "12S", "7", "6", "7", "6", "7", "W62",
"19-1", "6", "6", "12S", "12S", "6", "6", "12S", "6", "12S",
"19-1", "6", "W62", "W62", "6", "7", "7", "6", "19-1", "W62",
"6", "12S", "7"), Date = structure(c(16993, 17140, 17208, 17443,
17063, 16948, 17415, 16926, 17316, 16922, 16981, 17043, 17219,
17252, 17392, 17244, 17179, 17017, 17042, 17031, 17013, 17104,
17273, 16954, 17364, 16993, 17168, 17028, 17208, 16966, 17241,
16945, 17038, 17169, 17379, 17183, 17238, 17054, 17244, 16952,
17044, 17359, 17219, 17303, 17007, 17151, 16926, 17178, 17382,
17364), class = "Date"), TotN = c(79.244802845739, 94.193700050628,
21.075505564932, 692.152760834712, 224.689064446728, 172.576578578436,
47.406177406404, 102.53239575903, 818.80997295717, 476.174916307807,
125.828033450364, 58.270026966444, 75.465909993456, 435.049246131543,
337.913876678769, 31.714327953234, 305.353940577156, 72.621457768224,
393.815453005314, 428.540114240892, 318.97091713563, 73.888113736431,
79.0380747113805, 147.493527174027, 65.5311189906495, 59.269732271703,
119.390398108236, 110.706003557451, 21.96790939404, 149.060445984684,
128.143343232486, 208.621943093862, 75.770138571561, 47.496596179338,
132.723654607278, 43.92222198012, 145.150910469252, 215.88105225024,
21.393670871196, 72.969536052, 86.335878117078, 103.524169592979,
19.920230115264, 44.968722966108, 62.244487239885, 338.593490463303,
96.7285416279375, 45.537296152302, 422.630318314444, 58.5336350807685
)), .Names = c("Field", "Date", "TotN"), class = c("data.table",
"data.frame"), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000006540788>)
What I have tried
This is my attempt to perform a rolling join on the "Date" column, grouped by
the "Field" column. Obviously I can split the data by "Field" and process separately but I would like to avoid that option.
d1[d2, roll = "nearest", on = .(Date), by = .(Field)]
Error in `[.data.table`(d1, d2, roll = "nearest", on = .(Date), by = .(Field)) :
'by' or 'keyby' is supplied but not j
You were almost there.
You can join on multiple columns simultaneously. So, in addition to "Date", you can include "Field" in the on
clause. But please note the description of the roll
argument in ?data.table
:
Rolling joins apply to the last join column
Thus, for "Date" to be used for the rolling join, specify it as the last variable in on
:
library(data.table)
d1[d2, roll = "nearest", on = .(Field, Date)]
For better verification, the result can be ordered
d1[d2, roll = "nearest", on = .(Field, Date)][order(Field, Date)]
Field Date NlbsAcre TotN 1: 12S 2016-05-24 NA 208.62194 2: 12S 2016-05-27 NA 172.57658 3: 12S 2016-07-31 NA 318.97092 4: 12S 2016-08-18 NA 428.54011 5: 12S 2016-08-29 NA 393.81545 6: 12S 2017-03-13 44728.184 145.15091 7: 12S 2017-03-16 44728.184 128.14334 8: 12S 2017-08-01 12621.083 132.72365 9: 12S 2017-08-04 12621.083 422.63032 10: 12S 2017-08-14 12621.083 337.91388 11: 12S 2017-10-04 22162.203 692.15276 12: 19-1 2016-05-01 12630.923 476.17492 13: 19-1 2016-08-15 12630.923 110.70600 14: 19-1 2016-09-10 12630.923 215.88105 15: 19-1 2016-09-19 12630.923 224.68906 16: 19-1 2016-12-16 12630.923 338.59349 17: 19-1 2017-01-13 12630.923 305.35394 18: 19-1 2017-03-27 12630.923 435.04925 19: 19-1 2017-05-30 12630.923 818.80997 20: 6 2016-05-05 NA 102.53240 21: 6 2016-06-14 NA 149.06045 22: 6 2016-06-29 NA 125.82803 23: 6 2016-06-29 NA 125.82803 24: 6 2016-07-11 NA 79.24480 25: 6 2016-07-25 NA 62.24449 26: 6 2016-08-25 NA 75.77014 27: 6 2017-01-03 2014.772 47.49660 28: 6 2017-01-12 2014.772 45.53730 29: 6 2017-01-17 2014.772 43.92222 30: 6 2017-02-11 3082.318 21.96791 31: 6 2017-03-19 2477.083 21.39367 32: 6 2017-04-17 2427.536 79.03807 33: 6 2017-07-12 NA 103.52417 34: 6 2017-07-17 NA 65.53112 35: 6 2017-09-06 NA 47.40618 36: 7 2016-06-02 NA 147.49353 37: 7 2016-07-11 NA 59.26973 38: 7 2016-08-04 NA 72.62146 39: 7 2016-08-30 NA 58.27003 40: 7 2016-08-30 NA 58.27003 41: 7 2016-10-30 NA 73.88811 42: 7 2017-02-11 2279.609 21.07551 43: 7 2017-02-22 2279.609 19.92023 44: 7 2017-03-19 15842.916 31.71433 45: 7 2017-05-17 NA 44.96872 46: 7 2017-07-17 NA 58.53364 47: W62 2016-05-05 16764.975 96.72854 48: W62 2016-05-31 16764.975 72.96954 49: W62 2016-08-31 16764.975 86.33588 50: W62 2016-12-05 16764.975 94.19370 51: W62 2017-01-02 18874.656 119.39040 52: W62 2017-02-22 18874.656 75.46591 Field Date NlbsAcre TotN
这篇关于由 data.table 中的第二个变量分组的滚动连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!