通过data.table roll='nearest'函数合并两组数据 [英] Merging two sets of data by data.table roll='nearest' function

查看:17
本文介绍了通过data.table roll='nearest'函数合并两组数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两组数据.

set_A 的样本(总行数:45467):

Sample of set_A (total number of rows: 45467):

ID_a    a1  a2  a3  time_a
2   35694   5245.2  301.6053    00.00944
3   85694   9278.9  301.6051    23.00972
4   65694   9375.2  301.6049    22.00972
5   85653   4375.5  301.6047    19.00972
6   12694   5236.3  301.6045    22.00972
7   85697   5345.2  301.6043    21.00972
8   85640   5274.1  301.6041    20.01000
9   30694   5279.0  301.6039    20.01000

set_B 的样本(总行数:4798):

Sample of set_B (total number of rows: 4798):

ID_b    b1  b2  source  time_b
2   34.20   15.114  set1.csv.1  20.35750
7   67.20   16.114  set1.csv.2  21.35778
12  12.20   33.114  set1.csv.3  22.35806
17  73.20   67.114  set2.csv.1  23.35833
23  88.20   42.114  set2.csv.2  19.35861
28  90.20   52.114  set3.csv.1  00.35889

我对set_B 的结果感兴趣,set_A 中的行与time_a 的最近值匹配time_b(输出总行数:4798).在 set_A 中,time_a 的值可以重复多次(例如 ID_a[8,][ID_a[9,]) - 哪一行将与来自 set_B 的行合并并不重要(在本例中为 ID_b[1,]).预期结果示例:

I am interested in a result where to the set_B, rows from set_A are matched by the nearest values of time_a and time_b (total number of output rows: 4798). In set_A values of time_a could be repeated several times (eg. ID_a[8,] and [ID_a[9,]) - it doesn't really matter which row will be merged with a row from set_B (in this case ID_b[1,]). Example of the expected result:

ID_b    b1  b2  source  time_b      ID_a    a1  a2  a3  time_a
2   34.20   15.114  set1.csv.1  20.35750    8   85640   5274.1  301.6041    20.01000
7   67.20   16.114  set1.csv.2  21.35778    7   85697   5345.2  301.6043    21.00972
12  12.20   33.114  set1.csv.3  22.35806    4   65694   9375.2  301.6049    22.00972
17  73.20   67.114  set2.csv.1  23.35833    3   85694   9278.9  301.6051    23.00972
23  88.20   42.114  set2.csv.2  19.35861    5   85653   4375.5  301.6047    19.00972
28  90.20   52.114  set3.csv.1  00.35889    2   35694   5245.2  301.6053    00.00944

我在 stackoverflow 上遇到了许多类似的问题,我真的很喜欢 data.table 库代码,因为它们看起来非常优雅.然而,我做了几次失败的尝试,我收到了一个基于两个集合(总行数 45467)构建的表,或者只将一列 time_a 合并到 set_B... 尽管如此,我不会挑剔,如果有人有其他想法,我将非常感谢您的帮助.

I went through many similar questions on stackoverflow and I really like data.table library codes as they look really elegant. However, I made several failed attempts where I received either a table constructed based on two sets (total no. of rows 45467) or merged only one column time_a to the set_B... Nevertheless, I won't be picky and if somebody has another idea I would be very grateful for help.

我正在处理的代码示例:

Example of a code on which I'm working on:

setDT(set_B)
setDT(set_A)
setkey(set_B, time_b) [, time_a:=time_b]
test_ab <- set_B[set_A, roll='nearest']

因此,我收到的不仅是一个包含应该被忽略的数据的表,而且还有列名中的一团糟"(例如,包含 ID_a 值的列被称为 time_a).

As a result I am receiving not only a table with data which should be neglaced but also "a mess" in column names (e.g. a column which contains ID_a values is called time_a).

我真的很感谢你的帮助!

I would really appricieate your help!

推荐答案

以下是基于您提供的示例数据的分步示例:

Here is a step-by-step example based on the sample data you give:

# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)    

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)

# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]

unique(set_merged[order(ID_b)], by = "time")
#    ID_b   b1     b2     source   time_b     time ID_a    a1     a2       a3
# 1:    2 34.2 15.114 set1.csv.1 20.35750 20.01000    8 85640 5274.1 301.6041
# 2:    7 67.2 16.114 set1.csv.2 21.35778 21.00972    7 85697 5345.2 301.6043
# 3:   12 12.2 33.114 set1.csv.3 22.35806 22.00972    4 65694 9375.2 301.6049
# 4:   17 73.2 67.114 set2.csv.1 23.35833 23.00972    3 85694 9278.9 301.6051
# 5:   23 88.2 42.114 set2.csv.2 19.35861 19.00972    5 85653 4375.5 301.6047
# 6:   28 90.2 52.114 set3.csv.1  0.35889  0.00944    2 35694 5245.2 301.6053
#      time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6:  0.00944

两条评论:

  1. 我们创建了一个新的 time 列,以避免丢失来自 set_Aset_B 的原始时间列之一.如果需要,您始终可以在加入后删除 time 列.
  2. 我们使用unique按照ID_b的顺序删除重复的time行.您在帖子中提到合并哪一行并不重要",但如果您确实想保留特定行,您可能需要调整这行代码.
  1. We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
  2. We use unique to remove duplicated time rows by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged" but in case that you do want to retain specific rows, you might need to adjust this line of code.

<小时>

更新(感谢@Henrik)

正如@Henrik 指出的那样,你所追求的实际上是 set_A 相对于 set_B 的滚动连接,在这种情况下你不需要处理与重复的行.


Update (thanks to @Henrik)

As @Henrik pointed out, what you're after is actually a rolling join of set_A with respect to set_B, in which case you don't need to deal with the duplicate rows.

翻译成

library(data.table)
setDT(set_A)
setDT(set_B)    

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]

set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
#   ID_a    a1     a2       a3   time_a     time ID_b   b1     b2     source
#1:    2 35694 5245.2 301.6053  0.00944  0.35889   28 90.2 52.114 set3.csv.1
#2:    3 85694 9278.9 301.6051 23.00972 23.35833   17 73.2 67.114 set2.csv.1
#3:    5 85653 4375.5 301.6047 19.00972 19.35861   23 88.2 42.114 set2.csv.2
#4:    6 12694 5236.3 301.6045 22.00972 22.35806   12 12.2 33.114 set1.csv.3
#5:    7 85697 5345.2 301.6043 21.00972 21.35778    7 67.2 16.114 set1.csv.2
#6:    9 30694 5279.0 301.6039 20.01000 20.35750    2 34.2 15.114 set1.csv.1
#  time_b
#1:  0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750

<小时>

示例数据

set_A <- read.table(text =
    "ID_a    a1  a2  a3  time_a
2   35694   5245.2  301.6053    00.00944
3   85694   9278.9  301.6051    23.00972
4   65694   9375.2  301.6049    22.00972
5   85653   4375.5  301.6047    19.00972
6   12694   5236.3  301.6045    22.00972
7   85697   5345.2  301.6043    21.00972
8   85640   5274.1  301.6041    20.01000
9   30694   5279.0  301.6039    20.01000", header = T)

set_B <- read.table(text =
    "ID_b    b1  b2  source  time_b
2   34.20   15.114  set1.csv.1  20.35750
7   67.20   16.114  set1.csv.2  21.35778
12  12.20   33.114  set1.csv.3  22.35806
17  73.20   67.114  set2.csv.1  23.35833
23  88.20   42.114  set2.csv.2  19.35861
28  90.20   52.114  set3.csv.1  00.35889", header = T)

这篇关于通过data.table roll='nearest'函数合并两组数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆