Efficient way of labelling based on start and end position


Problem description

I have two data frames:

das <- data.frame(val=1:20,
              type =c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C"),
              weigh=c(20,22,23,32,34,54,19,22,24,26,31,34,36,37,51,54,31,35,43,45))

mapper <- data.frame(type  = c("A","A","A","A","B","B","B","B","C","C","C","C"),
                     start = c(19,23,27,37, 17,25,39,50, 17,23,33,39),
                     end   = c(23,27,37,55, 25,39,50,60, 23,33,39,48))

The expected output is:

val type weigh labelweight
1    1    A    20    A_19
2    2    A    22    A_19
3    3    A    23    A_23
4    4    A    32    A_27
5    5    A    34    A_27
6    6    A    54    A_37
7    7    B    19    B_17
8    8    B    22    B_17
9    9    B    24    B_17
10  10    B    26    B_25
11  11    B    31    B_25
12  12    B    34    B_25
13  13    B    36    B_25
14  14    B    37    B_25
15  15    B    51    B_50
16  16    B    54    B_50
17  17    C    31    C_23
18  18    C    35    C_33
19  19    C    43    C_39
20  20    C    45    C_39

I am able to get the expected output with the following code:

library(dplyr)

p <- left_join(das, mapper, by = "type")
q <- p %>% filter(weigh >= start & weigh < end) %>% mutate(labelweight = paste0(type, "_", start))

However, on large datasets every variant I have come up with fails with "Error: vector memory exhausted (limit reached?)".
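A likely reason for the memory error is that joining on `type` alone pairs every row of `das` with every `mapper` row of the same type before any filtering happens. The sketch below only counts how many rows that intermediate result would have for the sample data; the variable names `n_das`, `n_map`, and `join_rows` are illustrative and not part of the original question.

```r
# Sample data from the question (types written compactly with rep()).
das <- data.frame(val = 1:20,
                  type = rep(c("A", "B", "C"), times = c(6, 10, 4)),
                  weigh = c(20,22,23,32,34,54,19,22,24,26,31,34,36,37,51,54,31,35,43,45))
mapper <- data.frame(type = rep(c("A", "B", "C"), each = 4),
                     start = c(19,23,27,37, 17,25,39,50, 17,23,33,39),
                     end   = c(23,27,37,55, 25,39,50,60, 23,33,39,48))

# Rows per type in each table.
n_das <- table(das$type)
n_map <- table(mapper$type)

# A join on `type` alone yields, per type, (rows in das) * (rows in mapper).
join_rows <- sum(as.vector(n_das) * as.vector(n_map[names(n_das)]))
join_rows
```

For the sample data this is 6*4 + 10*4 + 4*4 = 80 intermediate rows for only 20 useful ones; with many intervals per type, the same multiplication grows quickly.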

Is there a more efficient way to get the desired output without doing a join?

Recommended answer

The intervals appear to be contiguous. Here is a fast option using a rolling join in data.table:

library(data.table)
setDT(das)[, weight := 
    setDT(mapper)[.SD, on=.(type, start=weigh), roll=Inf, paste(type, x.start, sep="_")]
]
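The rolling join relies on the intervals being contiguous, since it only looks at `start` and never checks `end`. A quick way to test that assumption in base R is sketched here; `is_contiguous` is an illustrative name, and the check assumes the intervals do not overlap within a type.

```r
mapper <- data.frame(type = rep(c("A", "B", "C"), each = 4),
                     start = c(19,23,27,37, 17,25,39,50, 17,23,33,39),
                     end   = c(23,27,37,55, 25,39,50,60, 23,33,39,48))

# Sort so that consecutive rows within a type are neighbouring intervals.
mapper <- mapper[order(mapper$type, mapper$start), ]

# Within each type, every interval's end must equal the next interval's start.
is_contiguous <- sapply(split(mapper, mapper$type), function(m) {
  all(head(m$end, -1) == tail(m$start, -1))
})
is_contiguous
```

If any type comes back `FALSE`, a weight that falls into a gap would still be labelled with the preceding interval by the rolling join, so the contiguity assumption matters.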

If the intervals are not contiguous, you can use a non-equi join:

setDT(das)[, weight := 
    setDT(mapper)[setDT(das), on=.(type, start<=weigh, end>weigh), paste(type, x.start, sep="_")]        
]

Output:

    val type weigh weight
 1:   1    A    20   A_19
 2:   2    A    22   A_19
 3:   3    A    23   A_23
 4:   4    A    32   A_27
 5:   5    A    34   A_27
 6:   6    A    54   A_37
 7:   7    B    19   B_17
 8:   8    B    22   B_17
 9:   9    B    24   B_17
10:  10    B    26   B_25
11:  11    B    31   B_25
12:  12    B    34   B_25
13:  13    B    36   B_25
14:  14    B    37   B_25
15:  15    B    51   B_50
16:  16    B    54   B_50
17:  17    C    31   C_23
18:  18    C    35   C_33
19:  19    C    43   C_39
20:  20    C    45   C_39
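For readers who want to avoid a join altogether, as the question asks, the following is a minimal base R sketch using `findInterval`. It is an alternative technique, not the method from the answer above, and like the rolling join it ignores `end`, so it assumes contiguous intervals. The helper name `starts_by_type` is illustrative.

```r
das <- data.frame(val = 1:20,
                  type = rep(c("A", "B", "C"), times = c(6, 10, 4)),
                  weigh = c(20,22,23,32,34,54,19,22,24,26,31,34,36,37,51,54,31,35,43,45))
mapper <- data.frame(type = rep(c("A", "B", "C"), each = 4),
                     start = c(19,23,27,37, 17,25,39,50, 17,23,33,39),
                     end   = c(23,27,37,55, 25,39,50,60, 23,33,39,48))

# Interval start points, grouped by type.
starts_by_type <- split(mapper$start, mapper$type)

das$labelweight <- NA_character_
for (t in names(starts_by_type)) {
  rows <- das$type == t
  s <- sort(starts_by_type[[t]])
  # Index of the last start <= weigh; 0 means weigh lies below the first start.
  i <- findInterval(das$weigh[rows], s)
  i[i == 0] <- NA
  das$labelweight[rows] <- ifelse(is.na(i), NA_character_, paste(t, s[i], sep = "_"))
}
das
```

This loops once per type rather than once per row, and never materialises the cross product of `das` and `mapper`.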
