Efficient way of labelling based on start and end position
Question
I have two data frames:
das <- data.frame(val=1:20,
type =c("A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","C","C"),
weigh=c(20,22,23,32,34,54,19,22,24,26,31,34,36,37,51,54,31,35,43,45))
mapper <- data.frame(type=c("A","A","A","A","B","B","B","B","C","C","C","C"),start = c(19,23,27,37 ,17,25,39,50, 17,23,33,39),end = c(23,27,37,55 ,25,39,50,60, 23,33,39,48))
The expected output is:
val type weigh labelweight
1 1 A 20 A_19
2 2 A 22 A_19
3 3 A 23 A_23
4 4 A 32 A_27
5 5 A 34 A_27
6 6 A 54 A_37
7 7 B 19 B_17
8 8 B 22 B_17
9 9 B 24 B_17
10 10 B 26 B_25
11 11 B 31 B_25
12 12 B 34 B_25
13 13 B 36 B_25
14 14 B 37 B_25
15 15 B 51 B_50
16 16 B 54 B_50
17 17 C 31 C_23
18 18 C 35 C_33
19 19 C 43 C_39
20 20 C 45 C_39
I am able to get the expected output with the following code:
library(dplyr)
p <- left_join(das, mapper, by = "type")
q <- p %>% filter(weigh >= start & weigh < end) %>% mutate(labelweight = paste0(type, "_", start))
With large datasets, however, whatever code I come up with throws "Error: vector memory exhausted (limit reached?)".
Is there a more efficient way of getting the desired output without doing a join?
Recommended answer
The intervals appear to be contiguous. Here is a fast option using a rolling join in data.table:
library(data.table)
setDT(das)[, weight :=
setDT(mapper)[.SD, on=.(type, start=weigh), roll=Inf, paste(type, x.start, sep="_")]
]
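To see what `roll=Inf` is doing, here is a minimal sketch on made-up data. The tables `lookup` and `obs` and their values are illustrative assumptions, not part of the question. For each `weigh`, the rolling join picks the last row of the same `type` whose `start` is at or below it, and `x.start` returns that row's own `start`.

```r
library(data.table)

# Hypothetical toy data: one type, three contiguous intervals starting at 10, 20, 30.
lookup <- data.table(type = "A", start = c(10, 20, 30))
obs    <- data.table(type = "A", weigh = c(10, 19, 20, 35))

# Rolling join: match on type, then roll the last start <= weigh forward.
# x.start is the start column of lookup (the "x" table in x[i]).
labels <- lookup[obs, on = .(type, start = weigh), roll = Inf,
                 paste(type, x.start, sep = "_")]

# Add the labels to obs by reference.
obs[, label := labels]
print(obs)
```

Note that `end` plays no part here: the rolling join relies on each interval ending where the next one begins, so a value past the last `end` would still receive the last label.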
If the intervals are not contiguous, you can use a non-equi join instead:
setDT(das)[, weight :=
setDT(mapper)[setDT(das), on=.(type, start<=weigh, end>weigh), paste(type, x.start, sep="_")]
]
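When the intervals have gaps, a value that falls in a gap matches no row, and its `x.start` comes back as `NA`, which `paste()` would turn into a string ending in `_NA`. The sketch below, on made-up data (`lookup`, `obs` and their values are assumptions for illustration), keeps such rows as a real `NA` instead. It also assumes the intervals do not overlap, so each row matches at most once.

```r
library(data.table)

# Hypothetical toy data: two intervals [10, 20) and [30, 40) with a gap between them.
lookup <- data.table(type = "A", start = c(10, 30), end = c(20, 40))
obs    <- data.table(type = "A", weigh = c(12, 25, 30))

# Non-equi join: keep the lookup row where start <= weigh and end > weigh.
# Unmatched rows of obs yield NA for x.start.
matched_start <- lookup[obs, on = .(type, start <= weigh, end > weigh), x.start]

# Build the label only where a match was found; leave NA otherwise.
obs[, label := ifelse(is.na(matched_start), NA_character_,
                      paste(type, matched_start, sep = "_"))]
print(obs)
```

The answer's own result on the question's data follows.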
Output (the new column is named weight here, rather than labelweight as in the question):
val type weigh weight
1: 1 A 20 A_19
2: 2 A 22 A_19
3: 3 A 23 A_23
4: 4 A 32 A_27
5: 5 A 34 A_27
6: 6 A 54 A_37
7: 7 B 19 B_17
8: 8 B 22 B_17
9: 9 B 24 B_17
10: 10 B 26 B_25
11: 11 B 31 B_25
12: 12 B 34 B_25
13: 13 B 36 B_25
14: 14 B 37 B_25
15: 15 B 51 B_50
16: 16 B 54 B_50
17: 17 C 31 C_23
18: 18 C 35 C_33
19: 19 C 43 C_39
20: 20 C 45 C_39
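Since the question asks for a way to avoid a join altogether, one possible alternative is base R's `findInterval()`, applied per `type`. This is a different technique from the data.table answer above, shown only as a sketch. Like the rolling join, it assumes the intervals within each type are contiguous and non-overlapping, and it ignores `end`.

```r
# Data from the question, written compactly with rep().
das <- data.frame(val = 1:20,
                  type = rep(c("A", "B", "C"), times = c(6, 10, 4)),
                  weigh = c(20, 22, 23, 32, 34, 54, 19, 22, 24, 26,
                            31, 34, 36, 37, 51, 54, 31, 35, 43, 45))
mapper <- data.frame(type = rep(c("A", "B", "C"), each = 4),
                     start = c(19, 23, 27, 37, 17, 25, 39, 50, 17, 23, 33, 39),
                     end   = c(23, 27, 37, 55, 25, 39, 50, 60, 23, 33, 39, 48))

# One vector of interval starts per type.
starts_by_type <- split(mapper$start, mapper$type)

das$labelweight <- NA_character_
for (tp in names(starts_by_type)) {
  rows <- which(das$type == tp)
  s <- sort(starts_by_type[[tp]])            # findInterval needs sorted breakpoints
  idx <- findInterval(das$weigh[rows], s)    # index of the last start <= weigh, 0 if none
  idx[idx == 0] <- NA                        # below the first start: no label
  das$labelweight[rows] <- ifelse(is.na(idx), NA_character_,
                                  paste(tp, s[idx], sep = "_"))
}
print(das)
```

This never materialises the type-by-type cross product that the `left_join` followed by `filter` builds, which is the likely source of the memory error on large inputs.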