Find the longest match of 2 integers in R
Question
I have 2 lists of numbers and I need to match the values of one list against the other. The match is based on the leading digits of the number, and it has to return the row_id of the longest possible match.
lookup value: 12345678
find_list:
a 1
b 12
c 123
d 124
e 125
f 1234
g 1235
In this example the lookup value matches a, b, c and f, and R must return f, since f is the longest and therefore the best match.
I am currently using the startsWith function in R and, from the matches it returns, picking the longest value. The problem is that the lists are huge: I have 18.5 million lookup values and 300,000 possible values in find_list, and R crashes after a while.
Is there a smarter way to do this?
Recommended answer
Here is one method in base R.
# construct a vector of all possible matches for the lookup value
lookupVec <- floor(lookup * (10 ^ (-1 * (0:(nchar(lookup)-1)))))
This returns
lookupVec
[1] 12345678 1234567 123456 12345 1234 123 12 1
# match() gives, for each value in dat$V2, its position in lookupVec (NA if absent);
# lookupVec is ordered longest first, so the smallest position is the longest match
dat$V1[which.min(match(dat$V2, lookupVec))]
[1] f
Levels: a b c d e f g
You can probably speed this up by replacing base R's match function with the fmatch function from the fastmatch package, which hashes the table values and reuses that hash if you search the same table a second time.
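As a rough sketch of that suggestion, the snippet below swaps in fastmatch::fmatch when the package is installed and falls back to base match otherwise. Note one assumption that differs from the code above: here the prefix table is passed as the second (table) argument, because that is the vector that stays the same across lookups and whose hash can be reused. The names match_fn, prefixes, ids, pos and best are illustrative only.

```r
# Use fastmatch::fmatch when available, otherwise fall back to base match()
match_fn <- if (requireNamespace("fastmatch", quietly = TRUE)) {
  fastmatch::fmatch
} else {
  match
}

# Illustrative copy of the example data (doubles, to match the type of lookupVec)
prefixes <- c(1, 12, 123, 124, 125, 1234, 1235)
ids <- c("a", "b", "c", "d", "e", "f", "g")
lookup <- 12345678

# All leading-digit truncations of the lookup value, longest first
lookupVec <- floor(lookup * 10^(-(0:(nchar(lookup) - 1))))

# Position of each truncation in the prefix table (NA where it is not a prefix)
pos <- match_fn(lookupVec, prefixes)

# The first non-NA position belongs to the longest matching prefix
best <- ids[pos[!is.na(pos)][1]]
best
```

Because lookupVec is ordered from longest to shortest, taking the first non-missing position is equivalent to taking the longest match.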
Data
dat <-
structure(list(V1 = structure(1:7, .Label = c("a", "b", "c",
"d", "e", "f", "g"), class = "factor"), V2 = c(1L, 12L, 123L,
124L, 125L, 1234L, 1235L)), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -7L))
lookup <- 12345678
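For millions of lookup values, one possible way to avoid a per-value loop is to strip one trailing digit from every lookup at a time and run a single vectorised match per pass, so the number of passes is bounded by the number of digits rather than the number of lookups. This is only a sketch, not part of the answer above: longest_prefix_id is a hypothetical helper, and it assumes all values are positive whole numbers.

```r
# Hypothetical helper: longest-prefix match for many lookup values at once.
# lookups:  numeric vector of positive whole numbers to look up
# prefixes: numeric vector of candidate prefixes
# ids:      character vector with the row id of each prefix
longest_prefix_id <- function(lookups, prefixes, ids) {
  result <- rep(NA_character_, length(lookups))
  current <- lookups
  # Each pass tries the current truncation, then drops one trailing digit
  while (any(current > 0 & is.na(result))) {
    todo <- which(is.na(result) & current > 0)
    hit <- match(current[todo], prefixes)
    found <- !is.na(hit)
    # The first hit for a value is its longest prefix, so it is never overwritten
    result[todo[found]] <- ids[hit[found]]
    current <- current %/% 10
  }
  result
}

prefixes <- c(1, 12, 123, 124, 125, 1234, 1235)
ids <- c("a", "b", "c", "d", "e", "f", "g")
res <- longest_prefix_id(c(12345678, 1259, 999), prefixes, ids)
res
# [1] "f" "e" NA
```

Lookup values with no matching prefix come back as NA. With 8-digit lookups this needs at most 8 calls to match, each over the full vector.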