gsub speed vs pattern length

Question

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here is a fully reproducible example:

library(microbenchmark)
set.seed(12345)
rpt = seq(20, 1461, 20)            # pattern lengths to test, in characters
msecFF = numeric(length(rpt))      # mean time in msec, fixed = FALSE
msecFT = numeric(length(rpt))      # mean time in msec, fixed = TRUE
inp = rep("aaaaaaaaaa", 15000)

for (n in seq_along(rpt)) {
  print(n)
  # pattern: the letter "a" repeated rpt[n] times
  patt = paste(rep("a", rpt[n]), collapse = "")
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed = FALSE), times = 10)
  msecFF[n] = mean(timeFF$time) / 1e6   # nanoseconds to milliseconds

  timeFT = microbenchmark(gsub(patt, "b", inp, fixed = TRUE), times = 10)
  msecFT[n] = mean(timeFT$time) / 1e6
}

library(ggplot2)
library(gridExtra)

# one panel per matching mode: mean time against pattern length
p1 = qplot(rpt, msecFT, xlab = "pattern length, characters", ylab = "time, msec", main = "fixed = TRUE")
p2 = qplot(rpt, msecFF, xlab = "pattern length, characters", ylab = "time, msec", main = "fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)

As you can see, each pattern is the letter a repeated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at about 300 characters with fixed = TRUE and at about 600 characters with fixed = FALSE, after which the slope seems to be roughly the same as before (visible in the plots produced by the code above). I suppose it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 characters, with an object size of 1552 bytes.
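One thing that may be worth keeping in mind when reading these timings: every element of inp is 10 characters long, while the shortest pattern tested is 20 characters, so no replacement ever takes place. A minimal sketch of that check, using the same mock-up values as above and only the shortest pattern:

```r
# Same mock-up as in the benchmark, reduced to a single pattern length.
inp  <- rep("aaaaaaaaaa", 15000)              # every element is 10 characters
patt <- paste(rep("a", 20), collapse = "")    # shortest pattern tested: 20 characters

# A 10-character string cannot contain a 20-character literal,
# so gsub() should return its input unchanged.
any_match <- any(grepl(patt, inp, fixed = TRUE))
unchanged <- identical(gsub(patt, "b", inp, fixed = TRUE), inp)
print(c(any_match = any_match, unchanged = unchanged))
```

If that holds, the benchmark measures pattern setup plus an unsuccessful search, not the cost of actual substitutions.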

Can anyone give a better explanation for the kinks, and for why they appear at 300 and 600 characters?

Added: it is worth mentioning that most of my patterns are 5-10 characters long. On my real data (not the mock-up inp in the example above) this gives the following timings.

gsub, fixed = TRUE: ~50 msec per pattern
gsub, fixed = FALSE: ~190 msec per pattern
stringi, fixed = FALSE: ~55 msec per pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per pattern

(I have 4k patterns, so the total run time of my module is roughly 200 sec, i.e. 0.05 sec x 4000 with gsub and fixed = TRUE. That is the fastest method for my data and patterns.)
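For illustration, applying many literal patterns one after another might look like the sketch below. The vectors x and patterns and the helper replace_all are hypothetical stand-ins, not the real data or module:

```r
# Hypothetical stand-ins for the real data and the ~4k patterns.
x        <- c("foo bar baz", "bar foo", "nothing here")
patterns <- c("foo", "bar")

# Apply each literal pattern in turn with fixed = TRUE,
# so each pattern is matched as a plain string, not as a regex.
replace_all <- function(x, patterns, replacement = "") {
  for (p in patterns) {
    x <- gsub(p, replacement, x, fixed = TRUE)
  }
  x
}

res <- replace_all(x, patterns, "X")
print(res)

# Rough total run time from the figures above: 4000 patterns at ~0.05 sec each.
print(4000 * 0.05)
```

Because the patterns are applied sequentially, the total time is simply the per-pattern time multiplied by the number of patterns, which is where the 200 sec estimate comes from.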

Recommended answer

The kinks might be related to the number of bits required to hold patterns of that length.

There is another solution that scales much better: use the repetition operator {} to specify how many repeats you want to find. To match more than 255 repeats (the 8-bit integer maximum), you'll have to specify perl = TRUE.

# quantified pattern, e.g. "a{20}" when rpt[n] is 20, instead of a literal run of a's
patt2 <- paste0("a{", rpt[n], "}")
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = TRUE), times = 10)

I get speeds of around 2.1 ms per search, with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.
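Before swapping one form for the other, it may be worth confirming that the quantified pattern gives the same result as the literal one. A minimal sketch, assuming a test input x that is long enough to contain matches (unlike the 10-character inp from the question); the values of x and n are illustrative only:

```r
# Illustrative input: one string with room for two matches, one just too short, one unrelated.
x <- c(strrep("a", 700), strrep("a", 299), "abc")
n <- 300

literal    <- strrep("a", n)           # "aaa...a", 300 characters
quantified <- paste0("a{", n, "}")     # "a{300}", needs perl = TRUE because n > 255

out_literal    <- gsub(literal, "b", x, fixed = TRUE)
out_quantified <- gsub(quantified, "b", x, perl = TRUE)

print(identical(out_literal, out_quantified))
print(nchar(out_quantified))
```

In the 700-character string both forms replace two non-overlapping runs of 300 and leave the remaining 100 characters, while the 299-character string is left untouched.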
