Fast count of digits in a string, in R

Question

Is there a more efficient way to count the occurrences of the most frequent digit in a string? My R code below calls gsub() 10 times for each string, and I have gazillions of strings to process.

> txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
> max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', txt)), integer(1L)))
[1] 12

I don't care about the digit itself. I just want the count of the most frequent one.

I would prefer to use R's core packages, unless some external package offers a significant performance gain. I use x64 R version 3.4.1 (2017-06-30) on Windows 10.

Update:

Here is an apples-to-apples performance comparison of the excellent suggestions below.

> microbenchmark(
+     original = max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', s)), integer(1L))),
+     strsplit = max(table(unlist(strsplit(gsub("\\D+", "", s), "")))),
+     gregexpr = max(vapply(0:9, function(d) sum(unlist(gregexpr(d, s)) > 0), integer(1L))),
+     stringi = max(vapply(0:9, function(x) stri_count_fixed(s, x), integer(1L))),
+     raw=max(vapply(0x30:0x39, function(x) sum(charToRaw(s)==x), integer(1L))),
+     tabulate = max(tabulate(as.integer(charToRaw(paste('a',s))))[48:57]),
+     times=1000L)
Unit: microseconds
     expr     min       lq      mean   median       uq      max neval
 original 476.172 536.9770 567.86559 554.8600 580.0530 8054.805  1000
 strsplit 366.071 422.3660 448.69815 445.3810 469.6410  798.389  1000
 gregexpr 302.622 345.2325 423.08347 360.3170 378.0455 9082.416  1000
  stringi 112.589 135.2940 149.82411 144.6245 155.1990 3910.770  1000
      raw  58.161  71.5340  83.57614  77.1330  82.1090 6249.642  1000
 tabulate  18.039  29.8575  35.20816  36.3890  40.7430   72.779  1000
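
A note on the tabulate entry in the benchmark: it prepends 'a' to s before counting. The original does not say why. A plausible reading is that tabulate() returns only as many bins as the largest value it sees, so a string whose highest byte is below 57 would yield NA when indexed with [48:57]; adding 'a' (byte value 97) guarantees that all ten digit bins exist. The sketch below illustrates that behaviour under this assumption; the variable names bins and padded are illustrative only.

```r
# A string made only of the digit 2: its highest byte value is 50.
s <- '22222222222'

# Without padding, tabulate() stops at bin 50, so bins 51..57 are missing.
bins <- tabulate(as.integer(charToRaw(s)))
length(bins)
anyNA(bins[48:57])

# With 'a' prepended (byte value 97), bins 48..57 all exist.
padded <- tabulate(as.integer(charToRaw(paste('a', s))))
length(padded)
max(padded[48:57])
```

With the padding in place, max() needs no na.rm = TRUE, which may be part of why this variant is the fastest in the table.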

Why the odd calculation?

This odd formula helps identify some plainly fake-looking identifiers entered by users. For example, some non-creative users (I'm guilty of this as well) fill in the same digit over and over for their phone numbers. Frequently, in data analysis, it is better to have no phone number at all than a fake phone number that changes from one dataset to another. Naturally, if there is a check digit, validating it would be an additional easy check.
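
As a rough illustration of the use case described above, the following sketch flags phone-number strings in which a single digit dominates. Everything beyond the counting idea is an assumption for illustration: the helper name max_digit_count, the sample phone numbers, and the 0.8 threshold are hypothetical and would need tuning on real data.

```r
# Hypothetical helper: count of the most frequent digit in each string.
# nbins = 57L makes tabulate() always return bins 1..57, so [48:57] has no NA.
max_digit_count <- function(x) {
  vapply(x, function(s) {
    counts <- tabulate(as.integer(charToRaw(s)), nbins = 57L)
    max(counts[48:57])
  }, integer(1L), USE.NAMES = FALSE)
}

# Made-up sample inputs.
phones <- c("555-555-5555", "212-867-5309", "111 111 1111")

top_count <- max_digit_count(phones)
n_digits  <- nchar(gsub("\\D", "", phones))  # total digits per string

# Assumed rule: suspicious if one digit makes up at least 80% of all digits.
suspicious <- top_count / n_digits >= 0.8
suspicious
```

A string with no digits at all gives 0 / 0 here, so the flag comes out as NA and should be handled separately.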

Recommended answer

Using charToRaw to count digits in a string:

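# To count only digits, keep the bins for ASCII codes 48 to 57, which are the characters 0 to 9 (see https://ascii.cl/).
# tabulate() returns only as many bins as the highest byte value in the string, so [48:57] contains NA
# when that value is below 57 (for example, a string made only of the digit 2); na.rm = TRUE drops those NAs.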
txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 12

txt='22222222222'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 11
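
Since the question involves a very large number of strings, here is a minimal sketch of how the same idea might be wrapped for a character vector. It is a variation on the answer's code, not the answer itself: it passes nbins = 57L to tabulate() so that bins 48 to 57 always exist (values above 57 are silently ignored), which removes the need for na.rm = TRUE and returns 0 for strings without digits. The function name max_digit_count is hypothetical.

```r
# Hypothetical wrapper: count of the most frequent digit, per string.
max_digit_count <- function(x) {
  vapply(x, function(s) {
    # Fixed number of bins: bytes above 57 are ignored, missing bins are 0.
    counts <- tabulate(as.integer(charToRaw(s)), nbins = 57L)
    max(counts[48:57])
  }, integer(1L), USE.NAMES = FALSE)
}

res <- max_digit_count(c('wow:011 test 234567, abc=8951111111111aaa',
                         '22222222222',
                         'no digits',
                         ''))
res
# 12 11 0 0
```

Whether this is faster or slower than the padded variant in the benchmark has not been measured here.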

@Andrew has already run a benchmark, which shows that the charToRaw approach is the fastest of the methods compared for counting digits in a string.


If you do not care whether the character is a digit and just want the count of the most frequent character of any kind, remove the [48:57] filter on the ASCII codes.

txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 32

txt='22222222222'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 11
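
One caveat when the [48:57] filter is removed: charToRaw() works on bytes, not characters. For plain ASCII input the two coincide, but in UTF-8 text several different characters can share a byte. The sketch below, which is an addition and not part of the original answer, contrasts the byte count with a code-point count using base R's utf8ToInt(); the sample string is made up.

```r
# Three different accented letters; in UTF-8 each is two bytes starting with 0xC3.
txt <- enc2utf8("\u00e9\u00e8\u00ea")

# Byte-based count: the shared lead byte 0xC3 is counted three times.
byte_max <- max(tabulate(as.integer(charToRaw(txt))))

# Code-point-based count: each character occurs once.
char_max <- max(tabulate(utf8ToInt(txt)))

c(byte_max, char_max)
# 3 1
```

Digit counting with [48:57] is not affected, because bytes 48 to 57 never occur inside a multi-byte UTF-8 sequence.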
