What's the difference between the str_detect function in stringr and grepl and grep?
Question
I'm starting to do a lot of string matching in my work, and I'm curious what the differences between these three functions are and in which situations someone would use one over the other.
Recommended answer
stringr is "a consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi is its speed compared to base R. For the functions discussed here, the output is the same in base as in stringr.
I use stringi to generate some random text for demonstration:
library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)
grep returns the positions of the elements in the character vector that match a pattern, just as its equivalent str_which does:
grep("Lorem", sample_small)
#> [1] 1 9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1] 1 9 14 32 45 50 65 93 94
grepl/str_detect, on the other hand, tell you for each element of the vector whether it contains the pattern or not:
grepl("Lorem", sample_small)
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#> [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#> [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
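Because the lipsum text above is random, the positions will differ from run to run. As a minimal sketch, a small fixed vector (the name `x` and its contents are made up for illustration) makes the relationship between the four functions easy to check by eye:

```r
library(stringr)

# Hypothetical input, chosen so the result is easy to verify
x <- c("Lorem ipsum", "dolor sit", "amet Lorem")

grep("Lorem", x)        # integer positions of matching elements: 1 3
str_which(x, "Lorem")   # same positions: 1 3

grepl("Lorem", x)       # one logical per element: TRUE FALSE TRUE
str_detect(x, "Lorem")  # same logical vector

# The two views are interchangeable via which()
identical(which(grepl("Lorem", x)), grep("Lorem", x))
```

The positional form is convenient for indexing, while the logical form lines up element by element with the input.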
There are many scenarios where this difference in output matters. I usually use grepl when I want to add a new column to a data.frame that records whether another column contains a pattern. grepl makes this easier because its result has the same length as the input:
df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)
This way, some more elaborate tests are possible:
which(df$lorem & df$ipsum)
#> [1] 1 5 15 53 71 75
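Or directly as a filter rule (this needs dplyr for %>% and filter()):
library(dplyr)
# str_detect() takes the string first and the pattern second
df %>%
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))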
Now, as for why to use stringr over base, I think there are two arguments. The different argument order (string first, pattern second) makes stringr a little easier to use with pipes:
library(dplyr)
sample_small %>%
str_detect("Lorem")
compared to:
sample_small %>%
grepl("Lorem", .)
And stringr is roughly 5x faster than base (for the two functions we are looking at):
sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
base = grep("Lorem", sample_big),
stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 674ms 674ms 1.48 415KB 0
#> 2 stringr 141ms 142ms 6.99 806KB 0
bench::mark(
base = grepl("Lorem", sample_big),
stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 679ms 679ms 1.47 391KB 0
#> 2 stringr 146ms 148ms 6.76 391KB 0
The difference is even more striking when we look for exact (fixed-string) matches; the default is to treat the pattern as a regular expression:
bench::mark(
base = grepl("Lorem", sample_big, fixed = TRUE),
stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 base 336ms 338.1ms 2.96 391KB 0
#> 2 stringr 12.4ms 12.6ms 79.1 417KB 0
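Fixed matching is not only faster, it can also change the result. The following is a small sketch on a made-up vector `x` (not part of the original benchmark) where the pattern contains a regex metacharacter:

```r
library(stringr)

# Hypothetical input: only the first element contains a literal "a.b"
x <- c("a.b", "axb")

# As a regular expression, "." matches any character, so both elements match
grepl("a.b", x)                 # TRUE TRUE
str_detect(x, "a.b")            # TRUE TRUE

# As a fixed string, "." is a literal dot, so only the first element matches
grepl("a.b", x, fixed = TRUE)   # TRUE FALSE
str_detect(x, fixed("a.b"))     # TRUE FALSE
```

For a plain word like "Lorem" both modes return the same result, so there the choice mainly affects speed.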
However, I think the base functions have a certain charm, which is why I still often use them when writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels a little awkward to me. Other examples are the option value = TRUE in grep (I'll let you figure that one out yourself) and finally ignore.case = TRUE, which again looks a little awkward in stringr:
str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#> [1] 1 5 6 8 9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97
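For readers who would rather not work it out, here is a hedged side-by-side sketch of these base options and what I understand to be their stringr counterparts, again on a made-up vector `x`:

```r
library(stringr)

# Hypothetical input for illustration
x <- c("Lorem ipsum", "dolor sit", "amet lorem")

# value = TRUE returns the matching elements themselves, not their positions
grep("Lorem", x, value = TRUE)                    # "Lorem ipsum"
str_subset(x, "Lorem")                            # "Lorem ipsum"

# Case-insensitive matching in both styles
grep("lorem", x, ignore.case = TRUE)              # 1 3
str_which(x, regex("lorem", ignore_case = TRUE))  # 1 3
```

In base R the behaviour is switched through arguments of grep, whereas stringr uses a separate function (str_subset) or a modifier wrapped around the pattern (regex(), fixed()).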
However, the reason this feels awkward to me is probably just that I used base R for a while before learning stringr.
Another point to consider is that stringi offers even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and some things might be a bit tougher to figure out.
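As a rough orientation, the stringi functions below appear to be the direct counterparts of the stringr functions used in this answer. This is a sketch on a made-up vector `x`, not a complete mapping:

```r
library(stringi)

# Hypothetical input for illustration
x <- c("Lorem ipsum", "dolor sit", "amet Lorem")

stri_detect_regex(x, "Lorem")         # like str_detect(x, "Lorem")
stri_detect_fixed(x, "Lorem")         # like str_detect(x, fixed("Lorem"))
which(stri_detect_regex(x, "Lorem"))  # positions, like str_which(x, "Lorem")
stri_subset_regex(x, "Lorem")         # matching elements, like str_subset(x, "Lorem")
```

Note that stringi encodes the matching mode in the function name (`_regex`, `_fixed`) instead of wrapping the pattern in a modifier.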