What's the difference between the str_detect function in stringr and grepl and grep?

Question

I'm starting to do a lot of string matching in my work, and I'm curious what the differences between these three functions are and in what situations someone would use one over the others.

Recommended answer

stringr is "a consistent, simple and easy to use set of wrappers around the fantastic 'stringi' package" (from the package description). The main advantage of stringi over base R is its speed. For the calls shown here, the base functions and their stringr counterparts return the same output.

I use stringi to generate some random text for demonstration:

library(stringr)
sample_small <- stringi::stri_rand_lipsum(100)

grep returns the positions of the elements of a character vector that match a pattern, just as its stringr equivalent str_which does:

grep("Lorem", sample_small)
#> [1]  1  9 14 32 45 50 65 93 94
str_which(sample_small, "Lorem")
#> [1]  1  9 14 32 45 50 65 93 94
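
Because the random text above differs from run to run, the correspondence is easier to check on a small hand-written vector. This is a minimal sketch; the vector `fruits` is a made-up example and not part of the original answer. Note the swapped argument order: base takes the pattern first, stringr takes the string first.

```r
library(stringr)

# Hypothetical toy input, chosen so the matches are easy to verify by eye
fruits <- c("apple pie", "banana", "cherry", "pineapple")

# Base R: pattern first, then the character vector
base_idx <- grep("apple", fruits)

# stringr: character vector first, then the pattern
stringr_idx <- str_which(fruits, "apple")

base_idx
#> [1] 1 4
identical(base_idx, stringr_idx)
#> [1] TRUE
```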

grepl and str_detect, on the other hand, return a logical value for each element of the vector, indicating whether it contains the pattern.

grepl("Lorem", sample_small)
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
str_detect(sample_small, "Lorem")
#>   [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
#>  [12] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE
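
One place where the two can differ, as far as I am aware, is missing values: grepl reports FALSE for an NA element, while str_detect returns NA. The sketch below uses a made-up vector `x` (not from the original answer) and also shows that grep is equivalent to which(grepl(...)).

```r
library(stringr)

# Hypothetical toy input with one missing value
x <- c("Lorem ipsum", "dolor sit", NA)

# Base R treats a missing element as "no match"
base_res <- grepl("Lorem", x)
base_res
#> [1]  TRUE FALSE FALSE

# stringr propagates the missing value
stringr_res <- str_detect(x, "Lorem")
stringr_res
#> [1]  TRUE FALSE    NA

# grep() and which(grepl()) give the same positions
identical(grep("Lorem", x), which(base_res))
#> [1] TRUE
```

This matters mostly when the result is used to build a new column or a filter condition, where an NA may need to be handled explicitly.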

There are many scenarios where this difference in output matters. I usually use grepl when I want to add a new column to a data.frame that records whether another column contains a pattern. grepl makes this easy because its result has the same length as the input vector:

df <- data.frame(sample = sample_small,
                 stringsAsFactors = FALSE)
df$lorem <- grepl("Lorem", sample_small)
df$ipsum <- grepl("ipsum", sample_small)

This way, some more elaborate tests are possible:

which(df$lorem & df$ipsum)
#> [1]  1  5 15 53 71 75

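Or directly as a filter rule (filter() comes from dplyr; note that str_detect takes the string first and the pattern second):

library(dplyr)
df %>% 
  filter(str_detect(sample, "Lorem") & str_detect(sample, "ipsum"))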
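
The same row selection can also be written with base subsetting only. A minimal sketch on a made-up three-row data frame (`toy` and its contents are hypothetical; only the column name `sample` mirrors the data frame above):

```r
# Hypothetical toy data frame with a text column named `sample`
toy <- data.frame(sample = c("Lorem ipsum", "Lorem dolor", "ipsum amet"),
                  stringsAsFactors = FALSE)

# Keep only the rows whose text contains both patterns
both <- toy[grepl("Lorem", toy$sample) & grepl("ipsum", toy$sample), , drop = FALSE]
both$sample
#> [1] "Lorem ipsum"
```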

As for why to use stringr over base, I think there are two arguments. First, the different argument order (string first, pattern second) makes stringr a little easier to use with pipes:

library(dplyr)
sample_small %>% 
  str_detect("Lorem")

compared to:

sample_small %>% 
  grepl("Lorem", .) 
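
If the `.` placeholder feels clumsy, one alternative is to name the pattern argument, so that the piped value falls into the remaining `x` argument. This is a sketch assuming the magrittr pipe (the `%>%` used above); the vector `x` is a made-up example.

```r
library(magrittr)

# Hypothetical toy input
x <- c("Lorem ipsum", "dolor sit")

# The pipe passes x as the first unnamed argument; because `pattern`
# is supplied by name, x is matched to grepl()'s second formal, `x`
piped_named <- x %>% grepl(pattern = "Lorem")

# The placeholder version from the text, for comparison
piped_dot <- x %>% grepl("Lorem", .)

identical(piped_named, piped_dot)
#> [1] TRUE
piped_named
#> [1]  TRUE FALSE
```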

Second, in the benchmarks below stringr is roughly 5x faster than base (for the two functions we are looking at):

sample_big <- stringi::stri_rand_lipsum(100000)
bench::mark(
  base = grep("Lorem", sample_big),
  stringr = str_which(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          674ms    674ms      1.48     415KB        0
#> 2 stringr       141ms    142ms      6.99     806KB        0


bench::mark(
  base = grepl("Lorem", sample_big),
  stringr = str_detect(sample_big, "Lorem")
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          679ms    679ms      1.47     391KB        0
#> 2 stringr       146ms    148ms      6.76     391KB        0

The difference is even more striking when we look for exact (fixed) matches instead of regular expressions, which are the default:

bench::mark(
  base = grepl("Lorem", sample_big, fixed = TRUE),
  stringr = str_detect(sample_big, fixed("Lorem"))
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base          336ms  338.1ms      2.96     391KB        0
#> 2 stringr      12.4ms   12.6ms     79.1      417KB        0
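
Fixed matching is not only faster; it can also change the result whenever the pattern contains regex metacharacters. A minimal sketch with a made-up vector `x` (not from the original answer):

```r
library(stringr)

# Hypothetical toy input: "." is a regex metacharacter matching any character
x <- c("a.c", "abc")

regex_res <- grepl("a.c", x)                 # regex: "." matches any character
regex_res
#> [1] TRUE TRUE

fixed_base <- grepl("a.c", x, fixed = TRUE)  # literal: only a real "." matches
fixed_base
#> [1]  TRUE FALSE

fixed_stringr <- str_detect(x, fixed("a.c")) # stringr spelling of the same match
fixed_stringr
#> [1]  TRUE FALSE
```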

However, I think the base functions have a certain charm, which is why I still often use them when writing code quickly. The option fixed = TRUE is one example: wrapping fixed() around the pattern feels a little awkward to me. Other examples are the option value = TRUE in grep (I'll let you figure that one out yourself) and ignore.case = TRUE, which again looks a little awkward in stringr:

str_which(sample_small, regex("Lorem", ignore_case = TRUE))
#>  [1]  1  5  6  8  9 11 12 14 15 17 22 27 30 32 34 35 42 48 51 53 58 64 69
#> [24] 74 76 80 83 86 89 91 92 94 97

That said, the reason this feels awkward to me is probably just that I used base R for a while before learning stringr.
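
For reference, the two base options mentioned above can be put side by side with what I believe are their closest stringr counterparts. The vector `x` is a made-up example; str_subset is the stringr function that returns the matching elements themselves.

```r
library(stringr)

# Hypothetical toy input
x <- c("Lorem ipsum", "lorem dolor", "sit amet")

# value = TRUE returns the matching elements instead of their positions
base_values <- grep("Lorem", x, value = TRUE)
stringr_values <- str_subset(x, "Lorem")
base_values
#> [1] "Lorem ipsum"
identical(base_values, stringr_values)
#> [1] TRUE

# Case-insensitive matching in both spellings
base_ci <- grep("lorem", x, ignore.case = TRUE)
stringr_ci <- str_which(x, regex("lorem", ignore_case = TRUE))
base_ci
#> [1] 1 2
identical(base_ci, stringr_ci)
#> [1] TRUE
```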

Another point to consider is that stringi offers even more features overall. So if you are determined to get into string manipulation, you might start learning that package right away, although there are admittedly fewer tutorials and some things may be a bit tougher to figure out.
