R grep: Match one string against multiple patterns


Question

In R, grep usually matches a vector of multiple strings against one regexp.

Q: Is it possible to match a single string against multiple regexps, without looping over each regexp pattern?

Some background:

I have 7000+ keywords that serve as indicators for several categories. I cannot change that keyword dictionary. The dictionary has the following structure (keywords in column 1; the numbers indicate the categories each keyword belongs to):

ab  10  37  41
abbrach*    38
abbreche    39
abbrich*    39
abend*  37
abendessen* 60  63
aber    20  23  45
abermals    37

Concatenating that many keywords with "|" is not feasible (and I would not know which keyword produced the hit). Simply swapping "patterns" and "strings" does not work either: the patterns contain truncations (a trailing "*"), which would not behave as wildcards if the keywords were used as the strings to search in.

[Related question, other programming languages]

Recommended answer

What about applying the regexpr function over a vector of keywords?

keywords <- c("dog", "cat", "bird")

strings <- c("Do you have a dog?", "My cat ate my bird.", "Let's get icecream!")

sapply(keywords, regexpr, strings, ignore.case=TRUE)

     dog cat bird
[1,]  15  -1   -1
[2,]  -1   4   15
[3,]  -1  -1   -1

sapply(keywords, regexpr, strings[1], ignore.case=TRUE)

 dog  cat bird 
  15   -1   -1 

The values returned are the position of the first character of the match in each string, with -1 meaning no match.
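As a small illustration of what regexpr returns, the sketch below looks at a single keyword and a single string. Besides the start position, the result carries a match.length attribute, and regmatches can use both to pull out the matched text. The variable names pos, len, and matched_text are only illustrative.

```r
string <- "Do you have a dog?"

# regexpr returns the start position of the first match (or -1)
m <- regexpr("dog", string, ignore.case = TRUE)

pos <- as.integer(m)            # start position of the match
len <- attr(m, "match.length")  # number of characters matched

# regmatches extracts the matched substring using position and length
matched_text <- regmatches(string, m)

print(c(pos = pos, len = len))
print(matched_text)
```

Extracting the matched text this way can be handy with truncated keywords, where the actual word in the string is longer than the stem that matched.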

If the position of the match is irrelevant, use grepl instead:

sapply(keywords, grepl, strings, ignore.case=TRUE)

       dog   cat  bird
[1,]  TRUE FALSE FALSE
[2,] FALSE  TRUE  TRUE
[3,] FALSE FALSE FALSE
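For the single-string case asked about in the question, the logical vector can be used to recover which keywords produced a hit. The following is a minimal sketch using vapply (a stricter variant of sapply that checks each result is a single logical value); hits and matched are illustrative names.

```r
keywords <- c("dog", "cat", "bird")
string <- "My cat ate my bird."

# One TRUE/FALSE per keyword; the result is named after the keywords
hits <- vapply(keywords, grepl, logical(1), x = string, ignore.case = TRUE)

# Keep only the names of the keywords that matched
matched <- names(hits)[hits]
print(matched)
```

This still applies each pattern in turn internally, but it avoids an explicit loop and keeps track of which keyword matched.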

Update: This runs relatively quickly on my system, even with a large number of keywords:

# Available on most *nix systems
words <- scan("/usr/share/dict/words", what="")
length(words)
[1] 234936

system.time(matches <- sapply(words, grepl, strings, ignore.case=TRUE))

   user  system elapsed 
  7.495   0.155   7.596 

dim(matches)
[1]      3 234936
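The same idea can be applied to a dictionary in the format shown in the question. The sketch below is only one possible approach and rests on assumptions that are not confirmed by the question: that a trailing "*" means "any word starting with this stem", that keywords without "*" must match as whole words, and that keywords contain no other regex metacharacters. The mini-dictionary in dict_lines and the test sentence are made up for illustration.

```r
# Hypothetical mini-dictionary: keyword first, then category numbers
dict_lines <- c("ab 10 37 41", "abbrach* 38", "abend* 37", "aber 20 23 45")

# Split each line into the keyword and its categories
fields     <- strsplit(dict_lines, "[[:space:]]+")
keywords   <- vapply(fields, `[`, character(1), 1)
categories <- lapply(fields, function(f) as.integer(f[-1]))
names(categories) <- keywords

# Assumption: trailing "*" = truncation (prefix match on a word);
# otherwise the keyword must match as a whole word.
truncated <- grepl("\\*$", keywords)
stems     <- sub("\\*$", "", keywords)
patterns  <- paste0("\\b", stems, ifelse(truncated, "", "\\b"))

string <- "Heute Abend esse ich, aber nicht viel."

# One logical per pattern, in the same order as keywords
hits <- vapply(patterns, grepl, logical(1), x = string,
               ignore.case = TRUE, perl = TRUE)

matched_keywords   <- keywords[hits]
matched_categories <- categories[hits]
print(matched_keywords)
print(matched_categories)
```

Because hits is in the same order as keywords, indexing the categories list with it gives the categories for each keyword that matched. Text with non-ASCII characters such as umlauts may need extra care with word boundaries and encoding, which this sketch does not address.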
