R n-gram extraction with regex

Views: 152 Published: 2020/7/10 2:06:04 Tags: regex r stringi

Problem description

Karl Broman's post, https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/, got me playing with regex and n-grams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers that do this, but I am interested in the regex logic (i.e., it was a self-challenge that I failed to meet).

Below I give a minimal example and the desired output. The problem with my attempt is twofold:

The grams (words) get eaten up and aren't available for the next pass. How can I make them available for the second pass? (E.g., I want like to be available for like toast after it has already been consumed in I like.)

I couldn't make the space between words non-captured (notice the trailing whitespace in my output even though I used (?:\\s*)). How can I avoid capturing the trailing space after the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I don't consider underscores and digits to be word parts, and I do consider ' to be a word part.

MWE:

library(stringi)

x <- "I like toast and jam."

stringi::stri_extract_all_regex(
    x,
    pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)

## [[1]]
## [1] "I like "    "toast and "

Desired output:

## [[1]]
## [1] "I like"     "like toast" "toast and"  "and jam"

Recommended answer

Here's one way using base R regex. It can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)). A look-ahead consumes no characters, so the regex engine advances only one position after each match and the second word of one gram is still available as the first word of the next.

x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# The look-ahead itself matches zero characters, so copy the capture-group
# lengths into match.length before handing the result to regmatches()
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[, 1])
regmatches(x, matches)

# [[1]]
# [1] "I like"     "like toast" "toast and"  "and jam"
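
The same look-ahead trick can be sketched for an arbitrary n. This is only an illustration, not part of the original answer: extract_ngrams is a hypothetical helper name, it assumes a single input string, and it assumes words are separated by exactly one space, as in the pattern above. It reads the capture positions directly from the capture.start and capture.length attributes that gregexpr() returns when perl = TRUE, instead of patching match.length.

```r
# Hypothetical helper (name and interface are assumptions, not from the answer).
# Extracts overlapping word n-grams from a single string using a capture group
# inside a positive look-ahead.
extract_ngrams <- function(x, n = 2) {
    word <- "\\b[A-Za-z']+\\b"
    # n word patterns joined by single spaces, wrapped in (?=( ... ))
    pattern <- paste0("(?=(", paste(rep(word, n), collapse = " "), "))")
    m <- gregexpr(pattern, x, perl = TRUE)[[1]]
    # gregexpr() signals "no match" with -1
    if (m[1] == -1) return(character(0))
    starts <- as.vector(attr(m, "capture.start")[, 1])
    lens   <- as.vector(attr(m, "capture.length")[, 1])
    substring(x, starts, starts + lens - 1)
}

x <- "I like toast and jam."
extract_ngrams(x, 2)
# [1] "I like"     "like toast" "toast and"  "and jam"
extract_ngrams(x, 3)
# [1] "I like toast"   "like toast and" "toast and jam"
```

Because the separator is a literal space, words separated by punctuation, tabs, or several spaces are not joined into a gram; whether that is desirable depends on the text being processed.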
