REGEX in R: extracting words from a string

Problem description

I guess this is a common problem, and I found quite a lot of webpages, including some from SO, but I failed to understand how to implement it.

I am new to REGEX, and I'd like to use it in R to extract the first few words from a sentence.

For example, if my sentence is

z = "I love stack overflow it is such a cool site"

I'd like my output to be (if I need the first four words)

[1] "I love stack overflow"

or (if I need the last four words)

[1] "such a cool site"

Of course, the following works:

paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")

But I'd like to try a regex solution for performance reasons, as I need to deal with very large files (and also for the sake of learning about it).

I looked at several links, including "Regex to extract first 3 words from a string" and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html

So I tried

gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"

I tried other things, but they usually returned either the whole string or the empty string.

Another problem with strsplit is that it returns a list. Maybe the [[]] operator slows things down a bit (??) when dealing with large files and doing apply stuff.

It looks like the syntax used in R is somewhat different? Thanks!

Recommended answer

You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.

There are two problems with your gsub approach:

  1. You used single backslashes (\). Inside an R string literal the backslash is itself the escape character, so each one has to be escaped by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1: the doubled backslash in the source is a single character in the string.

  2. You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
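Both points can be checked quickly in the console. The following is only a minimal sketch; the raw-string form r"(...)" is an assumption that you are running R 4.0.0 or later, and the variable name first_word is just for illustration.

```r
z <- "I love stack overflow it is such a cool site"

# In R source, "\\" denotes a single backslash character in the string
nchar("\\")

# so "\\S" is the two-character regex token \S
nchar("\\S")

# Assumption: R >= 4.0.0. Raw strings let you write the regex without doubling.
identical("\\S+", r"(\S+)")

# Parentheses capture a group; "\\1" in the replacement refers to group 1.
# Here the whole string is matched and replaced by its first word.
first_word <- sub("^(\\S+).*", "\\1", z, perl = TRUE)
first_word
```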

You should have tried something like:

sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"

This is essentially saying:

  • Work from the start of the contents of "z".
  • Start creating group 1.
  • Find non-whitespace (like a word) followed by whitespace (\S+\s+), two times ({2}), and then the next run of non-whitespace (\S+). This gets us 3 words, without also getting the whitespace after the third word. Thus, if you want a different number of words, change the {2} to one less than the number you are actually after.
  • End group 1 there.
  • Then, just return the contents of group 1 (\1) from "z".
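The "one less than the number you want" rule can be wrapped in a small function. This is a sketch only: first_n_words is a hypothetical helper, not part of the original answer, and it builds the same pattern with sprintf.

```r
# Hypothetical helper (not from the original answer): keep the first n words.
first_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(n >= 1L)
  # n words = (n - 1) repetitions of "word + whitespace", then one more word
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 4)
# [1] "I love stack overflow"
```

Note that when the pattern does not match (for example, the string has fewer than n words or starts with whitespace), sub returns the input unchanged.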

To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.

sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
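The same idea works for the end of the string, and because sub is vectorized it can be applied to a whole character vector in one call, with no strsplit and no [[ ]] indexing. Again a sketch under assumptions: last_n_words and the sentences vector are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not from the original answer): keep the last n words.
last_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(n >= 1L)
  # Skip everything up to some whitespace, then capture exactly n words at the end
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
last_n_words(z, 4)
# [1] "such a cool site"

# sub() is vectorized, so a whole character vector is handled in one call
sentences <- c(z, "one two three four five")
last_n_words(sentences, 2)
# [1] "cool site" "four five"
```

Whether this is actually faster than the strsplit approach on your files is worth measuring rather than assuming.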
