REGEX in R: extracting words from a string
Question
I guess this is a common problem, and I found quite a lot of web pages about it, including some on SO, but I failed to understand how to implement it.
I am new to regex, and I'd like to use it in R to extract the first few words from a sentence.
For example, if my sentence is
z = "I love stack overflow it is such a cool site"
I'd like my output to be (if I need the first four words)
[1] "I love stack overflow"
or (if I need the last four words)
[1] "such a cool site"
Of course, the following works:
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
But I'd like to try a regex solution for performance reasons, as I need to deal with very large files (and also for the sake of learning about it).
I looked at several links, including "Regex to extract first 3 words from a string" and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
So I tried
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
I tried other things, but they usually returned either the whole string or an empty string.
Another problem with strsplit is that it returns a list. It looks like the [[ ]] operator may be slowing things down a bit (??) when dealing with large files and doing apply stuff.
It looks like the regex syntax used in R is somewhat different? Thanks!
Recommended answer
You've already accepted an answer, but I'm going to share this as a way of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
- You used single backslashes (\). R requires you to escape those since they are special characters inside string literals. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1.
- You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
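To make the escaping point concrete, here is a minimal sketch. It prints the pattern as the regex engine receives it and compares it with a raw string literal; the raw string syntax r"(...)" is an assumption about your setup, since it is only available in R 4.0.0 and later. The variable names are illustrative.

```r
# A single backslash must be written as "\\" inside an R string literal.
nchar("\\")            # 1: the string holds one character

z <- "I love stack overflow it is such a cool site"
pattern <- "^((?:\\S+\\s+){2}\\S+).*"
cat(pattern, "\n")     # shows the pattern with single backslashes

# Assumption: R >= 4.0.0, which added raw string literals r"(...)".
raw_pattern <- r"(^((?:\S+\s+){2}\S+).*)"
identical(pattern, raw_pattern)   # TRUE: both spell the same pattern

# The second fix: supply the replacement, here a reference to group 1.
result <- sub(pattern, "\\1", z, perl = TRUE)
result                 # "I love stack"
```

With a raw string the pattern can be copied from a non-R source unchanged, which avoids the "unrecognized escape" error from the question.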
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
- Work from the start of the contents of "z".
- Start creating group 1.
- Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times ({2}), and then the next run of non-whitespace (\S+). This gets us 3 words without also capturing the whitespace after the third word. Thus, if you want a different number of words, change the {2} to one less than the number you are actually after.
- End group 1 there.
- Then, just return the contents of group 1 (\1) from "z". The trailing .* matches the rest of the string, so the whole string is replaced by group 1.
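The "one less than the number you are after" rule can be wrapped in a small function. This is only a sketch: first_n_words is a hypothetical helper name, not something from the original answer or from base R, and it builds the same pattern with sprintf.

```r
# Hypothetical helper (not from the original answer): keep the first n
# whitespace-separated words of each element of x.
first_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(length(n) == 1L, !is.na(n), n >= 1L)
  # Repeat "word + whitespace" n - 1 times, then capture one more word.
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 3)  # "I love stack"
first_n_words(z, 4)  # "I love stack overflow"
```

Note that when a string has fewer than n words the pattern does not match, so sub returns the string unchanged. The same happens if the string starts with whitespace, because the pattern expects a word right at the start.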
To get the last three words, just move the capturing group to the end of the pattern:
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
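Regarding the concern about lists in the question: sub is vectorized and returns a character vector, so no [[ ]] indexing is needed. The sketch below uses a hypothetical helper, last_n_words, and a made-up second sentence to show this, and checks the result against the strsplit approach. It makes no claim about which approach is faster; that would need to be measured on your own data.

```r
# Hypothetical helper (not from the original answer): keep the last n
# whitespace-separated words of each element of x.
last_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(length(n) == 1L, !is.na(n), n >= 1L)
  # Skip everything up to some whitespace, then capture n words at the end.
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

sentences <- c(
  "I love stack overflow it is such a cool site",
  "regex in R needs doubled backslashes"   # illustrative extra input
)

res <- last_n_words(sentences, 4)
is.character(res)  # TRUE: a plain character vector, not a list
res                # "such a cool site" "R needs doubled backslashes"

# The strsplit approach from the question, applied to each element.
via_split <- vapply(
  strsplit(sentences, " "),
  function(w) paste(utils::tail(w, 4), collapse = " "),
  character(1)
)
identical(res, via_split)  # TRUE for these inputs
```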