Substring extraction from vector in R

Views: 45 | Published: 2021/7/6 19:29:59 | Tags: regex, r, stringr

Question

I am trying to extract substrings from unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How do I pass this vector of character values to extract exact matches from unstructured text?

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output would be:

ID     text
1      United States
4      Israel
5      Canada

So far I have been working with gsub, removing all non-matches and then removing the rows left with empty values. I have also tried str_extract from the stringr package, but haven't managed to get the arguments for the regular expression right. Any assistance would be greatly appreciated!

Recommended answer

1. stringr

We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matched elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
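
The pattern above matches a country name anywhere in the string, including inside a longer word. If "exact match" should mean whole words only, one option is to wrap the alternation in word boundaries. The following is a minimal sketch and not part of the original answer; `sample_text` and `pattern_wb` are names made up for illustration.

```r
library(stringr)

countries <- c("United States", "Israel", "Canada")

# Hypothetical sample text: "Israeli" should not count as a match for "Israel"
sample_text <- c("United States is a match",
                 "Israeli food is not a match",
                 "Visit Canada")

# Plain alternation, as in the answer above
indx <- paste(countries, collapse = "|")

# Same alternation in a non-capturing group, surrounded by \b word boundaries
pattern_wb <- paste0("\\b(?:", indx, ")\\b")

plain <- str_extract(sample_text, indx)        # second element is "Israel"
res   <- str_extract(sample_text, pattern_wb)  # second element is NA
```

Note that country names containing regex metacharacters (for example parentheses or dots) would need escaping before being pasted into a pattern; that case is not handled here.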

2. base R

Without using any external packages, we can keep only the substrings that match 'indx' and drop everything else, using 'regmatches' together with 'gregexpr'.

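# Subset to the rows that contain a match, so each row yields a value
text.df1 <- text.df[grep(indx, text.df$text),]
# Extract the matched substrings (assumes one match per row)
text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))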
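
The `unlist(regmatches(..., gregexpr(...)))` approach assumes exactly one match per row; a row mentioning two countries would produce more values than rows. A possible variant, sketched here under the assumption that only the first match per row is wanted, uses `regexpr` instead. The names `hit`, `m` and `result` are illustrative and not from the original answer.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# Logical vector: which rows contain at least one country name
hit <- grepl(indx, text.df$text)

# regexpr() records only the first match in each string
m <- regexpr(indx, text.df$text)

# regmatches() with regexpr() data returns one value per matching string,
# in order, so it lines up with the rows kept by 'hit'
result <- text.df[hit, ]
result$text <- regmatches(text.df$text, m)
result
```

This works directly on the full 'text.df', so no separate subsetting step is needed beforehand.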

3. stringi

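We can also use stringi. Here 'stri_extract' returns NA for rows without a match, 'na.omit' drops those rows, and '[-2]' removes the original 'text' column.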

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada
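
The examples so far return at most one country per row. If a single text could mention several countries, a sketch along the following lines collects every match per string with `stri_extract_all_regex`. This goes beyond the original answer, and the input vector `txt` is invented for illustration.

```r
library(stringi)

countries <- c("United States", "Israel", "Canada")
indx <- paste(countries, collapse = "|")

# Hypothetical input: the first string mentions two countries
txt <- c("Canada and Israel both match", "Not a match")

# Returns a list with one character vector per input string;
# strings without any match yield NA by default
all_matches <- stri_extract_all_regex(txt, indx)
all_matches
```

Because the result is a list, it would need to be flattened or stored as a list column before being attached to a data frame.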
