Substring extraction from vector in R
Problem description
I am trying to extract substrings from unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I pass this vector of character values to extract exact matches from unstructured text?
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub, where I remove all non-matches and then remove the rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arguments for the regular expression correct. Any assistance would be greatly appreciated!
Recommended answer
1. stringr
We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matching elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
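Since the question asks for exact matches, a variant of the same idea is sketched below. It wraps the alternation in word boundaries so that a name is not matched inside a longer word. The names `pattern`, `matched`, and `result` are illustrative and not part of the original answer, and the sketch assumes the country names contain no regex metacharacters.

```r
library(stringr)

countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

# Alternation of all names, anchored at word boundaries on both sides
# (assumption: names contain no regex metacharacters that need escaping).
pattern <- paste0("\\b(?:", paste(countries, collapse = "|"), ")\\b")

# str_extract() returns NA for strings with no match, so a single pass
# both finds the matching rows and extracts the matched substring.
matched <- str_extract(text.df$text, pattern)
keep <- !is.na(matched)

result <- data.frame(ID = text.df$ID[keep],
                     text = matched[keep],
                     stringsAsFactors = FALSE)
result
```

Because non-matching rows come back as NA, the separate 'grep' step is not needed in this variant.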
2. base R
Without using any external packages, we can keep only the substrings that match 'indx', using 'regmatches' with 'gregexpr' on the subset dataset ('text.df1') in place of the 'str_extract' step
text.df1$text <- unlist(regmatches(text.df1$text,
gregexpr(indx, text.df1$text)))
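For a version that runs on its own from the original 'text.df', here is a minimal base R sketch. It uses 'regexpr' (first match per string) rather than 'gregexpr', which assumes each text contains at most one country name. The names `m`, `hit`, and `text.df2` are illustrative.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# regexpr() returns the start of the first match in each string, or -1 if none.
m <- regexpr(indx, text.df$text)
hit <- as.vector(m) != -1

# regmatches() returns only the matched substrings, skipping elements
# without a match, so its length equals the number of rows kept.
text.df2 <- text.df[hit, ]
text.df2$text <- regmatches(text.df$text, m)
text.df2
```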
3. stringi
We could also use stringi
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada