Substring extraction from vector in R

Question

I am trying to extract substrings from unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How do I pass this vector of character values to extract exact matches from unstructured text?

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output would be:

ID     text
1      United States
4      Israel
5      Canada

So far I have been working with gsub, removing all non-matches and then removing the rows left with empty values. I have also tried str_extract from the stringr package, but haven't managed to get the arguments for the regular expression right. Any assistance would be greatly appreciated!

Recommended answer

1. stringr

We could first subset 'text.df' with grep, using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern, and then use str_extract to get the matched elements from the 'text' column, assigning them to the 'text' column of the subset dataset ('text.df1').

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
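
Note that a pattern built by collapsing with "|" matches substrings, so "Israel" would also match inside "Israeli". If "exact match" is meant as a whole-word match, one option is to wrap the alternation in word boundaries. The following is a minimal sketch under that assumption; the input vector 'x' and the names 'plain', 'bounded', 'loose' and 'strict' are made up for illustration and are not part of the original answer.

```r
countries <- c("United States", "Israel", "Canada")

# Hypothetical input: the second string contains "Israeli", not "Israel"
x <- c("Israel is a match", "An Israeli company", "Canada is a match")

# Pattern as in the answer: plain alternation of the country names
plain <- paste(countries, collapse = "|")

# Same alternation, grouped and wrapped in word boundaries (\b)
bounded <- paste0("\\b(", plain, ")\\b")

loose  <- grepl(plain, x)                  # substring match, also hits "Israeli"
strict <- grepl(bounded, x, perl = TRUE)   # whole-word match only

loose
strict
```

The bounded pattern can be passed to grep and str_extract in the same way as 'indx'. Names containing regex metacharacters (such as "." or "(") would additionally need escaping, which this sketch does not attempt.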

2. base R

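Without using any external packages, we can keep only the substrings that match 'indx' by combining regmatches and gregexpr. This operates on 'text.df1', the subset created with grep above, and the unlist step assumes one match per row.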

text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))
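
Because the snippet above relies on 'text.df1' already existing, here is a self-contained base R sketch that starts from the question's data. It swaps gregexpr for regexpr (first match per string only), which lets regmatches drop the non-matching rows by itself; the names 'm', 'hit' and 'res' are introduced here for illustration.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match",
                               "Not a match", "Israel is a match",
                               "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# regexpr returns the position of the first match in each string, or -1
m <- regexpr(indx, text.df$text)
hit <- m != -1

# With regexpr match data, regmatches returns only the matched substrings,
# so its length equals the number of rows that matched
res <- data.frame(ID = text.df$ID[hit],
                  text = regmatches(text.df$text, m),
                  stringsAsFactors = FALSE)
res
```

If a row could contain more than one country, gregexpr would be the better fit, at the cost of handling a list of matches per row.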

3. stringi

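We could also use stri_extract from stringi on the full 'text.df': rows without a match get NA and are dropped by na.omit, and [-2] removes the original 'text' column.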

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada
