Position of Approximate Substring Matches in R

Question

I'm using R for string processing. I have a data frame with a column of strings, say:

 df <- data.frame(textcol=c("I would like to find the position of this substring",
 "I would also like to find the position of thes substring",
 "No match here","No mention of this substrangy thing"),
 stringsAsFactors=FALSE)

 matchPattern <- "this substring"

I am searching for a function that (depending on a distance parameter of some sort, say Jaro-Winkler) would take my matchPattern, compare it to every row of the data frame text column, and return the exact position of the match within the matched string, i.e. 36 (unless I miscounted) for the first element, (perhaps) 43 for the second, NA for the third and 14 (?) for the fourth.

Recommended answer

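You can use aregexec, which does approximate (fuzzy) matching and returns the match positions in the same format as regexec.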

## Get positions (-1 instead of NA)
positions <- aregexec(matchPattern, df$textcol, max.distance = 0.1)
unlist(positions)
# [1] 38 43 -1 15

## Extract matches
regmatches(df$textcol, positions)
# [[1]]
# [1] "this substring"
# 
# [[2]]
# [1] "thes substring"
# 
# [[3]]
# character(0)
# 
# [[4]]
# [1] "this substrang"
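
If NA is preferred over -1 for rows without a match, as the question asks, the match data can be flattened by hand. The following is a minimal sketch rather than part of the original answer: the names texts, start, len, no_match and matched are made up for illustration, and it assumes the pattern has no capture groups, so each element of the aregexec result holds a single start position with a match.length attribute.

```r
# Example strings from the question, kept in a plain character vector
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"

positions <- aregexec(matchPattern, texts, max.distance = 0.1)

# First entry of each element is the start of the whole match (-1 if none)
start <- vapply(positions, function(m) as.integer(m[1L]), integer(1))
len   <- vapply(positions,
                function(m) as.integer(attr(m, "match.length")[1L]),
                integer(1))

# Recode "no match" as NA
no_match <- start == -1L
start[no_match] <- NA_integer_
len[no_match]   <- NA_integer_

# Matched text recovered from start and length; NA where nothing matched
matched <- substr(texts, start, start + len - 1L)
```

Here start should hold the same positions as the answer reports, with NA in place of -1 for the third string. As documented for agrep, a fractional max.distance is scaled by the pattern length and rounded up, so it may be worth passing an integer instead if an explicit edit limit is wanted.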

Edit

## A possibility for replacing matches, or maybe `regmatches<-`
res <- regmatches(df$textcol, positions)
res[lengths(res)==0] <- "XXXX"  # deal with 0 length matches somehow
df$out <- Vectorize(gsub)(unlist(res), "Censored", df$textcol)
df$out
# [1] "I would like to find the position of Censored"     
# [2] "I would also like to find the position of Censored"
# [3] "No match here"                                     
# [4] "No mention of Censoredy thing"                     
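
One thing to be aware of with the gsub approach is that the matched text is reused as a regular expression, and rows without a match need the "XXXX" placeholder. A possible alternative, sketched here as an assumption rather than as the answer's method, is to splice the replacement in by position. The helper censor_match and the variables texts and out are hypothetical names introduced only for this example.

```r
texts <- c("I would like to find the position of this substring",
           "I would also like to find the position of thes substring",
           "No match here",
           "No mention of this substrangy thing")
matchPattern <- "this substring"
positions <- aregexec(matchPattern, texts, max.distance = 0.1)

# Hypothetical helper: replace the matched span of one string by position.
# `m` is one element of the aregexec result (start plus match.length).
censor_match <- function(text, m, replacement = "Censored") {
  start <- m[1L]
  if (start == -1L) return(text)               # no match: leave text as is
  end <- start + attr(m, "match.length")[1L] - 1L
  paste0(substr(text, 1L, start - 1L),         # part before the match
         replacement,
         substr(text, end + 1L, nchar(text)))  # part after the match
}

# mapply walks the strings and their match data in parallel
out <- unname(mapply(censor_match, texts, positions))
```

Because no pattern is built from the matched text, regex metacharacters in a match cannot change the result, and unmatched rows pass through untouched.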
