String matching from a list in R


Question

I am trying to perform string matching in R using grep. I need to match the values in df1$ColA against df2$ColA. The inputs and the expected outputs are given below:

df1:

ColA
text1
text2
text3
text4
text5
text6
text7

df2:

ColA
text1 text2 text12
text23 text22 text7

Intermediate output:

ColA                 ColB
text1 text2 text12   text1, text2
text23 text22 text7  text7

Final output:

ColA                 ColB
text1 text2 text12   text1
text1 text2 text12   text2
text23 text22 text7  text7

Approach:

I am currently using:

test$test <- sapply(df2$ColA, function(x) ifelse(grep(paste(as.character(unlist(df1$ColA)),collapse="|"),x),1,0))

This tells me whether any df1$ColA string matches a df2$ColA element, but it does not return the matching strings. Please advise.

Recommended answer

Here's a semi-vectorised solution based on match() that should be fast and produce exactly the output you are looking for. It tokenises each element of df2$ColA and matches every token against df1$ColA. It then repeats the original df2$ColA element once per token, keeps the repetitions whose token matched, and adds the matching df1$ColA value as ColB in the output.
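To make the tokenise-and-match step concrete, here is a minimal sketch on a single string. The names lookup and sentence are hypothetical stand-ins for df1$ColA and one element of df2$ColA; they are not part of the original answer.

```r
# Hypothetical stand-ins for the question's data
lookup   <- paste0("text", 1:7)      # plays the role of df1$ColA
sentence <- "text1 text2 text12"     # one element of df2$ColA

# Split the string into space-separated tokens
tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# For each token, match() returns its position in lookup, or NA if absent
idx <- match(tokens, lookup)

# Keep only the lookup values that were actually found
matched <- lookup[idx[!is.na(idx)]]
matched
```

Because match() compares whole tokens, "text12" is not counted as a hit for "text1", whereas an unanchored pattern such as grepl("text1", "text12") would return TRUE.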

# set up the data, which the OP should have done
df1 <- data.frame(ColA = paste0("text", 1:7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ColA = c("text1 text2 text12",
                           "text23 text22 text7"),
                  stringsAsFactors = FALSE)

# create a matrix of matches of first to elements of second
matmatrix <- sapply(strsplit(df2$ColA, " "), match, df1$ColA)
# repeat original text in same length as potential match
origdfColArep <- rep(df2$ColA, each = nrow(matmatrix))

# create the results dataset, first the matches of the second part
result <- data.frame(ColA = origdfColArep[!is.na(as.vector(matmatrix))],
                     stringsAsFactors = FALSE)
# then add the matching first part
result$ColB <- df1$ColA[na.omit(as.vector(matmatrix))]

result
##                  ColA  ColB
## 1  text1 text2 text12 text1
## 2  text1 text2 text12 text2
## 3 text23 text22 text7 text7
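
One caveat: sapply() only simplifies its result to a matrix when every df2$ColA element splits into the same number of tokens. The variant below is a sketch, not the answer's own code, that keeps the matches in a list instead, so rows may have different token counts. It also builds the intermediate output requested in the question. The third df2 row and the names tokens, matches, intermediate and final are assumptions added for illustration.

```r
df1 <- data.frame(ColA = paste0("text", 1:7), stringsAsFactors = FALSE)
df2 <- data.frame(ColA = c("text1 text2 text12",
                           "text23 text22 text7",
                           "text3 text99"),   # extra row, added for illustration
                  stringsAsFactors = FALSE)

# One character vector of matched tokens per row of df2
tokens  <- strsplit(df2$ColA, " ", fixed = TRUE)
matches <- lapply(tokens, function(tok) tok[tok %in% df1$ColA])

# Intermediate output: one row per df2 element, matches joined with ", "
intermediate <- data.frame(
  ColA = df2$ColA,
  ColB = vapply(matches, paste, character(1), collapse = ", "),
  stringsAsFactors = FALSE
)

# Final output: one row per (df2 element, matched token) pair
final <- data.frame(
  ColA = rep(df2$ColA, lengths(matches)),
  ColB = unlist(matches, use.names = FALSE),
  stringsAsFactors = FALSE
)

intermediate
final
```

A row with no matches gets an empty string in intermediate$ColB and contributes no rows to final.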
