R: Regex_Join/Fuzzy_Join - Join Inexact Strings in Different Word Orders


Problem description


library(dplyr)
library(fuzzyjoin)

df1 <- tibble(a = c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog"))
df2 <- tibble(b = c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog"),
              c = c("Fruit", "Clothes", "Animals"))

# Appends only 'Animals'
df3 <- regex_left_join(df1, df2, by = c("a" = "b"))
# Appends nothing
df3 <- stringdist_left_join(df1, df2, by = c("a" = "b"), max_dist = 3, method = "lcs")

I want to append column c of df2 to df1, matching on the shared strings 'Apple', 'Sock' and 'Mouse Dog'.

I tried doing this with regex_join and fuzzyjoin, but the order of the words in each string seems to matter, and I can't find a way around it.

Recommended answer

regex_left_join works, but it does not look for arbitrary similarities between strings. As its description says:

Join a table with a string column by a regular expression column in another table

So we need to provide a regex pattern. If df2$b contains separate words of interest, we can replace each space with the alternation operator |, so that any single word is enough for a match:

(df2$regex <- gsub(" ", "|", df2$b))
# [1] "Kiwi|Lemon|Apple" "Shirt|Sock|Glove" "Mouse|Dog"      
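
To see why the alternation helps, here is a minimal base R sketch (no fuzzyjoin needed) that tests every string in df1$a against the original phrases and against the alternation patterns. It uses grepl as a stand-in for the per-pair test a regex join performs, which is an assumption about the mechanics rather than fuzzyjoin's actual code; the names a, b, hits, phrase_hits and word_hits are illustrative only.

```r
a <- c("Apple Pear Orange", "Sock Shoe Hat", "Cat Mouse Dog")
b <- c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog")

# hits(patterns)[i, j] is TRUE when patterns[j] matches somewhere in a[i]
hits <- function(patterns) {
  sapply(patterns, function(p) grepl(p, a), USE.NAMES = FALSE)
}

phrase_hits <- hits(b)                  # whole phrases used as patterns
word_hits   <- hits(gsub(" ", "|", b))  # any single word is enough

# Row i against pattern i: the pairs the question wants to join
print(diag(phrase_hits))
print(diag(word_hits))
```

With whole phrases, only "Mouse Dog" occurs verbatim inside its counterpart, which is why the original regex join appended only 'Animals'. With the alternation, each row of df1 matches exactly one row of df2.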

Then

regex_left_join(df1, df2, by = c(a = "regex"))[-ncol(df1) - ncol(df2)]
# A tibble: 3 x 3
#   a                 b                c      
#   <chr>             <chr>            <chr>  
# 1 Apple Pear Orange Kiwi Lemon Apple Fruit  
# 2 Sock Shoe Hat     Shirt Sock Glove Clothes
# 3 Cat Mouse Dog     Mouse Dog        Animals

Here the index -ncol(df1) - ncol(df2) evaluates to -4, because df2 now has three columns including regex, so it simply drops the last column of the result, which holds the regex patterns.
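
One caveat worth checking: a plain alternation matches substrings, so "Sock" would also match inside "Socket". If that matters for your data, a possible hardening is to wrap each pattern in word boundaries. The sketch below uses base R only, and the names loose and strict are hypothetical; it is not part of the original answer.

```r
b <- c("Kiwi Lemon Apple", "Shirt Sock Glove", "Mouse Dog")

loose  <- gsub(" ", "|", b)              # e.g. "Shirt|Sock|Glove"
strict <- paste0("\\b(", loose, ")\\b")  # e.g. "\\b(Shirt|Sock|Glove)\\b"

# perl = TRUE so that \b is interpreted as a word boundary
loose_socket  <- grepl(loose[2],  "Socket Wrench", perl = TRUE)
strict_socket <- grepl(strict[2], "Socket Wrench", perl = TRUE)
strict_sock   <- grepl(strict[2], "Sock Shoe Hat", perl = TRUE)

print(c(loose_socket, strict_socket, strict_sock))
```

Assigning strict to df2$regex instead of the plain alternation should work the same way in the join. Dropping the helper column by name, for example with dplyr's select(-regex), may also be easier to read than the positional index.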
