Regex match with fuzzyjoin / dplyr

Question

I have two data frames that I want to join by their first columns while ignoring case:

library(dplyr)
library(fuzzyjoin)

df3 <- data.frame("A" = c("XX28801", "ZZ9"), "B" = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame("Z" = c("X2880", "Zz9"), "C" = c("three", "four"), stringsAsFactors = FALSE)

This is what I want:

df5<- data.frame(A = c("XX28801","ZZ9"), B = c("one","two"), Z = c(NA,"Zz9"), C = c(NA, "four"))

But interestingly, I get this using the fuzzyjoin package:

join <- regex_left_join(df3,df4,by= c("A" = "Z"), ignore_case = TRUE)

It's good that ZZ9 and Zz9 matched, but I have no idea why XX28801 matched with X2880. The only similarity is the X2880 inside XX28801.

I also don't want to uppercase or lowercase the values before joining, as I want column A and column Z to retain their original values. Thanks.

Recommended answer

Regex joins match on regular expressions: each value in the right-hand table is treated as a pattern and searched for within the text of the left-hand table. Because "X2880" is found within "XX28801", this is considered a match.

To understand regex better, you might find it useful to explore some comparisons using grepl(pattern, text), which returns TRUE if the pattern is found within the text and FALSE otherwise:

> grepl('X2880', 'XX28801', ignore.case = TRUE)
[1] TRUE
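
To see the difference between a substring search and a whole-string match, the same comparison can be repeated with anchors. This is a small sketch in base R; the variable names partial, whole, and same are only illustrative. The ^ and $ anchors mark the start and end of the string, so the pattern has to cover the entire text:

```r
# Unanchored: the pattern only has to occur somewhere inside the text
partial <- grepl("X2880", "XX28801", ignore.case = TRUE)

# Anchored: ^ and $ force the pattern to span the whole string
whole <- grepl("^X2880$", "XX28801", ignore.case = TRUE)

# Anchored and case-insensitive: differs only in capitalization
same <- grepl("^Zz9$", "ZZ9", ignore.case = TRUE)

print(c(partial = partial, whole = whole, same = same))
# partial is TRUE, whole is FALSE, same is TRUE
```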

It seems you want a match only when the entire string in one column equals the entire string in the other, apart from differences in upper/lower case. For this I would recommend creating temporary columns to join on:

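# Add lowercase copies of the key columns; A and Z themselves are left untouched
df3_w_lower = df3 %>%
  mutate(A_for_join = tolower(A))
df4_w_lower = df4 %>%
  mutate(Z_for_join = tolower(Z))

# Join on the lowercase copy, then drop the helper column
# (left_join keeps only the left-hand key, so Z_for_join is already gone)
join = left_join(df3_w_lower, df4_w_lower, by = c("A_for_join" = "Z_for_join")) %>%
  select(-A_for_join)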

By using temporary columns for the join, you preserve the original capitalization in columns A and Z.
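
If you would rather keep regex_left_join, one possible alternative (a sketch, not part of the original answer) is to anchor each pattern so that it must cover the whole string. The helper column Z_pattern and the names df4_anchored and anchored_join are hypothetical, and the sketch assumes the values in Z contain no regex metacharacters such as . or (; if they might, they would need escaping first.

```r
library(dplyr)
library(fuzzyjoin)

df3 <- data.frame(A = c("XX28801", "ZZ9"), B = c("one", "two"), stringsAsFactors = FALSE)
df4 <- data.frame(Z = c("X2880", "Zz9"), C = c("three", "four"), stringsAsFactors = FALSE)

# Hypothetical helper column: wrap each Z value in ^...$ so the regex
# must match the entire A value instead of any substring of it.
# Assumes Z holds plain text with no regex metacharacters.
df4_anchored <- df4 %>%
  mutate(Z_pattern = paste0("^", Z, "$"))

# Case-insensitive regex join on the anchored pattern, then drop the helper
anchored_join <- regex_left_join(df3, df4_anchored,
                                 by = c("A" = "Z_pattern"),
                                 ignore_case = TRUE) %>%
  select(-Z_pattern)

print(anchored_join)
```

With the anchors in place, XX28801 should no longer pair with X2880, while ZZ9 and Zz9 should still match and both columns keep their original capitalization.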