R 函数来识别不匹配的行 [英] R Function to identify non-matching rows
问题描述
我正在尝试比较 2 个 data.frames,V1"代表我的 CRM,V2"代表我想发送的潜在客户.
I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out.
'V1 大约有 8k 个元素''V2 大约有 25k 个元素'
'V1 has roughly 8k elements' 'V2 has roughly 25k elements'
我需要将 V2 中的每一行与 V1 中的每一行进行比较,丢弃 V1 中存在 V2 元素的每个实例.
I need to compare every row in V2 to every row in V1, discard every instance where a V2 element exists in V1.
然后,我只想将在 V1 中没有完全或松散出现的元素返回到 Leads 列中.
I would then like to return only the elements that do not appear either exactly or loosely in V1 into the Leads column.
目标是发送 CRM(V1) 中不存在的潜在客户 (V2).
The goal is to send out a lead(V2) that does not exist in CRM(V1).
我在 stringdist 包方面取得了一些不错的进展,并用 'osa' 划分了 'soundex' 以提高我的机会,尽管这种方法仍然返回 V1 中的元素.:(
I've made some good progress with the stringdist package and divided 'soundex' by 'osa' to better my chances although this method still returns elements in V1.:(
这是我在潜在客户栏中寻找的预期结果,基于此示例:
This is the expected result I'm looking for in the Leads column, based on this example:
线索:琼斯复辟A.W.建设者C&C 承包商
Leads: J.Jones Restoration A.W. Builders C&C Contractors
非常感谢任何帮助,如果有任何不清楚的地方,我深表歉意.
Any help would be greatly appreciated and I apologize if this is unclear in any way.
library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)
df <- tibble::tribble(
~V1, ~V2,
"5th Generation Builder", "5th Generation Builder, LLC",
"5th Generation Builders Inc.", "5th Generation Builders",
"89 Contractors LLC", "89 Contractors LLC",
"906 Studio Architects LLC", "906 Studio Architects",
"A & A Glass Co.", "Paragon Const.",
"A & E Farm", "A & E Farm",
"A & H GLASS", "C & C Contractors",
"A & J Homeworks,Painting, and Restoration", "A.W. Builders",
"Paragon Const.", "J. Jones Restoration",
"A & L Construction", "A & L Const.")
tidy_e <- tidy_stringdist(df) %>%
filter(soundex>=1) %>%
select(-V1, V2) %>%
arrange(V2,osa) %>%
mutate(V2, sim = soundex/ osa) %>%
distinct(V2, osa, soundex, sim) %>%
rename('Leads'= 'V2')
由 reprex 包 (v0.3.0) 于 2020 年 4 月 13 日创建上>
Created on 2020-04-13 by the reprex package (v0.3.0)
推荐答案
您可以使用 fuzzyjoin 包,专为基于不精确匹配(如字符串距离)连接表而设计.(免责声明是我是维护者).
You can use the fuzzyjoin package, designed for joining tables based on inexact matching such as string distance. (Disclaimer is that I'm the maintainer).
如果您将数据放在两个单独的表 V1 和 V2 中:
If you had your data in two separate tables V1 and V2:
V1 <- tibble(name = c("5th Generation Builder", "5th Generation Builders Inc.", "89 Contractors LLC",
"906 Studio Architects LLC", "A & A Glass Co.", "A & E Farm",
"A & H GLASS", "A & J Homeworks,Painting, and Restoration", "Paragon Const.",
"A & L Construction"))
V2 <- tibble(name = c("5th Generation Builder, LLC", "5th Generation Builders", "89 Contractors LLC",
"906 Studio Architects", "Paragon Const.", "A & E Farm", "C & C Contractors",
"A.W. Builders", "J. Jones Restoration", "A & L Const."))
然后您可以使用 stringdist_anti_join()
来查找 V2 中在 V1 中没有 soundex 匹配的那些:
Then you could use stringdist_anti_join()
to find the ones in V2 that don't have a soundex match in V1:
V2 %>%
stringdist_anti_join(V1, by = "name", method = "soundex")
结果:
# A tibble: 3 x 1
name
<chr>
1 C & C Contractors
2 A.W. Builders
3 J. Jones Restoration
请参阅此小插图,了解更多关于stringdist_
连接.
See this vignette for more on the stringdist_
joins.
请注意,如果您想查看每个匹配的,您可以使用stringdist_left_join()
:
Note that if you'd wanted to see which each matched to, you could use stringdist_left_join()
:
V2 %>%
stringdist_left_join(V1, by = "name", method = "soundex")
# A tibble: 12 x 2
name.x name.y
<chr> <chr>
1 5th Generation Builder, LLC 5th Generation Builder
2 5th Generation Builder, LLC 5th Generation Builders Inc.
3 5th Generation Builders 5th Generation Builder
4 5th Generation Builders 5th Generation Builders Inc.
5 89 Contractors LLC 89 Contractors LLC
6 906 Studio Architects 906 Studio Architects LLC
7 Paragon Const. Paragon Const.
8 A & E Farm A & E Farm
9 C & C Contractors NA
10 A.W. Builders NA
11 J. Jones Restoration NA
12 A & L Const. A & L Construction
这篇关于R 函数来识别不匹配的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!