R 函数来识别不匹配的行 [英] R Function to identify non-matching rows

查看:23
本文介绍了R 函数来识别不匹配的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试比较 2 个 data.frames,V1"代表我的 CRM,V2"代表我想发送的潜在客户.

I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out.

'V1 大约有 8k 个元素''V2 大约有 25k 个元素'

'V1 has roughly 8k elements' 'V2 has roughly 25k elements'

我需要将 V2 中的每一行与 V1 中的每一行进行比较,丢弃 V1 中存在 V2 元素的每个实例.

I need to compare every row in V2 to every row in V1, discard every instance where a V2 element exists in V1.

然后,我只想将在 V1 中没有完全或松散出现的元素返回到 Leads 列中.

I would then like to return only the elements that do not appear either exactly or loosely in V1 into the Leads column.

目标是发送 CRM(V1) 中不存在的潜在客户 (V2).

The goal is to send out a lead(V2) that does not exist in CRM(V1).

我在 stringdist 包方面取得了一些不错的进展,并用 'osa' 划分了 'soundex' 以提高我的机会,尽管这种方法仍然返回 V1 中的元素.:(

I've made some good progress with the stringdist package and divided 'soundex' by 'osa' to better my chances although this method still returns elements in V1.:(

这是我在潜在客户栏中寻找的预期结果,基于此示例:

This is the expected result I'm looking for in the Leads column, based on this example:

线索:琼斯复辟A.W.建设者C&C 承包商

Leads: J.Jones Restoration A.W. Builders C&C Contractors

非常感谢任何帮助,如果有任何不清楚的地方,我深表歉意.

Any help would be greatly appreciated and I apologize if this is unclear in any way.

library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)

df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "Paragon Const.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")

tidy_e <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2) %>% 
  arrange(V2,osa) %>% 
  mutate(V2, sim = soundex/ osa) %>% 
  distinct(V2, osa, soundex, sim) %>% 
  rename('Leads'= 'V2')

reprex 包 (v0.3.0) 于 2020 年 4 月 13 日创建

Created on 2020-04-13 by the reprex package (v0.3.0)

推荐答案

您可以使用 fuzzyjoin 包,专为基于不精确匹配(如字符串距离)连接表而设计.(免责声明是我是维护者).

You can use the fuzzyjoin package, designed for joining tables based on inexact matching such as string distance. (Disclaimer is that I'm the maintainer).

如果您将数据放在两个单独的表 V1 和 V2 中:

If you had your data in two separate tables V1 and V2:

V1 <- tibble(name = c("5th Generation Builder", "5th Generation Builders Inc.", "89 Contractors LLC", 
                      "906 Studio Architects LLC", "A & A Glass Co.", "A & E Farm", 
                      "A & H GLASS", "A & J Homeworks,Painting, and Restoration", "Paragon Const.", 
                      "A & L Construction"))

V2 <- tibble(name = c("5th Generation Builder, LLC", "5th Generation Builders", "89 Contractors LLC", 
                      "906 Studio Architects", "Paragon Const.", "A & E Farm", "C & C Contractors", 
                      "A.W. Builders", "J. Jones Restoration", "A & L Const."))

然后您可以使用 stringdist_anti_join() 来查找 V2 中在 V1 中没有 soundex 匹配的那些:

Then you could use stringdist_anti_join() to find the ones in V2 that don't have a soundex match in V1:

V2 %>%
  stringdist_anti_join(V1, by = "name", method = "soundex")

结果:

# A tibble: 3 x 1
  name                
  <chr>               
1 C & C Contractors   
2 A.W. Builders       
3 J. Jones Restoration

请参阅此小插图,了解更多关于stringdist_ 连接.

See this vignette for more on the stringdist_ joins.

请注意,如果您想查看每个匹配的,您可以使用stringdist_left_join():

Note that if you'd wanted to see which each matched to, you could use stringdist_left_join():

V2 %>%
  stringdist_left_join(V1, by = "name", method = "soundex")

# A tibble: 12 x 2
   name.x                      name.y                      
   <chr>                       <chr>                       
 1 5th Generation Builder, LLC 5th Generation Builder      
 2 5th Generation Builder, LLC 5th Generation Builders Inc.
 3 5th Generation Builders     5th Generation Builder      
 4 5th Generation Builders     5th Generation Builders Inc.
 5 89 Contractors LLC          89 Contractors LLC          
 6 906 Studio Architects       906 Studio Architects LLC   
 7 Paragon Const.              Paragon Const.              
 8 A & E Farm                  A & E Farm                  
 9 C & C Contractors           NA                          
10 A.W. Builders               NA                          
11 J. Jones Restoration        NA                          
12 A & L Const.                A & L Construction          

这篇关于R 函数来识别不匹配的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆