Fuzzy matching in R

Question

I am trying to detect matches between an open text field (read: messy!) with a vector of names. I created a silly fruit example that highlights my main challenges.

df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
              entry = c("Apple", 
                        "I love apples", 
                        "appls",
                        "Bannanas",
                        "banana", 
                        "An apple a day keeps..."))
df1$entry <- as.character(df1$entry)

df2 <- data.frame(fruit=c("apple",
                          "banana",
                          "pineapple"),
                  code=c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)

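library(dplyr)
library(stringr)

# Note: str_detect() pairs strings and patterns element-wise, so the 3 fruit
# patterns are recycled over the 6 entries; each entry is compared with only
# one fruit, not with all of them.
df1 %>%
  mutate(match = str_detect(str_to_lower(entry), 
                            str_to_lower(df2$fruit)))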

My approach grabs the low hanging fruit, if you will (exact matches for "Apple" and "banana").

#  id                   entry match
#1  1                   Apple  TRUE
#2  2           I love apples FALSE
#3  3                   appls FALSE
#4  4                Bannanas FALSE
#5  5                  banana  TRUE
#6  6 An apple a day keeps... FALSE

The unmatched cases have different challenges:

  1. In cases 2 and 6, the target fruit is embedded in a larger string.
  2. In cases 3 and 4, the target fruit requires fuzzy matching.
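
The first challenge can be illustrated without any fuzzy matching. The following is a minimal base-R sketch, not part of the original question, that tests every entry against every fruit as a literal substring instead of pairing them element-wise. The names `entries`, `fruits`, `hits`, and `any_hit` are illustrative assumptions.

```r
# Illustrative copies of the example data (names are assumptions for this sketch).
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits  <- c("apple", "banana", "pineapple")

# Build a logical matrix: one row per entry, one column per fruit.
# A cell is TRUE when the fruit occurs as a literal substring of the
# lower-cased entry.
hits <- sapply(fruits, function(f) grepl(f, tolower(entries), fixed = TRUE))

# An entry counts as matched if at least one fruit was found in it.
any_hit <- rowSums(hits) > 0

print(hits)
print(any_hit)
```

With this check, the embedded cases (2 and 6) are found, while the misspelled entries (3 and 4) still need an approximate comparison.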

The fuzzywuzzyR package is great and does a pretty good job (see page for details on installing python modules).

library(fuzzywuzzyR)
choices <- df2$fruit
word <- df1$entry[3]  # "appls"

init_proc = FuzzUtils$new()      
PROC = init_proc$Full_process    
PROC1 = tolower                  

init_scor = FuzzMatcher$new()    
SCOR = init_scor$WRATIO          

init <- FuzzExtract$new()        

init$Extract(string = word, 
             sequence_strings = choices, 
             processor = PROC, 
             scorer = SCOR)

This setup returns a score of 80 for "apple" (the highest).

Is there another approach to consider aside from fuzzywuzzyR? How would you tackle this problem?

Adding the fuzzywuzzyR output:

[[1]]
[[1]][[1]]
[1] "apple"

[[1]][[2]]
[1] 80


[[2]]
[[2]][[1]]
[1] "pineapple"

[[2]][[2]]
[1] 72


[[3]]
[[3]][[1]]
[1] "banana"

[[3]][[2]]
[1] 18

Recommended answer

I found this question referenced while answering a question today. So I thought of answering the original question.

library(dplyr)
library(fuzzyjoin)

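df1 %>%
  # join each entry to the fruits by Jaro-Winkler distance, ignoring case
  stringdist_left_join(df2, by = c(entry = "fruit"), ignore_case = TRUE,
                       method = "jw", distance_col = "dist") %>%
  group_by(entry) %>%
  # keep the closest fruit (smallest distance) for each entry
  top_n(-1, dist) %>%
  select(-dist)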

The output is:

     id entry                   fruit      code
  <dbl> <fct>                   <fct>     <dbl>
1  1.00 Apple                   apple      11.0
2  2.00 I love apples           pineapple  13.0
3  3.00 appls                   apple      11.0
4  4.00 Bannanas                banana     12.0
5  5.00 banana                  banana     12.0
6  6.00 An apple a day keeps... apple      11.0

Sample data:

df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  entry = c("Apple", "I love apples", "appls", "Bannanas", "banana", "An apple a day keeps..."))
df2 <- data.frame(fruit=c("apple", "banana", "pineapple"), code=c(11, 12, 13))
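
In the output above, "I love apples" is assigned to "pineapple", because the distance is computed on whole strings. As a possible alternative (a different technique from the Jaro-Winkler join in this answer, shown only as a sketch), base R's `adist()` with `partial = TRUE` measures the generalized Levenshtein distance between a pattern and the best-matching substring of the text. The variable names and the `max_dist` cutoff below are assumptions chosen for illustration.

```r
# Illustrative copies of the example data (names are assumptions for this sketch).
entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits  <- c("apple", "banana", "pineapple")

# Rows correspond to fruits (patterns), columns to entries (texts).
# partial = TRUE compares each fruit with substrings of each entry,
# so a fruit embedded in a longer sentence has distance 0.
d <- adist(fruits, entries, partial = TRUE, ignore.case = TRUE)

# For each entry (column), pick the fruit with the smallest distance.
best      <- fruits[apply(d, 2, which.min)]
best_dist <- apply(d, 2, min)

result <- data.frame(entry = entries, fruit = best, dist = best_dist,
                     stringsAsFactors = FALSE)

# Hypothetical cutoff: treat anything further than max_dist edits as no match.
max_dist <- 1
result$fruit[result$dist > max_dist] <- NA

print(result)
```

The cutoff matters on real data: without it, every entry is assigned its nearest fruit even when nothing in the entry resembles one. A suitable value depends on how long the target names are and how noisy the text is.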
