Fuzzy matching in R
Problem description
I am trying to detect matches between an open text field (read: messy!) and a vector of names. I created a silly fruit example that highlights my main challenges.
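library(dplyr)
library(stringr)

df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
                  entry = c("Apple",
                            "I love apples",
                            "appls",
                            "Bannanas",
                            "banana",
                            "An apple a day keeps..."))
df1$entry <- as.character(df1$entry)

df2 <- data.frame(fruit = c("apple",
                            "banana",
                            "pineapple"),
                  code = c(11, 12, 13))
df2$fruit <- as.character(df2$fruit)

# Caution: df2$fruit has three elements, so the patterns are paired with the
# six entries element-wise (recycled); each entry is tested against one fruit only.
df1 %>%
  mutate(match = str_detect(str_to_lower(entry),
                            str_to_lower(df2$fruit)))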
My approach grabs the low hanging fruit, if you will (exact matches for "Apple" and "banana").
# id entry match
#1 1 Apple TRUE
#2 2 I love apples FALSE
#3 3 appls FALSE
#4 4 Bannanas FALSE
#5 5 banana TRUE
#6 6 An apple a day keeps... FALSE
The unmatched cases have different challenges:
- The target fruit in cases 2 and 6 is embedded in a larger string.
- The target fruit in cases 3 and 4 requires a fuzzy match.
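The first challenge (a fruit embedded in a longer string) can be illustrated with base R alone. The following is only a sketch: the names `entries`, `fruits`, and `hits` are made up for illustration, and it assumes that a plain lower-cased substring test is an acceptable definition of "embedded". It tests every entry against every fruit, rather than pairing them element-wise.

entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# One column per fruit, one row per entry:
# TRUE where the fruit occurs as a literal substring of the lower-cased entry
hits <- sapply(fruits, function(f) grepl(f, tolower(entries), fixed = TRUE))
rownames(hits) <- entries
hits

Entries 2 and 6 now register a hit for "apple", while the misspelled entries 3 and 4 still match nothing, which is where a fuzzy method is needed.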
The fuzzywuzzyR package is great and does a pretty good job (see its page for details on installing the Python modules).
library(fuzzywuzzyR)
choices <- df2$fruit
word <- df1$entry[3] # "appls"
init_proc = FuzzUtils$new()
PROC = init_proc$Full_process
PROC1 = tolower
init_scor = FuzzMatcher$new()
SCOR = init_scor$WRATIO
init <- FuzzExtract$new()
init$Extract(string = word,
sequence_strings = choices,
processor = PROC,
scorer = SCOR)
This setup returns a score of 80 for "apple" (the highest).
Is there another approach to consider aside from fuzzywuzzyR? How would you tackle this problem?
Adding the fuzzywuzzyR output:
[[1]]
[[1]][[1]]
[1] "apple"
[[1]][[2]]
[1] 80
[[2]]
[[2]][[1]]
[1] "pineapple"
[[2]][[2]]
[1] 72
[[3]]
[[3]][[1]]
[1] "banana"
[[3]][[2]]
[1] 18
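The nested list above is awkward to work with directly. As a rough sketch, it can be flattened into a data frame and the best-scoring choice picked out. The list `res` below is copied by hand from the output shown, so no Python backend is required; the names `res`, `scores`, and `best` are illustrative only.

# Hand-copied from the output shown above
res <- list(list("apple", 80), list("pineapple", 72), list("banana", 18))

# One row per candidate: the choice and its score
scores <- data.frame(
  choice = vapply(res, function(x) x[[1]], character(1)),
  score  = vapply(res, function(x) x[[2]], numeric(1)),
  stringsAsFactors = FALSE
)

# Keep the highest-scoring candidate
best <- scores[which.max(scores$score), ]
best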
Recommended answer
I found this question referenced while answering a question today. So I thought of answering the original question.
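library(dplyr)
library(fuzzyjoin)

df1 %>%
  # Join on Jaro-Winkler string distance and keep the distance in a "dist" column
  stringdist_left_join(df2, by = c(entry = "fruit"), ignore_case = TRUE,
                       method = "jw", distance_col = "dist") %>%
  group_by(entry) %>%
  # For each entry, keep the candidate with the smallest distance
  top_n(-1, dist) %>%
  select(-dist)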
The output is:
id entry fruit code
<dbl> <fct> <fct> <dbl>
1 1.00 Apple apple 11.0
2 2.00 I love apples pineapple 13.0
3 3.00 appls apple 11.0
4 4.00 Bannanas banana 12.0
5 5.00 banana banana 12.0
6 6.00 An apple a day keeps... apple 11.0
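Note that row 2 pairs "I love apples" with pineapple rather than apple, possibly because the distance is computed over the whole string. A different technique, not the one used in this answer, is base R's `adist()` with `partial = TRUE`, which scores each fruit against the best-matching substring of an entry. The sketch below assumes that plain edit distance is a suitable measure; the names `d`, `best`, and `result` are illustrative.

entries <- c("Apple", "I love apples", "appls",
             "Bannanas", "banana", "An apple a day keeps...")
fruits <- c("apple", "banana", "pineapple")

# Rows = fruits, columns = entries; each cell is the edit distance between
# the fruit and the closest substring of the entry, ignoring case
d <- adist(fruits, entries, partial = TRUE, ignore.case = TRUE)

# Index of the closest fruit for each entry
best <- apply(d, 2, which.min)

result <- data.frame(entry = entries,
                     fruit = fruits[best],
                     dist  = d[cbind(best, seq_along(entries))],
                     stringsAsFactors = FALSE)
result

Because every entry receives its nearest fruit, however distant, a cutoff on `dist` would be needed in practice to reject poor matches; the right threshold depends on the data.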
Sample data:
df1 <- data.frame(id = c(1, 2, 3, 4, 5, 6),
entry = c("Apple", "I love apples", "appls", "Bannanas", "banana", "An apple a day keeps..."))
df2 <- data.frame(fruit=c("apple", "banana", "pineapple"), code=c(11, 12, 13))