Fuzzy string matching in R

Question

I have two datasets with more than 100K rows each. I would like to merge them by fuzzy string matching on one column (the movie title) and by release date. A sample from each dataset is shown below.

Dataset 1

itemid userid rating       time                              title release_date
99991    1673    835      3 1998-03-27                             mirage         1995
99992    1674    840      4 1998-03-29                         mamma roma         1962
99993    1675    851      3 1998-01-08                     sunchaser, the         1996
99994    1676    851      2 1997-10-01                   war at home, the         1996
99995    1677    854      3 1997-12-22                      sweet nothing         1995
99996    1678    863      1 1998-03-07                         mat' i syn         1997
99997    1679    863      3 1998-03-07                          b. monkey         1998
99998    1680    863      2 1998-03-07                      sliding doors         1998
99999    1681    896      3 1998-02-11                       you so crazy         1994
100000   1682    916      3 1997-11-29 scream of stone (schrei aus stein)         1991

Dataset 2

itemid userid rating       time                                   title release_date
1    2844   4477      3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine         1913
2    4936   8871      4 2013-05-05                                the bank         1915
3    4936  11628      3 2013-07-06                                the bank         1915
4    4972  16885      4 2013-08-19                   the birth of a nation         1915
5    5078  11628      2 2013-08-23                               the cheat         1915
6    6684   4222      3 2013-08-24                             the fireman         1916
7    6689   4222      3 2013-08-24                         the floorwalker         1916
8    7264   2092      4 2013-03-17                                the rink         1916
9    7264   5943      3 2013-05-12                                the rink         1916
10   7880  11628      4 2013-07-19                             easy street         1917

I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance, and then carry on with further processing, which is very time-consuming given the size of the datasets. The strings can contain typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. I have read that the latter is good when strings contain typos.

In this scenario, fuzzy matching alone may not give good results. For example, the movie title 'toy story' in one dataset can be matched to 'toy story 2' in the other, which is not right. So I also need to consider the release date to make sure that matched movies are unique.

Is there a way to achieve this without using a loop? In the worst case, if I have to use a loop, how can I make it run as efficiently and as fast as possible?

I have tried the following code, but it takes an awful amount of time to run.

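library(RecordLinkage)  # assumed source of jarowinkler()

test$title.match <- NA_character_
for (i in 1:nrow(test))
  for (j in 1:nrow(test1))
  {
    # keep title i once any title in test1 scores above 0.85
    if (jarowinkler(test$x[i], test1$x[j]) > 0.85)
      test$title.match[i] <- test$x[i]
  }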

test contains 1682 unique movie names converted to lower case.
test1 contains 11451 unique movie names converted to lower case.

Is there a way to avoid the for loops and make it run faster?

Recommended answer

What about this approach to move you forward? You can adjust the match threshold from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting the release dates. Any zeros would mean the same release date.

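# jarowinkler() is assumed to come from the RecordLinkage package.
# The data frames are named dataset_1 and dataset_2 because dataset-1 is not a valid R name.
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)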
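
As a rough illustration of how every title in one dataset could be compared with every title in the other without an explicit loop, here is a minimal sketch. It swaps in the stringdist package (assumed to be installed) in place of jarowinkler(), and it uses an exact release-year filter instead of the dplyr summarise step described above. The data frames d1 and d2, their contents, and the 0.15 distance cut-off (taken as the counterpart of a 0.85 similarity) are hypothetical stand-ins, not the asker's data.

```r
library(stringdist)  # assumption: installed; not used in the original answer

# Hypothetical miniature stand-ins for the two datasets
d1 <- data.frame(title = c("toy story", "sliding doors", "mamma roma"),
                 release_date = c(1995, 1998, 1962),
                 stringsAsFactors = FALSE)
d2 <- data.frame(title = c("toy story 2", "slidding doors", "the bank"),
                 release_date = c(1999, 1998, 1915),
                 stringsAsFactors = FALSE)

# All pairwise Jaro-Winkler distances in one vectorised call
# (rows follow d1, columns follow d2). Distance is 1 - similarity.
d <- stringdistmatrix(d1$title, d2$title, method = "jw", p = 0.1)

# Rule out pairs whose release years differ, so that "toy story" (1995)
# cannot be paired with "toy story 2" (1999).
d[outer(d1$release_date, d2$release_date, "!=")] <- Inf

# Closest remaining candidate for each d1 row; keep it only if close enough.
best <- apply(d, 1, which.min)
ok <- d[cbind(seq_len(nrow(d1)), best)] < 0.15

d1$title.match <- ifelse(ok, d2$title[best], NA)
print(d1)
```

The full distance matrix holds one number per pair of titles, so this sketch is better suited to the unique title lists (1682 by 11451) than to the raw 100K-row tables. If memory becomes a problem, the same idea can be applied one release year at a time.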
