How to perform approximate (fuzzy) name matching in R

Question

I have a large data set on biological journals that was compiled over a long period by different people, so the data are not in a single format. For example, in the "AUTHOR" column I can find John Smith, Smith John, Smith J and so on, all referring to the same person. I cannot perform even the simplest operations; for example, I can't work out which authors wrote the most articles.

Is there any way in R to determine whether most of the characters in different names are the same, and then treat those names as the same element?

Recommended answer

There are packages that can help you with this, and some are listed in the comments. But if you don't want to use them, I thought I'd try to write something in R that might help. The code will match "John Smith" with "J Smith", "John Smith", "Smith John", and "John S". Meanwhile, it won't match something like "John Sally".

# generate some random names
names = c(
  "John Smith", 
  "Wigberht Ernust",
  "Samir Henning",
  "Everette Arron",
  "Erik Conor",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);

# split those names and get all ways to write that name
split_names = lapply(
  X = names,
  FUN = function(x){
    print(x);
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # get both combinations of c_split to compensate for order
    c_splits = list(c_split, rev(x = c_split));
    # return c_splits
    c_splits;
  }
)

# suppose we're looking for John Smith
search_for = "John Smith";

# split it by " " and then find all ways to write that name
search_for_split = unlist(x = strsplit(x = search_for, split = " "));
search_for_split = list(search_for_split, rev(x = search_for_split));

# initialise a vector containing if search_for was matched in names
match_statuses = c();

# for each name that's been split
for(i in 1:length(x = names)){

  # the match status for the current name
  match_status = FALSE;

  # the current split name
  c_split_name = split_names[[i]];

  # for each element in search_for_split
  for(j in 1:length(x = search_for_split)){

    # the current combination of name
    c_search_for_split_names = search_for_split[[j]];

    # for each element in c_split_name
    for(k in 1:length(x = c_split_name)){

      # the current combination of current split name
      c_c_split_name = c_split_name[[k]];

      # there's a match if the length of the grep() result (grep is a
      # pattern-finding function) is greater than zero
      if(
        # is c_search_for_split_names first element in c_c_split_name first
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[1],
            x = c_c_split_name[1]
          )
        ) > 0 &&
        # is c_search_for_split_names second element in c_c_split_name second 
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[2],
            x = c_c_split_name[2]
          )
        ) > 0 ||
        # or, is c_c_split_name first element in c_search_for_split_names first 
        # element
        length(
          x = grep(
            pattern = c_c_split_name[1],
            x = c_search_for_split_names[1]
          )
        ) > 0 &&
        # is c_c_split_name second element in c_search_for_split_names second 
        # element
        length(
          x = grep(
            pattern = c_c_split_name[2],
            x = c_search_for_split_names[2]
          )
        ) > 0
      ){
        # if this is the case, update match status to TRUE
        match_status = TRUE;
      } else {
        # otherwise, don't update match status
      }
    }
  }

  # append match_status to the match_statuses list
  match_statuses = c(match_statuses, match_status);
}

search_for;

[1] "John Smith"

cbind(names, match_statuses);

     names             match_statuses
[1,] "John Smith"      "TRUE"        
[2,] "Wigberht Ernust" "FALSE"       
[3,] "Samir Henning"   "FALSE"       
[4,] "Everette Arron"  "FALSE"       
[5,] "Erik Conor"      "FALSE"       
[6,] "Smith J"         "TRUE"        
[7,] "Smith John"      "TRUE"        
[8,] "John S"          "TRUE"
[9,] "John Sally"      "FALSE"   

Hopefully this code can serve as a starting point, and you may wish to adjust it to work with names of arbitrary length.
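As a point of comparison, base R also ships an edit-distance function, `adist()` in the `utils` package. The sketch below is not the approach used in the code above: it sorts the tokens of each name so that word order stops mattering, then measures the Levenshtein distance to the search name. The helper `normalize_name` and the vector `author_names` are hypothetical names used only for illustration.

```r
# Hypothetical helper: lower-case a name, split it on whitespace and
# sort the tokens, so "Smith John" and "John Smith" give the same key.
normalize_name <- function(x) {
  tokens <- strsplit(tolower(trimws(x)), "[[:space:]]+")
  vapply(tokens, function(t) paste(sort(t), collapse = " "), character(1))
}

# Illustrative data (an assumption, not taken from a real data set)
author_names <- c("John Smith", "Smith John", "Smith J", "John Sally")

keys <- normalize_name(author_names)

# Levenshtein distance from the search name to every normalized name;
# adist() returns a matrix with one row per search name
distances <- utils::adist(normalize_name("John Smith"), keys)

data.frame(name = author_names, key = keys, distance = distances[1, ])
```

Note that plain edit distance treats an initial as three missing characters ("j smith" is 3 edits from "john smith") while a different surname such as "john sally" is only 4 edits away, so a distance threshold alone is unlikely to separate the two cases cleanly. That is one reason to combine it with token-level rules like the ones in the code above.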

Some notes:

  • for loops in R can be slow. If you're working with lots of names, look into Rcpp.

  • You may wish to wrap this in a function. Then, you can apply this for different names by adjusting search_for.

  • There are time complexity issues with this example, and depending on the size of your data, you may want/need to rework it.
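The notes above could be combined along the following lines. This is a minimal sketch, not the answer's original code: it wraps the matching in a function, replaces the explicit loops with `vapply()`, and uses `startsWith()` prefix checks in place of `grep()`, which is a slightly looser rule (each token pair may match in either direction). The names `is_prefix_pair`, `match_name` and `author_names` are hypothetical, and the function assumes every name has exactly two space-separated tokens.

```r
# Hypothetical helper: TRUE when one token is a prefix of the other,
# so that "J" is compatible with "John" and "S" with "Smith".
is_prefix_pair <- function(a, b) {
  startsWith(a, b) || startsWith(b, a)
}

# Hypothetical wrapper: returns one logical per candidate name.
# Assumption: names consist of exactly two tokens; anything else is FALSE.
match_name <- function(search_for, candidates) {
  target <- strsplit(search_for, " ", fixed = TRUE)[[1]]
  vapply(
    strsplit(candidates, " ", fixed = TRUE),
    function(tokens) {
      if (length(tokens) != 2L || length(target) != 2L) return(FALSE)
      # tokens in the same order as the search name
      same_order <- is_prefix_pair(tokens[1], target[1]) &&
        is_prefix_pair(tokens[2], target[2])
      # tokens in reversed order ("Smith John" vs "John Smith")
      swapped <- is_prefix_pair(tokens[1], target[2]) &&
        is_prefix_pair(tokens[2], target[1])
      same_order || swapped
    },
    logical(1)
  )
}

author_names <- c(
  "John Smith", "Wigberht Ernust", "Samir Henning", "Everette Arron",
  "Erik Conor", "Smith J", "Smith John", "John S", "John Sally"
)

# Number of entries attributed to the same person
sum(match_name("John Smith", author_names))
# → 4
```

Each call is still linear in the number of names, so comparing every name against every other name remains quadratic; for a large data set it may be worth reducing the candidates first, for example by grouping on a shared token.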
