Fuzzy string matching in Python


Problem description


I have 2 lists of over a million names with slightly different naming conventions. The goal here is to match records that are similar, with 95% confidence.


I am aware there are libraries I can leverage, such as the FuzzyWuzzy module in Python.


However, in terms of processing it seems it will take up too many resources to compare every string in one list against every string in the other, which in this case seems to require 1 million multiplied by another million iterations.


Are there any other more efficient methods for this problem?

Update:


So I created a bucketing function and applied simple normalization: removing whitespace and symbols, converting the values to lowercase, etc.

from fuzzywuzzy import process
from tqdm import tqdm

# dftest is a pandas DataFrame with a 'Name' column and a 'YM' (year-month) bucket column
for n in list(dftest['YM'].unique()):
    n = str(n)
    frame = dftest['Name'][dftest['YM'] == n]
    print(len(frame))
    print(n)
    for names in tqdm(frame):
        closest = process.extractOne(names, frame)


By using Python's pandas, the data is loaded into smaller buckets grouped by year, and then the FuzzyWuzzy module's process.extractOne is used to get the best match.


Results are still somewhat disappointing. During testing, the code above was run on a test data frame containing only 5 thousand names and took almost a whole hour.

The test data is split into:

  • Name

  • Month of date of birth


And I am comparing them bucket by bucket, where names with the same YM fall into the same bucket.


Could the problem be because of the FuzzyWuzzy module I am using? Appreciate any help.

Answer


There are several levels of optimization possible here to turn this problem from O(n^2) into a lesser time complexity.



  • Preprocessing: Sort your list in the first pass, creating an output map for each string; the key for the map can be the normalized string. Normalizations may include:

  • lowercase conversion,
  • no whitespaces, special characters removal,
  • transform unicode to ascii equivalents if possible (use unicodedata.normalize or the unidecode module)


This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similar grouped names.

You can use this utility method (https://github.com/dhruvpathak/misc-python-utils/blob/master/helpers.py#L145) to normalize your string (it does not include the unicode part though):

import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):
    """ Processes a string for similarity comparisons: cleans special characters and extra whitespace
        if normalized is True, and removes the substrings which are in ignore_list
    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True, method removes special characters and extra whitespace from string,
                            and converts to lowercase
        ignore_list (list) : the substrings which need to be removed from the input string
    Returns:
       str : returns processed string
    """
    for ignore_str in ignore_list:
        input_str = re.sub(r'{0}'.format(ignore_str), "", input_str, flags=re.IGNORECASE)

    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and extra whitespace
        input_str = re.sub(r"\W", "", input_str).strip()

    return input_str
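
As a rough illustration of the preprocessing step, here is a minimal sketch (the to_ascii helper, names_list, and the normalized_names map are assumptions for illustration, not part of the answer's utility) that adds the unicode-to-ascii transform with unicodedata and groups the original names under their normalized key:

import unicodedata
from collections import defaultdict

def to_ascii(input_str):
    # Assumed helper: strip accents, e.g. "ándréw" -> "andrew"
    return unicodedata.normalize('NFKD', input_str).encode('ascii', 'ignore').decode('ascii')

# Group every original name under its normalized key, so exact duplicates
# collapse into one bucket before any fuzzy matching is done.
normalized_names = defaultdict(list)
for name in names_list:  # names_list: your raw list of a million names (assumption)
    key = process_str_for_similarity_cmp(to_ascii(name), normalized=True)
    normalized_names[key].append(name)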





    • Now similar strings will already lie in the same bucket if their normalized key is the same.


      For further comparison, you will need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since such similar names will still need fuzzy string matching on top of the normalized comparison done above.


      Bucketing: Do you really need to compare a 5 character key with a 9 character key to see if that is a 95% match? No, you do not. So you can create buckets for matching your strings, e.g. 5 character keys will be matched with 4-6 character keys, 6 character keys with 5-7 character keys, etc. An n+1, n-1 character limit for an n character key is a reasonably good bucket for most practical matching (see the sketch below).
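
      A minimal sketch of such length bucketing, reusing the illustrative normalized_names map built earlier (these helper names are assumptions, not part of the answer):

      from collections import defaultdict

      # Index every normalized key by its length.
      keys_by_length = defaultdict(list)
      for key in normalized_names:
          keys_by_length[len(key)].append(key)

      def length_bucket_candidates(key):
          """Yield only the keys whose length is within +/- 1 of the given key."""
          for length in (len(key) - 1, len(key), len(key) + 1):
              for other in keys_by_length.get(length, []):
                  if other != key:
                      yield other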


      Beginning match: Most variations of names will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can run matching for keys starting with a only against other keys which start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match a key andrewhsmith against bndrewhsmith, as a name variation in the first letter will rarely exist.
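
      Continuing the previous illustrative sketch, the buckets can be keyed on (first character, length) instead of length alone (again an assumed sketch, not the answer's code):

      # Bucket keys by (first character, length) instead of length alone.
      keys_by_prefix_and_length = defaultdict(list)
      for key in normalized_names:
          if key:  # skip keys that normalized to an empty string
              keys_by_prefix_and_length[(key[0], len(key))].append(key)

      def candidate_keys(key):
          """Yield keys sharing the first character whose length is within +/- 1."""
          for length in (len(key) - 1, len(key), len(key) + 1):
              for other in keys_by_prefix_and_length.get((key[0], length), []):
                  if other != key:
                      yield other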


      Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage; you may exclude either jaro_winkler or difflib to optimize your speed and result quality:

      import difflib
      import jellyfish

      def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
          """ Calculates matching ratio between two strings
          Args:
              first_str (str) : First String
              second_str (str) : Second String
              normalized (bool) : if True, method removes special characters and extra whitespace
                                  from strings then calculates matching ratio
              ignore_list (list) : list of substrings which have to be substituted with "" in the strings
          Returns:
             Float Value : Returns a matching ratio between 1.0 ( most matching ) and 0.0 ( not matching )
                          using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                          equal weightage to each
          Examples:
              >>> find_string_similarity("hello world","Hello,World!",normalized=True)
              1.0
              >>> find_string_similarity("entrepreneurship","entreprenaurship")
              0.95625
              >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized=True,ignore_list=["the","of"])
              1.0
          """
          first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
          second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
          # Note: newer jellyfish releases expose this as jaro_winkler_similarity
          match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio() + jellyfish.jaro_winkler(first_str, second_str)) / 2.0
          return match_ratio
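
      Tying this back to the 95% confidence requirement, a minimal sketch (using the illustrative normalized_names and candidate_keys helpers from above, which are assumptions rather than part of the answer) could look like:

      MATCH_THRESHOLD = 0.95

      matches = []
      for key in normalized_names:
          for other_key in candidate_keys(key):
              ratio = find_string_similarity(key, other_key)
              if ratio >= MATCH_THRESHOLD:
                  # record the original (un-normalized) names behind each key
                  matches.append((normalized_names[key], normalized_names[other_key], ratio))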
      

