Fuzzy regex match on million rows Pandas df


Question

I am trying to check for fuzzy matches between a string column and a reference list. The string series contains over 1M rows, and the reference list contains over 10K entries.

For example:

df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows

ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows

### Output should look like

df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])

A match should be generated if the word appears as a separate token in the string, with at most 1 character substitution allowed within it.

For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.

Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' (MANDER being 1 character away from XANDER), but does not generate a match against 'ALEXANDERS'.
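The matching rule above can be checked directly with the third-party `regex` module (not the stdlib `re`), whose `{s<=1:[A-Z]}` fuzzy syntax allows at most one character substitution drawn from `[A-Z]`. A minimal sketch using one reference word from the question:

```python
import regex

# Whole-token match with at most 1 substitution: (?<!\S) and (?!\S)
# require whitespace (or string edges) on both sides of the match.
pattern = regex.compile(r"(?:(?<!\S)PARIS(?!\S)){s<=1:[A-Z]}")

print(pattern.search("PARIS HILTON"))        # exact token match
print(pattern.search("THE HARIS DOWNTOWN"))  # 1-char substitution: HARIS
print(pattern.search("APARISIAN"))           # no match: embedded in a word
```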

We have written logic for this (shown below), but the match takes over 4 hours to run. We need to get it under 30 minutes.

Current code:

import regex

tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([fr'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])


def search_it(partyname):
    # {s<=1:[A-Z]} allows up to 1 character substitution (regex module fuzzy syntax)
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None


df['MATCH'] = df['NAMES'].apply(search_it)

Also, would multiprocessing help with the regex? Many thanks in advance!

Answer

Your pattern is rather inefficient, as you repeat each tag pattern three times in the regex. You just need to create a pattern with so-called whitespace boundaries, (?<!\S) and (?!\S), and then each tag needs to appear only once.
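As a minimal illustration of the whitespace-boundary idea (tag names taken from the question, no fuzziness yet), a single occurrence per tag replaces the three start/middle/end alternatives:

```python
import regex

tags = ['XANDER', 'PARIS']

# (?<!\S) and (?!\S) assert the match is not preceded/followed by a
# non-whitespace character, i.e. it is a whole whitespace-delimited token.
pattern = regex.compile(r"(?<!\S)(?:{})(?!\S)".format("|".join(map(regex.escape, tags))))

print(pattern.search("NOVA XANDER"))  # token in mid-string
print(pattern.search("PARIS"))        # token is the whole string
print(pattern.search("ALEXANDERS"))   # no match: embedded in a larger word
```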

Next, if you have several thousand alternatives, even the single-occurrence tag pattern will be extremely slow, because many alternatives can match at the same location in the string, and thus there will be too much backtracking.

To reduce this backtracking, you will need a regex trie.

Here is a working snippet:

import regex
import pandas as pd

## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref = ref.setdefault(char, {})
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:  # recurse is None at a leaf node
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) # 10k rows

trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None
    
df['MATCH'] = df['NAMES'].apply(search_it)
