Algorithmic way to search a list of tuples for a matching substring?

Question

I have a list of tuples, about 100k entries. Each tuple consists of an id and a string, and my goal is to list the ids of the tuples whose strings contain a substring from a given list of substrings. My current solution uses a set comprehension, since ids can repeat:

tuples = [(id1, 'cheese trees'), (id2, 'freezy breeze'),...]
vals = ['cheese', 'flees']
ids = {i[0] for i in tuples if any(val in i[1] for val in vals)}

output: {id1}

Is there an algorithm that would allow doing this quicker? I'm interested in exact substring matches, and possibly also in approximate ones. The main thing I'm after here is an algorithm that offers a speed advantage over the comprehension.

Answer

Disclaimer: I'm the author of trrex.

For the case of exact matching, one approach to solving this, as mentioned in the comments, is to use a Trie. trrex is a library that builds a Trie-Regex (a Trie in regex form) that can be used with Python's regular expression engine:

import random
import re

import pandas as pd
import trrex as tx

# Benchmark data: ~2K Jeopardy questions and a sample of words (see resources below)
df = pd.read_csv('jeopardy-small.csv')
with open('words-sample') as infile:
    words = [line.strip() for line in infile]

# Pair each question with a random id, so ids can repeat as in the question
tuples = [(random.randint(1, 250), sentence) for sentence in df['question']]


def fun_kislyuk(ws, ts):
    # Baseline: the set comprehension from the question
    return {t[0] for t in ts if any(w in t[1] for w in ws)}


def fun_trrex(ws, ts):
    # One trie-shaped regex covering all the words; left=''/right='' make it a
    # plain substring match rather than a whole-word match
    pattern = re.compile(tx.make(ws, left='', right=''))
    return {i for i, s in ts if pattern.search(s)}


if __name__ == "__main__":
    print(fun_trrex(words, tuples) == fun_kislyuk(words, tuples))

Output

True
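
To see the idea in isolation, here is a minimal sketch on the question's toy data (the integer ids are hypothetical stand-ins; left='' and right='' are passed as in the benchmark above so the words are matched as plain substrings):

import re

import trrex as tx

# Toy data shaped like the question's (ids here are hypothetical stand-ins)
toy_tuples = [(1, 'cheese trees'), (2, 'freezy breeze')]
toy_words = ['cheese', 'flees']

# tx.make builds a single trie-shaped regex that covers every word in the list
pattern = re.compile(tx.make(toy_words, left='', right=''))

ids = {i for i, s in toy_tuples if pattern.search(s)}
print(ids)  # {1} -- only 'cheese trees' contains one of the words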

The timings for the above functions are:

%timeit fun_trrex(words, tuples)
11.3 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The data is a list of around 2K questions from Jeopardy and 500 randomly chosen words. You can find the resources for reproducing the experiments here.

Update

If you add the grouping strategy mentioned in the comments, the timing improves further; below is the function:

from collections import defaultdict  # needed for grouping by id


def fun_grouping_trrex(ws, ts):
    pattern = re.compile(tx.make(ws, left='', right=''))
    # Group the strings by id, so each id is decided at most once and the
    # any() short-circuits on the first matching string
    groups = defaultdict(list)
    for i, s in ts:
        groups[i].append(s)

    return {i for i, vs in groups.items() if any(pattern.search(v) for v in vs)}

And the timings:

%timeit fun_trrex(words, tuples)
11.2 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_grouping_trrex(words, tuples)
4.96 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The approach of grouping + trrex gives you roughly a 10x performance improvement. But take this last result with a grain of salt, because it is very dependent on the dataset.
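
As a quick usage example of the grouping variant (assuming fun_grouping_trrex and its imports from the update above are in scope; the ids and strings are illustrative):

toy_tuples = [(1, 'cheese trees'), (1, 'cheddar wheels'), (2, 'freezy breeze')]
toy_words = ['cheese', 'flees']

# Only id 1 has at least one string containing one of the words
print(fun_grouping_trrex(toy_words, toy_tuples))  # expected: {1}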
