查找字符串的*最*公共前缀-更好的方法? [英] Find *most* common prefix of strings - a better way?

查看:93
本文介绍了查找字符串的*最*公共前缀-更好的方法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个键列表 ['foo_a','foo_b','foo_c','fnord']

此处所有类似的解决方案都假定您的文本中没有 fnord

All of the similar solutions here assume that you have no fnord's in your text.

我有执行此工作的这段代码:

I have this code that does the job:

def detect_prefix(keys):
    PCT = 0.70 # cutof 
    pre = ''
    l = len(keys)
    for i in range(0, len(max(keys, key=len))):
        keys = filter(lambda k: k.startswith(pre), keys)
        cnt = dict()
        for k in map(lambda k: k[i], keys):
            cnt.setdefault(k,0)
            cnt[k] +=1
        if cnt[max(cnt)] / float(l) >= PCT:
            pre += max(cnt)
        else:
            break
    return pre

我有一个怀疑可以更优雅地完成此操作,但是我的python-fu目前不够强大。

I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.

我很想听听一些建议。

修改
其他背景和说明。

这些是其他开发人员在应用程序中放入的用于翻译的键。
他们应该有一个通用的前缀,但是人们忘记了,他们从其他代码中剪切并粘贴了代码。作为前缀分隔符的 _只是一个约定。最好不要假设甚至使用了分隔符。 70%是完全任意的阈值。

是的,这是python 2.7,引号内的空间只是一个视觉人工制品。

Edit. Additional background and clarification.
These are keys that other developers have put in an application to use for translation. They should have a common prefix, but people forget, and they cut and paste from other code. "_" as a prefix separator is just a convention. It would be best not to assume a separator is even used. 70% is a totally arbitrary threshold. "most prevalent" or "predominant" would have worked too.
And yes, this is python 2.7, and the space inside the quotes is just a visual artefact.

推荐答案

如果您知道通用前缀所需的阈值频率:

If you know the desired threshold frequency for the common prefix:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count < threshold:
        break
    prefix.append(char)
print(''.join(prefix))
# -> foo_

或者您可以收集所有通用前缀及其频率并稍后决定:

Or you could collect all common prefixes and their frequencies and decide later:

#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest

strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
    char, count = Counter(chars).most_common(1)[0]
    if count == 1:
        break
    elif count < threshold:
        if prefix:
            prefixes.append((''.join(prefix), threshold))
        threshold = count
    prefix.append(char)
if prefix:
    prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]

两个代码示例均假定主要前缀存在,即每个位置上最常见的字符都属于最常见前缀。

Both code examples assume that the predominant prefix exists i.e., the most common character at each position belongs to the most common prefix.

这篇关于查找字符串的*最*公共前缀-更好的方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆