查找字符串的*最*公共前缀-更好的方法? [英] Find *most* common prefix of strings - a better way?
问题描述
我有一个键列表 ['foo_a','foo_b','foo_c','fnord']
此处所有类似的解决方案都假定您的文本中没有 fnord
。
All of the similar solutions here assume that you have no fnord
's in your text.
我有执行此工作的这段代码:
I have this code that does the job:
def detect_prefix(keys):
PCT = 0.70 # cutof
pre = ''
l = len(keys)
for i in range(0, len(max(keys, key=len))):
keys = filter(lambda k: k.startswith(pre), keys)
cnt = dict()
for k in map(lambda k: k[i], keys):
cnt.setdefault(k,0)
cnt[k] +=1
if cnt[max(cnt)] / float(l) >= PCT:
pre += max(cnt)
else:
break
return pre
我有一个强怀疑可以更优雅地完成此操作,但是我的python-fu目前不够强大。
I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.
我很想听听一些建议。
修改。
其他背景和说明。
这些是其他开发人员在应用程序中放入的用于翻译的键。
他们应该有一个通用的前缀,但是人们忘记了,他们从其他代码中剪切并粘贴了代码。作为前缀分隔符的 _只是一个约定。最好不要假设甚至使用了分隔符。 70%是完全任意的阈值。
是的,这是python 2.7,引号内的空间只是一个视觉人工制品。
Edit.
Additional background and clarification.
These are keys that other developers have put in an application to use for translation.
They should have a common prefix, but people forget, and they cut and paste from other code. "_" as a prefix separator is just a convention. It would be best not to assume a separator is even used. 70% is a totally arbitrary threshold. "most prevalent" or "predominant" would have worked too.
And yes, this is python 2.7, and the space inside the quotes is just a visual artefact.
推荐答案
如果您知道通用前缀所需的阈值频率:
If you know the desired threshold frequency for the common prefix:
#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest
strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
char, count = Counter(chars).most_common(1)[0]
if count < threshold:
break
prefix.append(char)
print(''.join(prefix))
# -> foo_
或者您可以收集所有通用前缀及其频率并稍后决定:
Or you could collect all common prefixes and their frequencies and decide later:
#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest
strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
char, count = Counter(chars).most_common(1)[0]
if count == 1:
break
elif count < threshold:
if prefix:
prefixes.append((''.join(prefix), threshold))
threshold = count
prefix.append(char)
if prefix:
prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]
两个代码示例均假定主要前缀存在,即每个位置上最常见的字符都属于最常见前缀。
Both code examples assume that the predominant prefix exists i.e., the most common character at each position belongs to the most common prefix.
这篇关于查找字符串的*最*公共前缀-更好的方法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!