Python string pattern recognition/compression
Question
I can do basic regex all right, but this is slightly different: I don't know in advance what the pattern is going to be.
For example, I have a list of similar strings:
lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
In this case the common pattern is two segments of common text, 'sometxt' and 'moretxt', preceded and separated by something else that is variable in length.
The common strings and variable strings can of course occur in any order and any number of times.
What would be a good way to condense/compress the list of strings into their common parts and individual variations?
Example output might be:
c = ['sometxt', 'moretxt']
v = [('a','0'), ('b','1'), ('aa','10'), ('zz','999')]
Recommended answer
This solution finds the two longest common substrings and uses them to delimit the input strings:
def an_answer_to_stackoverflow_question_1914394(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> an_answer_to_stackoverflow_question_1914394(lst)
    (['sometxt', 'moretxt'], [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')])
    """
    delimiters = find_delimiters(lst)
    return delimiters, list(split_strings(lst, delimiters))
find_delimiters and friends find the delimiters:
import itertools

def find_delimiters(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> find_delimiters(lst)
    ['sometxt', 'moretxt']
    """
    # Take the three longest common substrings; the third is only used to
    # detect a tie for second place.
    candidates = list(itertools.islice(find_longest_common_substrings(lst), 3))
    if len(candidates) < 2:
        # Fewer than two common substrings: no delimiter pair to return.
        raise ValueError("Unable to find useful delimiters")
    if len(candidates) == 3 and len(candidates[1]) == len(candidates[2]):
        raise ValueError("Unable to find useful delimiters")
    if candidates[1] in candidates[0]:
        raise ValueError("Unable to find useful delimiters")
    return candidates[0:2]
def find_longest_common_substrings(lst):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(itertools.islice(find_longest_common_substrings(lst), 3))
    ['sometxt', 'moretxt', 'sometx']
    """
    # range replaces the Python 2-only xrange; candidates are tried longest first.
    for i in range(min_length(lst), 0, -1):
        for substring in common_substrings(lst, i):
            yield substring

def min_length(lst):
    return min(len(item) for item in lst)
def common_substrings(lst, length):
    """
    >>> list(common_substrings(["hello", "world"], 2))
    []
    >>> list(common_substrings(["aabbcc", "dbbrra"], 2))
    ['bb']
    """
    assert length <= min_length(lst)
    returned = set()
    for i, item in enumerate(lst):
        for substring in all_substrings(item, length):
            in_all_others = True
            for j, other_item in enumerate(lst):
                if j == i:
                    continue
                if substring not in other_item:
                    in_all_others = False
            if in_all_others:
                if substring not in returned:
                    returned.add(substring)
                    yield substring
def all_substrings(item, length):
    """
    >>> list(all_substrings("hello", 2))
    ['he', 'el', 'll', 'lo']
    """
    for i in range(len(item) - length + 1):
        yield item[i:i+length]
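As a side note, any substring common to every string must in particular occur in the shortest one, so it should be enough to enumerate candidates from that string alone. The following is only a sketch of that idea, not part of the original answer; the name longest_common_substrings is hypothetical, and it assumes a non-empty list of strings.

```python
def longest_common_substrings(strings):
    """Yield substrings common to all strings, longest first, each once."""
    # Hypothetical helper: candidates come only from the shortest string,
    # since a common substring must occur there too.
    shortest = min(strings, key=len)
    seen = set()
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if candidate in seen:
                continue
            if all(candidate in s for s in strings):
                seen.add(candidate)
                yield candidate


lst = ['asometxt0moretxt', 'bsometxt1moretxt',
       'aasometxt10moretxt', 'zzsometxt999moretxt']
gen = longest_common_substrings(lst)
print(next(gen), next(gen))  # sometxt moretxt
```

This is still a brute-force search, so it is likely to be slow for long strings; it only avoids scanning the substrings of every input.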
split_strings splits the strings using the delimiters:
import re

def split_strings(lst, delimiters):
    """
    >>> lst = ['asometxt0moretxt', 'bsometxt1moretxt', 'aasometxt10moretxt', 'zzsometxt999moretxt']
    >>> list(split_strings(lst, find_delimiters(lst)))
    [('a', '0'), ('b', '1'), ('aa', '10'), ('zz', '999')]
    """
    # Escape the delimiters so any regex metacharacters in them match literally.
    pattern = "|".join(re.escape(delimiter) for delimiter in delimiters)
    for item in lst:
        parts = re.split(pattern, item)
        yield tuple(part for part in parts if part != '')
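One thing to be aware of: because empty parts are dropped, a string with nothing before the first delimiter produces a shorter tuple than the others. Below is a sketch of a possible variant that keeps the positions aligned by using capture groups. It is not part of the original answer; split_aligned is a hypothetical name, and it assumes every delimiter occurs in each string in the given order.

```python
import re


def split_aligned(strings, delimiters):
    """Yield one tuple per string, with a slot before, between and after
    the delimiters (empty slots are kept as '')."""
    # Hypothetical helper: builds e.g. (.*?)sometxt(.*?)moretxt(.*)
    pattern = re.compile(
        "(.*?)" + "(.*?)".join(re.escape(d) for d in delimiters) + "(.*)",
        re.DOTALL,
    )
    for s in strings:
        match = pattern.fullmatch(s)
        if match is None:
            raise ValueError("delimiters not found in order in %r" % s)
        yield match.groups()


print(list(split_aligned(['asometxt0moretxt', 'sometxt1moretxt'],
                         ['sometxt', 'moretxt'])))
# [('a', '0', ''), ('', '1', '')]
```

With every slot kept, the original string can be rebuilt by interleaving the tuple with the delimiters, which the filtered output does not guarantee.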