获取一个包含字符串元素的列表,该字符串元素不包括初始列表中以任何其他元素为前缀的元素 [英] Obtain a list containing string elements excluding elements prefixed with any other element from initial list
问题描述
我在过滤字符串列表时遇到了一些麻烦.我在此处找到了类似的问题,但这不是我所需要的. /p>
输入列表为:
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
,预期结果是
['ab', 'xc', 'sdfdg']
结果中项目的顺序并不重要
过滤功能必须快速,因为列表很大
我当前的解决方案是
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
print list(set(l))
编辑
在对具有大输入数据的多个测试之后,列出了1500000个字符串,对此的最佳解决方案是:
def filter(l):
if l==[]:
return []
l2=[]
l2.append(l[0])
llen = len(l)
k=0
itter = 0
while k<llen:
addkelem = ''
j=0
l2len = len(l2)
while j<l2len:
if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
l2[j]=l[k]
l.remove(l[k])
llen-=1
j-=1
addkelem = ''
continue
if (l[k].startswith(l2[j])):
addkelem = ''
break
elif(l[k] not in l2):
addkelem = l[k]
j+=1
if addkelem != '':
l2.append(addkelem)
addkelem = ''
k+=1
return l2
执行时间约为213秒
采样输入数据-每行都是列表中的字符串
此算法使用输入在我的计算机上完成0.97秒的任务作者提交的文件(154MB):
l.sort()
last_str = l[0]
filtered = [last_str]
app = filtered.append
for str in l:
if not str.startswith(last_str):
last_str = str
app(str)
# Commented because of the massive amount of data to print.
# print filtered
算法很简单:首先按字典顺序对列表进行排序,然后搜索第一个没有以列表的第一个为前缀的字符串,然后搜索没有以最后一个没有前缀的为前缀的下一个字符串,等
如果列表已经排序(您的示例文件似乎已经排序),则可以删除l.sort()
行,这将导致时间和内存复杂度为O(n).
I have some trouble with filtering a list of strings. I found a similar question here but is not what i need.
The input list is:
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
and the expected result is
['ab', 'xc', 'sdfdg']
The order of the items in the result is not important
The filter function must be fast because the size of list is big
My current solution is
l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
for j in range(i + 1, len(l)):
if l[j].startswith(l[i]):
l[j] = l[i]
else:
if l[i].startswith(l[j]):
l[i] = l[j]
print list(set(l))
EDIT
After multiple tests with a big input data, list with 1500000 strings, my best solution for this is:
def filter(l):
if l==[]:
return []
l2=[]
l2.append(l[0])
llen = len(l)
k=0
itter = 0
while k<llen:
addkelem = ''
j=0
l2len = len(l2)
while j<l2len:
if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
l2[j]=l[k]
l.remove(l[k])
llen-=1
j-=1
addkelem = ''
continue
if (l[k].startswith(l2[j])):
addkelem = ''
break
elif(l[k] not in l2):
addkelem = l[k]
j+=1
if addkelem != '':
l2.append(addkelem)
addkelem = ''
k+=1
return l2
for which the execution time is around of 213 seconds
Sample imput data - each line is a string in list
This algorithm completes the task in 0.97 second on my computer, with the input file submitted by the author (154MB):
l.sort()
last_str = l[0]
filtered = [last_str]
app = filtered.append
for str in l:
if not str.startswith(last_str):
last_str = str
app(str)
# Commented because of the massive amount of data to print.
# print filtered
The algorithm is simple: first sort the list lexicographically, then search for the first string which isn't prefixed by the very first one of the list, then search the next one which isn't prefixed by the last unprefixed one, etc.
If the list is already sorted (your example file seems to be already sorted), you can remove the l.sort()
line, which will result in a O(n) complexity in both time and memory.
这篇关于获取一个包含字符串元素的列表,该字符串元素不包括初始列表中以任何其他元素为前缀的元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!