获取一个包含字符串元素的列表,该字符串元素不包括初始列表中以任何其他元素为前缀的元素 [英] Obtain a list containing string elements excluding elements prefixed with any other element from initial list

查看:100
本文介绍了获取一个包含字符串元素的列表,该字符串元素不包括初始列表中以任何其他元素为前缀的元素的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在过滤字符串列表时遇到了一些麻烦.我在此处找到了类似的问题,但这不是我所需要的. /p>

输入列表为:

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']

,预期结果是

['ab', 'xc', 'sdfdg']

结果中项目的顺序并不重要

过滤功能必须快速,因为列表很大

我当前的解决方案是

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
    for j in range(i + 1, len(l)):
        if l[j].startswith(l[i]):
            l[j] = l[i]
        else:
            if l[i].startswith(l[j]):
                l[i] = l[j]

print list(set(l)) 

编辑

在对具有大输入数据的多个测试之后,列出了1500000个字符串,对此的最佳解决方案是:

def filter(l):
    if l==[]:
        return []
    l2=[]
    l2.append(l[0])
    llen = len(l)
    k=0
    itter = 0
    while k<llen:
        addkelem = ''
        j=0
        l2len = len(l2)
        while j<l2len:
            if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
                l2[j]=l[k]
                l.remove(l[k])
                llen-=1
                j-=1
                addkelem = ''
                continue
            if (l[k].startswith(l2[j])):
                addkelem = ''
                break
            elif(l[k] not in l2):
                addkelem = l[k]
            j+=1
        if addkelem != '':
            l2.append(addkelem)
            addkelem = ''
        k+=1
    return l2

执行时间约为213秒

采样输入数据-每行都是列表中的字符串

解决方案

此算法使用输入在我的计算机上完成0.97秒的任务作者提交的文件(154MB):

l.sort()

last_str = l[0]
filtered = [last_str]
app      = filtered.append

for str in l:
    if not str.startswith(last_str):
        last_str = str
        app(str)

# Commented because of the massive amount of data to print.
# print filtered

算法很简单:首先按字典顺序对列表进行排序,然后搜索第一个没有以列表的第一个为前缀的字符串,然后搜索没有以最后一个没有前缀的为前缀的下一个字符串,等

如果列表已经排序(您的示例文件似乎已经排序),则可以删除l.sort()行,这将导致时间和内存复杂度为O(n).

I have some trouble with filtering a list of strings. I found a similar question here but is not what i need.

The input list is:

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']

and the expected result is

['ab', 'xc', 'sdfdg']

The order of the items in the result is not important

The filter function must be fast because the size of list is big

My current solution is

l = ['ab', 'xc', 'abb', 'abed', 'sdfdg', 'abfdsdg', 'xccc']
for i in range(0, len(l) - 1):
    for j in range(i + 1, len(l)):
        if l[j].startswith(l[i]):
            l[j] = l[i]
        else:
            if l[i].startswith(l[j]):
                l[i] = l[j]

print list(set(l)) 

EDIT

After multiple tests with a big input data, list with 1500000 strings, my best solution for this is:

def filter(l):
    if l==[]:
        return []
    l2=[]
    l2.append(l[0])
    llen = len(l)
    k=0
    itter = 0
    while k<llen:
        addkelem = ''
        j=0
        l2len = len(l2)
        while j<l2len:
            if (l2[j].startswith(l[k]) and l[k]!= l2[j]):
                l2[j]=l[k]
                l.remove(l[k])
                llen-=1
                j-=1
                addkelem = ''
                continue
            if (l[k].startswith(l2[j])):
                addkelem = ''
                break
            elif(l[k] not in l2):
                addkelem = l[k]
            j+=1
        if addkelem != '':
            l2.append(addkelem)
            addkelem = ''
        k+=1
    return l2

for which the execution time is around of 213 seconds

Sample imput data - each line is a string in list

解决方案

This algorithm completes the task in 0.97 second on my computer, with the input file submitted by the author (154MB):

l.sort()

last_str = l[0]
filtered = [last_str]
app      = filtered.append

for str in l:
    if not str.startswith(last_str):
        last_str = str
        app(str)

# Commented because of the massive amount of data to print.
# print filtered

The algorithm is simple: first sort the list lexicographically, then search for the first string which isn't prefixed by the very first one of the list, then search the next one which isn't prefixed by the last unprefixed one, etc.

If the list is already sorted (your example file seems to be already sorted), you can remove the l.sort() line, which will result in a O(n) complexity in both time and memory.

这篇关于获取一个包含字符串元素的列表,该字符串元素不包括初始列表中以任何其他元素为前缀的元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
相关文章
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆