删除列表中的子字符串,其复杂度优于O(n ^ 2) [英] Remove substrings inside a list with better than O(n^2) complexity

查看:55
本文介绍了删除列表中的子字符串,其复杂度优于O(n ^ 2)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含很多单词(超过100.000个)的列表,我想做的就是删除列表中每个单词的所有子字符串.

I have a list with many words (100.000+), and what I'd like to do is remove all the substrings of every word in the list.

为简单起见,让我们假设我有以下列表:

So for simplicity, let's imagine that I have the following list:

words = ['Hello', 'Hell', 'Apple', 'Banana', 'Ban', 'Peter', 'P', 'e']

需要以下输出:

['Hello', 'Apple', 'Banana', 'Peter']

  • 'Hell'已被删除,因为它是'Hello'
  • 的子字符串
  • 'Ban'已被删除,因为它是'Banana'
  • 的子字符串
  • 'P'已被删除,因为它是'Peter'
  • 的子字符串
  • 'e'已被删除,因为它是'Hello''Hell''Apple',依此类推.
    • 'Hell' was removed because it is a substring of 'Hello'
    • 'Ban' was removed because it is a substring of 'Banana'
    • 'P' was removed because it is a substring of 'Peter'
    • 'e' was removed because it is a substring of 'Hello', 'Hell', 'Apple', and so on.
    • 我做了什么

      这是我的代码,但是我想知道是否有比这些嵌套理解更有效的方法.

      This is my code, but I am wondering if there is a more efficient way than these nested comprehensions.

      to_remove = [x for x in words for y in words if x != y and x in y]
      output = [x for x in words if x not in to_remove]
      

      如何提高性能?我应该改用regex吗?

      How can I improve the performance? Should I use regex instead?

      推荐答案

      @wim是正确的.

      给定一个固定长度的字母,以下算法在文本的总长度中是线性的.如果字母的大小不受限制,那么它将改为O(n log(n)).无论哪种方式,它都比O(n^2)更好.

      Given an alphabet of fixed length, the following algorithm is linear in the overall length of text. If the alphabet is of unbounded size, then it will be O(n log(n)) instead. Either way it is better than O(n^2).

      Create an empty suffix tree T.
      Create an empty list filtered_words
      For word in words:
          if word not in T:
              Build suffix tree S for word (using Ukkonen's algorithm)
              Merge S into T
              append word to filtered_words
      

      这篇关于删除列表中的子字符串,其复杂度优于O(n ^ 2)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆