过滤字符串列表,忽略其他项的子字符串 [英] Filter list of strings, ignoring substrings of other items

查看:121
本文介绍了过滤字符串列表,忽略其他项的子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何过滤包含字符串和子字符串的列表,以仅返回最长的字符串。 (如果列表中的任何项目是另一个的子字符串,则仅返回较长的字符串。)

How can I filter through a list which containing strings and substrings to return only the longest strings. (If any item in the list is a substring of another, return only the longer string.)

我有此功能。有没有更快的方法?

I have this function. Is there a faster way?

def filterSublist(lst):
    uniq = lst
    for elem in lst:
        uniq = [x for x in uniq if (x == elem) or (x not in elem)]
    return uniq

lst = ["a", "abc", "b", "d", "xy", "xyz"]
print filterSublist(lst)

> ['abc', 'd', 'xyz']
> Function time: 0.000011


推荐答案

一个简单的二次时间解是

A simple quadratic time solution would be this:

res = []
n = len(lst)
for i in xrange(n):
    if not any(i != j and lst[i] in lst[j] for j in xrange(n)):
        res.append(lst[i])

但是我们可以做得更好:

But we can do much better:

$ 是一个不会出现在您的任何字符串中且字符值低于所有实际字符的字符。

Let $ be a character that does not appear in any of your strings and has a lower value than all your actual characters.

S 是所有字符串的串联,两者之间是 $ 。在您的示例中, S = a $ abc $ b $ d $ xy $ xyz

Let S be the concatenation of all your strings, with $ in between. In your example, S = a$abc$b$d$xy$xyz.

您可以构建 S 后缀数组 。您还可以使用我描述的更为简单的O(n log ^ 2 n)构造算法。

You can build the suffix array of S in linear time. You can also use a much simpler O(n log^2 n) construction algorithm that I described in another answer.

现在,对于 lst 中的每个字符串,检查它是否出现在后缀中精确地排列一次。您可以进行两个二进制搜索来找到子字符串的位置,它们在后缀数组中形成一个连续范围。如果字符串多次出现,则将其删除。

Now for every string in lst, check if it occurs in the suffix array exactly once. You can do two binary searches to find the locations of the substring, they form a contiguous range in the suffix array. If the string occurs more than once, you remove it.

在预先计算LCP信息的情况下,也可以在线性时间内完成此操作。

With LCP information precomputed, this can be done in linear time as well.

示例O(n log ^ 2 n)的实现,改编自我的后缀数组答案

Example O(n log^2 n) implementation, adapted from my suffix array answer:

def findFirst(lo, hi, pred):
  """ Find the first i in range(lo, hi) with pred(i) == True.
  Requires pred to be a monotone. If there is no such i, return hi. """
  while lo < hi:
    mid = (lo + hi) // 2
    if pred(mid): hi = mid;
    else: lo = mid + 1
  return lo

# uses the algorithm described in https://stackoverflow.com/a/21342145/916657
class SuffixArray(object):
  def __init__(self, s):
    """ build the suffix array of s in O(n log^2 n) where n = len(s). """
    n = len(s)
    log2 = 0
    while (1<<log2) < n:
      log2 += 1
    rank = [[0]*n for _ in xrange(log2)]
    for i in xrange(n):
      rank[0][i] = s[i]
    L = [0]*n
    for step in xrange(1, log2):
      length = 1 << step
      for i in xrange(n):
        L[i] = (rank[step - 1][i],
                rank[step - 1][i + length // 2] if i + length // 2 < n else -1,
                i)
      L.sort()
      for i in xrange(n):
        rank[step][L[i][2]] = \
          rank[step][L[i - 1][2]] if i > 0 and L[i][:2] == L[i-1][:2] else i
    self.log2 = log2
    self.rank = rank
    self.sa = [l[2] for l in L]
    self.s = s
    self.rev = [0]*n
    for i, j in enumerate(self.sa):
      self.rev[j] = i

  def lcp(self, x, y):
    """ compute the longest common prefix of s[x:] and s[y:] in O(log n). """
    n = len(self.s)
    if x == y:
      return n - x
    ret = 0
    for k in xrange(self.log2 - 1, -1, -1):
      if x >= n or y >= n:
        break
      if self.rank[k][x] == self.rank[k][y]:
        x += 1<<k
        y += 1<<k
        ret += 1<<k
    return ret

  def compareSubstrings(self, x, lx, y, ly):
    """ compare substrings s[x:x+lx] and s[y:y+yl] in O(log n). """
    l = min((self.lcp(x, y), lx, ly))
    if l == lx == ly: return 0
    if l == lx: return -1
    if l == ly: return 1
    return cmp(self.s[x + l], self.s[y + l])

  def count(self, x, l):
    """ count occurences of substring s[x:x+l] in O(log n). """
    n = len(self.s)
    cs = self.compareSubstrings
    lo = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) >= 0)
    hi = findFirst(0, n, lambda i: cs(self.sa[i], min(l, n - self.sa[i]), x, l) > 0)
    return hi - lo

  def debug(self):
    """ print the suffix array for debugging purposes. """
    for i, j in enumerate(self.sa):
      print str(i).ljust(4), self.s[j:], self.lcp(self.sa[i], self.sa[i-1]) if i >0 else "n/a"

def filterSublist(lst):
  splitter = "\x00"
  s = splitter.join(lst) + splitter
  sa = SuffixArray(s)
  res = []
  offset = 0
  for x in lst:
    if sa.count(offset, len(x)) == 1:
      res.append(x)
    offset += len(x) + 1
  return res

但是,除非,否则解释开销可能会使其比O(n ^ 2)方法慢。 S 确实很大(大约10 ^ 5个字符或更多)。

However, the interpretation overhead likely causes this to be slower than the O(n^2) approaches unless S is really large (in the order of 10^5 characters or more).

这篇关于过滤字符串列表,忽略其他项的子字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆