删除字符串数组中重复项的最佳算法 [英] Best algorithm for delete duplicates in array of strings

查看:21
本文介绍了删除字符串数组中重复项的最佳算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

今天在学校,老师要求我们实现一个重复删除算法.没那么难,大家想出了如下解决方案(伪代码):

Today at school the teacher asked us to implement a duplicate-deletion algorithm. It's not that difficult, and everyone came up with the following solution (pseudocode):

for i from 1 to n - 1
    for j from i + 1 to n
        if v[i] == v[j] then remove(v, v[j])    // remove(from, what)
    next j
next i

此算法的计算复杂度为 n(n-1)/2.(我们在读高中,还没谈过big-O,不过好像是O(n^2)).这个解决方案看起来很丑,当然也很慢,所以我尝试更快地编写代码:

The computational complexity for this algo is n(n-1)/2. (We're in high school, and we haven't talked about big-O, but it seems to be O(n^2)). This solution appears ugly and, of course, slow, so I tried to code something faster:

procedure binarySearch(vector, element, *position)
    // this procedure searches for element in vector, returning
    // true if found, false otherwise. *position will contain the
    // element's place (where it is or where it should be)
end procedure

----

// same type as v
vS = new array[n]

for i from 1 to n - 1
    if binarySearch(vS, v[i], &p) = true then
        remove(v, v[i])
    else
        add(vS, v[i], p)      // adds v[i] in position p of array vS
    end if
next i

这样 vS 将包含我们已经传递的所有元素.如果元素 v[i] 在此数组中,则它是重复的并被删除.二进制搜索的计算复杂度是 log(n),而主循环(第二个片段)的计算复杂度是 n.因此,如果我没记错的话,整个 CC 是 n*log(n).

This way vS will contain all the elements we've already passed. If element v[i] is in this array, then it is a duplicate and is removed. The computational complexity for the binary search is log(n) and for the main loop (second snippet) is n. Therefore the whole CC is n*log(n) if I'm not mistaken.

然后我有了另一个关于使用二叉树的想法,但我无法放下它.
基本上我的问题是:

Then I had another idea about using a binary tree, but I can't put it down.
Basically my questions are:

  • 我的CC计算对吗?(如果不是,为什么?)
  • 有更快的方法吗?

谢谢

推荐答案

最简单的解决方案是简单地对数组进行排序(如果你可以使用它们,使用标准实现需要 O(n log n).否则考虑做一个简单的随机化快速排序(代码甚至在维基百科上)).

The easiest solution will be to simply sort the array (takes O(n log n) with standard implementation if you may use them. otherwise consider making an easy randomized quicksort (code is even on wikipedia)).

然后再扫描一次.在该扫描期间,简单地消除连续相同的元素.

Afterwards scan it for one additional time. During that scan simple eliminate consecutive identical elements.

如果你想在 O(n) 中完成,你也可以使用 HashSet 和你已经见过的元素.只需遍历您的数组一次,检查每个元素是否在您的 HashSet 中.

If you want to do it in O(n), you can also use a HashSet with elements you have already seen. Just iterate once over your array, for each element check if it is in your HashSet.

如果它不在那里,请添加它.如果它在那里,请将其从数组中删除.

If it isn't in there, add it. If it is in there, remove it from the array.

请注意,这将占用一些额外的内存,并且散列将有一个有助于您运行时的常数因素.虽然时间复杂度更好,但实际运行时间只有超过一定的数组大小才会更快

Note, that this will take some additional memory and the hashing will have a constant factor that contributes to your runtime. Althought the time complexity is better, the practical runtime will only be onyl be faster once you exceed a certain array size

这篇关于删除字符串数组中重复项的最佳算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆