哪个更快,对向量进行排序,然后将其放入AVL树中,还是直接输入? [英] Which is faster, sorting a vector, then putting it into an AVL tree, or inputting it directly?

查看:49
本文介绍了哪个更快,对向量进行排序,然后将其放入AVL树中,还是直接输入?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

情况是这样的:

我有数百万,可能是数十亿的字符串,我试图解析并放入排序结构中,假设我有 5,000,000 个字符串.我正在尝试编写一个快速程序,可以将所有这些字符串从一个未排序的向量放入一个有序的数据结构中,该数据结构也可以快速搜索该结构,从而推理 AVL 树(最终我计划使用哈希表az 进行更快的查找,但稍后会出现).我首先将所有字符串放入一个向量中,但它们都被打乱、未排序且长度不同.我不想在我的树中出现任何重复的字符串,所以如果程序找到字符串hello"和hello",它只会有一个 AVL 条目,并且会为该字符串出现的频率增加一个整数持有者.

I have millions, possibly billions, of strings that I am trying to parse and put into a sorted structure, lets say I have 5,000,000 strings. I'm trying to write a fast program that can put all of these strings from an unsorted vector into an ordered data structure that can also search the structure fast, thus the reasoning for the AVL tree (which eventually I plan to use a hash table of a-z for even faster lookup, but that comes later). I get all of the strings into a vector first, but they are all jumbled up, unsorted and different lengths. I don't want any repeated strings in my tree, so if the program finds the strings "hello" and "hello" it would only have one AVL entry and would increment an integer holder for the frequency that this string has appeared.

所以我的问题是:先对向量进行排序(使用像多线程快速排序或其他什么的快速排序)然后将其输入到 AVL 树中,在所有单词与其他词一起排序后,是否会更快相同的词,或者将未排序向量中的所有数据放入 AVL 树中,并不断检查 AVL 树中是否存在某个词,然后递增它会更快.

So my question is this: would it be faster to sort the vector first (using something fast like a multi-threaded quicksort or something else) and then input it into the AVL tree, after all the words are sorted together with other same words, OR is it faster to just put all the data from the unsorted vector into the AVL tree, and continously checking the AVL tree for whether or not a word already exists, then incrementing it.

所以按照操作顺序来描绘它是两种情况:

So to picture it in an order of operations here are the two cases:

CASE A:

> Get input/parse strings
> Put strings into vector (unsorted)
> Put vector into array or linked-list
> Quicksort that array/llist
> Input that sorted array into the AVL Tree

CASE B:

> Get input/parse strings
> Put strings into vector (unsorted)
> Insert vector data into AVL tree
> During insertion, check if there are duplicate words, if so, increment the counter

哪种情况更快??

--EDIT-- 所以在听了一些评论之后,从一开始就将一个排序的数组插入到 AVL 树中将是一个坏主意,这是有道理的,因为有多少旋转制成.似乎直接插入 AVL 树可能是一个好主意,但是当某个单词已经在树中的某处时,有效插入的最佳方法是什么?我怎样才能确保找到它?那是我可以进行排序的地方吗?

--EDIT-- So after hearing some of the comments, inserting a sorted array into an AVL tree from the beginning would be a bad idea, which makes sense due to how many rotations would be made. It seems that directly inserting into the AVL tree is probably a good idea, but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?

推荐答案

想想 AVL 树的平衡工作方式.如果中间值"在先,效果最好.对于已排序的输入,您将需要大量重新平衡,因此预排序可能弊大于利.

Think of the way balancing works for AVL trees. It works best if the "middle values" come first. With a sorted input, you will need a lot of re-balancing, thus pre-sorting will probably do more harm than good.

例如,考虑以下包含值 1-6 的 AVL 树:

For example, consider the following AVL tree holding the values 1-6:

    4
   / \
  2   5
 / \   \
1   3   6

如果输入顺序是 4, 2, 5, 1, 3, 6,您将永远不需要平衡树.相反,对于已排序的输入 1, 2, 3, 4, 5, 6,您将需要许多重新平衡操作:

If the input order is 4, 2, 5, 1, 3, 6, you'll never need to balance the tree. In contrast, for a sorted input 1, 2, 3, 4, 5, 6, you'll need many re-balancing operations:

  1     +3     2     +4     2       +5     2       +6       3
   \   --->   / \   --->   / \     --->   / \     --->     / \
    2        1   3        1   3          1   4            2   5
                               \            / \          /   / \
                                4          3   5        1   4   6

<小时>

UPDATE 最初的问题是在插入 AVL 树之前对数据进行排序是否会提高性能.现在 OP 编辑​​了问题,转向他的具体问题.


UPDATE The original question was whether sorting data before inserting into an AVL tree would improve performance. Now the OP edited the question, shifting towards his specific problem.

但是当某个单词已经在树中的某处时,有效插入的最佳方法是什么?我怎样才能确保找到它?那是我可以进行排序的地方吗?

but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?

AVL 树的全部意义在于有效地查找数据,所以我不明白这个问题.如何遍历二叉搜索树找到一个值应该是显而易见的.为什么要为此对数据进行排序?

The whole point of an AVL tree is to efficiently find data, so I don't understand the question. It should be obvious how to traverse the binary search tree to find a value. Why would you want to sort data for that?

请注意,二叉搜索树是一种很好的存储的数据结构,但它也可以管理与这些键相关联的任意数据.在您的情况下,您希望将计数与密钥一起存储.因此,您不需要单词/字符串树,而是表示单词及其计数的对(字符串、整数)树.对于树序,只考虑字符串键,即单词.

Please note that binary search trees are a good data structure to store keys, but it can also manage arbitrary data associated with these keys. In your case, you want to store a count along with your keys. Therefore, you don't need a tree of words/strings, but a tree of pairs (string, integer) that represent the word and its count. For the tree order, just consider the string key, i.e. the word.

对于要插入的每个单词,在树中查找.如果它已经存在,请更新字数.否则,插入一个字数为 1 的新对.

For each word to insert, look it up in the tree. If it already exists, update the word count. Otherwise, insert a new pair with a word count of one.

最后一点:C++ 标准库带有一个 map 类型,它通常(总是?)使用平衡树(AVL 或红黑)实现.只需使用此实现,您就可以为自己节省大量工作和错误修复.由于 C++11 也有 un unordered_map,通常(总是?)使用哈希表实现.

A final note: The C++ standard library comes with a map type that is usually (always?) implemented using a balancing tree (AVL or red-black). You'll save yourself a lot of work and bug-fixing by just using this implementation. Since C++11 there is also un unordered_map, usually (always?) implemented using a hash table.

这篇关于哪个更快,对向量进行排序,然后将其放入AVL树中,还是直接输入?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆