Could I use a faster data structure than a tree for this?

Problem description

I have a binary decision tree. It takes its inputs as an array of floats, and each branch node splits on an input index and a value, eventually leading to a leaf.

I'm performing a massive number of lookups on this tree (about 17% of execution time according to profiling; edit: after optimising other areas, it's now almost 40%), and I'm wondering whether I could or should be using a different data structure to improve lookup speed.

Some kind of hash table can't be used, as inputs do not map directly to a leaf node, but I was wondering if anyone had any suggestions as to methods and data structures I could use in place of the tree (or as well as?) to improve lookup speeds.

Memory is a concern, but less of a concern than speed.

Code is currently written in C#, but obviously any method could be applied.

Edit: There's a bit too much code to post, but I'll give more detail about the tree.

The tree is generated using information gain calculations. It's not always a 50/50 split, and the split value could be any float value. A single input could also be split multiple times, increasing the resolution on that input.

I posted a question about performance of the iterator here:

Micro optimisations iterating through a tree in C#

But I think I might need to look at the data structure itself to improve performance further.

I'm aiming for as much performance as possible here. I'm working on a new method of machine learning, and the tree grows itself using a feedback loop. For the process I'm working on, I estimate it'll be running for several months, so a few % saving here and there is massive. The ultimate goal is speed without using too much memory.

Solution

If I understand correctly, you have floating point ranges that have to be mapped to a decision. Something like this:

       x <= 0.0      : Decision A
 0.0 < x <= 0.5      : Decision B
 0.5 < x <= 0.6      : Decision C
 0.6 < x             : Decision D

A binary tree is a pretty good way to handle that. As long as the tree is well balanced and the input values are evenly distributed across the ranges, you can expect O(log2 n) comparisons, where n is the number of possible decisions.
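For reference, a lookup in the kind of tree the question describes (each branch tests one input index against a split value) might be sketched like this. Python is used only for brevity. The parallel-list layout and the names `FEATURE`, `THRESHOLD`, `LEFT`, `RIGHT`, `LEAF` and `lookup` are assumptions for illustration, not the asker's actual C# code.

```python
# Hypothetical tree stored as parallel lists, one entry per node.
# FEATURE[i] == -1 marks a leaf; LEAF[i] then holds its decision.
FEATURE   = [0,    1,    -1,  -1,  -1]
THRESHOLD = [0.5,  0.2,  0.0, 0.0, 0.0]
LEFT      = [1,    2,    -1,  -1,  -1]
RIGHT     = [4,    3,    -1,  -1,  -1]
LEAF      = [None, None, "A", "B", "C"]


def lookup(inputs):
    """Walk from the root (node 0) to a leaf.

    Returns the decision and the number of comparisons performed.
    """
    node = 0
    comparisons = 0
    while FEATURE[node] != -1:
        comparisons += 1
        if inputs[FEATURE[node]] <= THRESHOLD[node]:
            node = LEFT[node]
        else:
            node = RIGHT[node]
    return LEAF[node], comparisons
```

The number of comparisons equals the depth of the leaf that is reached, which is why the shape of the tree matters for lookup cost.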

If the tree is not balanced, then you could be doing far more comparisons than necessary. In the worst case: O(n). So I would look at the trees and see how deep they are. If the same tree is used again and again, then the cost spent rebalancing once may be amortized over many lookups.
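One way to check how deep the trees are is to collect the depth of every leaf and compare the maximum with the depth a perfectly balanced tree would need. The sketch below assumes a hypothetical nested-tuple representation, `(input_index, split_value, left, right)` for branches and a plain string for leaves; it is written in Python for brevity and is not the asker's structure.

```python
import math


def leaf_depths(node, depth=0):
    """Return a list with the depth of every leaf below `node`."""
    if not isinstance(node, tuple):      # a leaf (decision label)
        return [depth]
    _index, _split, left, right = node
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


def depth_report(tree):
    """Return (actual maximum depth, depth of a balanced tree with as many leaves)."""
    depths = leaf_depths(tree)
    return max(depths), math.ceil(math.log2(len(depths)))


# Two hypothetical trees over the same four decisions.
CHAIN = (0, 0.0, "A", (0, 0.5, "B", (0, 0.6, "C", "D")))
BALANCED = (0, 0.5, (0, 0.0, "A", "B"), (0, 0.6, "C", "D"))
```

Here `depth_report(CHAIN)` reports a maximum depth larger than the balanced ideal, while `depth_report(BALANCED)` reports the two as equal. For very deep trees, an explicit stack would avoid Python's recursion limit.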

If the input values are not evenly distributed (and you know that ahead of time), then you might want to special-case the order of the comparisons so that the most common cases are detected early. You can do this by manipulating the tree or by adding special cases in the code before actually checking the tree.
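To judge whether reordering is worthwhile, the expected number of comparisons per lookup can be estimated by weighting each leaf's depth by how often it is reached. The hit rates below are made-up numbers for illustration only, and the nested-tuple tree layout `(input_index, split_value, left, right)` is an assumption (Python, for brevity).

```python
def expected_comparisons(node, hit_rate, depth=0):
    """Sum of (probability of reaching a leaf) * (depth of that leaf)."""
    if not isinstance(node, tuple):      # a leaf (decision label)
        return hit_rate[node] * depth
    _index, _split, left, right = node
    return (expected_comparisons(left, hit_rate, depth + 1)
            + expected_comparisons(right, hit_rate, depth + 1))


# Hypothetical skewed workload: decision A dominates.
HIT_RATE = {"A": 0.85, "B": 0.05, "C": 0.05, "D": 0.05}

# Tree that tests for the common case first, then the rest in a chain.
COMMON_FIRST = (0, 0.0, "A", (0, 0.5, "B", (0, 0.6, "C", "D")))
# Balanced tree over the same decisions.
BALANCED = (0, 0.5, (0, 0.0, "A", "B"), (0, 0.6, "C", "D"))
```

With these assumed rates, the deliberately unbalanced `COMMON_FIRST` tree averages about 1.25 comparisons per lookup against 2 for `BALANCED`, so balance alone is not the right target when the input distribution is skewed.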

If you've exhausted algorithmic improvements and you still need to optimize, you might look into a data structure with better locality than a general binary tree. For example, you could put the partition boundaries into a contiguous array and perform a binary search on it. (And, if the array isn't too long, you might even try a linear search on the array as it may be friendlier for the cache and the branch prediction.)
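As a minimal sketch of the contiguous-array idea, using the four example ranges from the start of this answer: the boundaries go into one sorted list and the decisions into another. Python's standard `bisect_left` stands in for the binary search; in C# the same layout would be a `float[]` with a hand-written or library binary search. The names `BOUNDARIES`, `DECISIONS`, `decide_binary` and `decide_linear` are illustrative.

```python
from bisect import bisect_left

# x <= 0.0 -> A, 0.0 < x <= 0.5 -> B, 0.5 < x <= 0.6 -> C, 0.6 < x -> D
BOUNDARIES = [0.0, 0.5, 0.6]
DECISIONS = ["A", "B", "C", "D"]


def decide_binary(x):
    """Binary search: index of the first boundary that is >= x."""
    return DECISIONS[bisect_left(BOUNDARIES, x)]


def decide_linear(x):
    """Linear scan over the same array, an option when it is short."""
    for i, bound in enumerate(BOUNDARIES):
        if x <= bound:
            return DECISIONS[i]
    return DECISIONS[-1]
```

Both functions return the same decision for every input. Which one is faster for a given array length is something to measure rather than assume.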

Lastly, I'd consider building a coarse index that gives us a head start into the tree (or array). For example, use a few of the most significant bits of the input value as an index and see if that can cut off the first few layers of the tree. This may help more than you might imagine, as the skipped comparisons probably have a low chance of getting correct branch predictions.
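A rough sketch of such a coarse index over a sorted boundary array follows. It assumes, purely for illustration, that most inputs lie in [0, 1) and uses `int(x * BUCKETS)` as a stand-in for taking the top bits of the value. Each bucket stores the narrow slice of the boundary array that still has to be searched. All names are hypothetical, and Python is used only for brevity.

```python
from bisect import bisect_left

BOUNDARIES = [0.0, 0.5, 0.6]
DECISIONS = ["A", "B", "C", "D"]
BUCKETS = 8  # assumed power of two, so x * BUCKETS is exact for floats


def build_index():
    """For each bucket, precompute the [lo, hi] slice the answer must lie in."""
    index = []
    for b in range(BUCKETS):
        lo = bisect_left(BOUNDARIES, b / BUCKETS)
        hi = bisect_left(BOUNDARIES, (b + 1) / BUCKETS)
        index.append((lo, hi))
    return index


INDEX = build_index()


def decide_indexed(x):
    """Jump to the bucket's slice, then search only within it."""
    if 0.0 <= x < 1.0:
        lo, hi = INDEX[int(x * BUCKETS)]
        if lo == hi:                 # whole bucket maps to one decision
            return DECISIONS[lo]
        return DECISIONS[bisect_left(BOUNDARIES, x, lo, hi)]
    # Inputs outside the assumed range fall back to a full search.
    return DECISIONS[bisect_left(BOUNDARIES, x)]
```

Buckets that contain no boundary resolve without any comparison against the array, which is the "skipping the first few layers" effect described above. The bucket count trades memory for fewer comparisons.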
