Titan 如何使用 HBase/Cassandra 实现恒定时间查找? [英] How does Titan achieve constant time lookup using HBase / Cassandra?

查看:36
本文介绍了Titan 如何使用 HBase/Cassandra 实现恒定时间查找?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 O'Reilly 的书Graph Databases"的第 6 章中,关于 Neo4j 如何存储图形数据库,它说:

In the O'Reilly book "Graph Databases" in chapter 6, which is about how Neo4j stores a graph database it says:

理解为什么原生图处理效率如此之高与基于重索引的图表相比,请考虑以下内容.根据实现,索引查找的算法复杂度可能是 O(log n),而查找直接关系的则是 O(1).要遍历 m 步的网络,索引方法的成本为O(m log n),使使用 O(m) 的实现的成本相形见绌无索引邻接.

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

然后说明 Neo4j 通过将所有节点和关系存储为固定大小的记录来实现这种恒定时间查找:

It is then explained that Neo4j achieves this constant time lookup by storing all nodes and relationships as fixed size records:

对于固定大小的记录和类似指针的记录 ID,遍历是简单地通过追逐数据结构周围的指针来实现,它可以以非常高的速度进行.遍历特定从一个节点到另一个节点的关系,数据库执行几个廉价的 ID 计算(这些计算比搜索全局索引,就像如果在一个图表中伪造一个图我们必须做的那样非图原生数据库)

With fixed sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we’d have to do if faking a graph in a non-graph native database)

这最后一句话触发了我的问题:使用Cassandra或HBase作为存储后端的Titan如何实现这些性能提升或弥补?

This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?

推荐答案

Neo4j 仅在数据位于同一 JVM 的内存中时才达到 O(1).当数据在磁盘上时,由于在磁盘上追逐指针,Neo4j 很慢(它们的磁盘表示很差).

Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).

Titan 仅在数据位于同一 JVM 的内存中时才达到 O(1).当数据在磁盘上时,Titan 比 Neo4j 更快,因为它具有更好的磁盘表示.

Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j cause it has a better disk representation.

请参阅以下博客文章,从数量上解释了上述内容:http://thinkaurelius.com/2013/11/24/Boutique-graph-data-with-titan/

Please see the following blog post that explains the above quantitatively: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

因此,当人们说 O(1) 时,了解他们在内存层次结构的哪个部分很重要.当您在单个 JVM(单机)中时,很容易像 Neo4j 和 Titan 所展示的那样快速他们各自的缓存引擎.当你不能把整个图放在内存中时,你就不得不依赖智能磁盘布局、分布式缓存等.

Thus, its important to understand when people say O(1) what part of the memory hierarchy they are in. When you are in a single JVM (single machine), its easy to be fast as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.

有关详细信息,请参阅以下两篇博文:

Please see the following two blog posts for more information:

http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/

这篇关于Titan 如何使用 HBase/Cassandra 实现恒定时间查找?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆