Titan如何使用HBase/Cassandra实现恒定时间查找? [英] How does Titan achieve constant time lookup using HBase / Cassandra?

查看:68
本文介绍了Titan如何使用HBase/Cassandra实现恒定时间查找?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在O'Reilly的第6章图形数据库"一书中,有关Neo4j如何存储图形数据库说:

In the O'Reilly book "Graph Databases" in chapter 6, which is about how Neo4j stores a graph database it says:

了解为什么本机图处理效率如此之高 与基于重索引的图相比,请考虑以下因素.根据实现的不同,索引查找的算法复杂度可能为O(log n),而查找直接关系的索引查找可能为O(1). 要遍历m步的网络,索引方法的成本为 O(m log n),使使用以下实现的O(m)成本相形见war 无索引邻接.

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

然后说明Neo4j通过将所有节点和关系存储为固定大小的记录来实现此恒定时间查找:

It is then explained that Neo4j achieves this constant time lookup by storing all nodes and relationships as fixed size records:

对于固定大小的记录和类似指针的记录ID,遍历是 只需通过围绕数据结构跟踪指针即可实现, 可以以很高的速度执行.遍历一个特定的 从一个节点到另一个节点的关系,数据库执行多个 便宜的ID计算(这些计算比 搜索全局索引,就像伪造图形中的图形一样 非图形本机数据库)

With fixed sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we’d have to do if faking a graph in a non-graph native database)

这最后一句话引发了我的问题:使用Cassandra或HBase作为存储后端的Titan怎样实现这些性能提升或弥补这些性能提升?

This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?

推荐答案

仅当数据位于同一JVM中的内存中时,Neo4j才能实现O(1).当数据在磁盘上时,由于指针在磁盘上追逐(它们的磁盘表示不佳),Neo4j变慢.

Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).

仅当数据在同一JVM中处于内存中时,Titan才能实现O(1).当数据在磁盘上时,Titan比Neo4j快,因为它具有更好的磁盘表示.

Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j cause it has a better disk representation.

请参阅以下博客文章,对上述内容进行定量解释: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

Please see the following blog post that explains the above quantitatively: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

因此,重要的是要理解人们何时说O(1)他们所在的内存层次结构的哪一部分.当您在单个JVM(单机)中时,Neo4j和Titan都可以轻松快速地演示它.它们各自的缓存引擎.当您无法将整个图形都放在内存中时,您必须依靠智能磁盘布局,分布式缓存等.

Thus, its important to understand when people say O(1) what part of the memory hierarchy they are in. When you are in a single JVM (single machine), its easy to be fast as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.

有关更多信息,请参见以下两个博客文章:

Please see the following two blog posts for more information:

http://thinkaurelius.com/2013 /11/01/a-letter-regarding-native-graph-databases/ http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/

这篇关于Titan如何使用HBase/Cassandra实现恒定时间查找?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆