Reading a large graph from Titan (on HBase) into Spark


Problem description

I am researching Titan (on HBase) as a candidate for a large, distributed graph database. We require both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all - or at least a large portion - of the graph into Spark for analytics).

From what I understand, I can use the Gremlin server to handle OLTP-style queries where my result-set will be small. Since my queries will be generated by a UI I can use an API to interface with the Gremlin server. So far, so good.
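
For reference, a minimal sketch of what that OLTP path can look like with the TinkerPop Gremlin driver; the host, port, and the traversal itself are placeholders rather than anything from the original setup:

```scala
import scala.collection.JavaConverters._
import org.apache.tinkerpop.gremlin.driver.Cluster

object GremlinOltpExample {
  def main(args: Array[String]): Unit = {
    // Placeholder host/port for a Gremlin Server fronting the Titan graph.
    val cluster = Cluster.build("gremlin-server.example.com").port(8182).create()
    val client = cluster.connect()
    try {
      // A small multi-hop query submitted as a Gremlin script; the result set
      // stays small, so it is fine to pull it back to the caller.
      val results = client.submit(
        "g.V().has('user', 'userId', 42).out('follows').out('follows').dedup().limit(20).values('name')")
      results.all().get().asScala.foreach(r => println(r.getString))
    } finally {
      cluster.close()
    }
  }
}
```

Submitting scripts like this keeps the server in charge of the traversal, so a UI-driven API only ever ships a query string and receives a small result set.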

The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would be efficient to read the data into Spark using an HDFSInputFormat. It would be inefficient (impossible, in fact, given the projected graph size) to execute a Gremlin query from the driver and then distribute the data back to the executors.
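
To illustrate the co-located read (this is not claimed in the original question), here is a sketch that pulls the raw Titan table into Spark with the stock HBase TableInputFormat. The table name and ZooKeeper quorum are assumptions, and the values that come back are Titan's serialized adjacency data, not yet a usable property graph:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object RawTitanScan {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("raw-titan-scan"))

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")  // assumption: your ZK quorum
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "titan")    // assumption: default Titan table name

    // Each HBase row corresponds to one Titan vertex; its cells hold Titan's
    // serialized adjacency/property data, which still needs Titan's own decoders.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"approximate vertex count: ${rows.count()}")
    sc.stop()
  }
}
```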

The best guidance I have found is an inconclusive discussion in the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (at least for a Cassandra back end) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is said about HBase back ends.

However, upon reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the "raw" graph data are serialized, with no explanation of how to reconstruct a property graph from the contents.

And so, I have two questions:

1) Is everything that I have stated above correct, or have I missed / misunderstood anything?

2) Has anyone managed to read a "raw" Titan graph from HBase and reconstruct it in Spark (either in GraphX or as DataFrames, RDDs etc)? If so, can you give me any pointers?

Solution

About a year ago, I encountered the same challenge as you describe -- we had a very large Titan instance, but we could not run any OLAP processes on it.

I have researched the subject pretty deeply, but every solution I found (SparkGraphComputer, TitanHBaseInputFormat) was either very slow (a matter of days or weeks at our scale) or just buggy and missing data. The main reason for the slowness was that all of them used HBase's main client API, which turned out to be the speed bottleneck.
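
For context, this is roughly how the SparkGraphComputer path is wired up against Titan's HBase input format. The property keys and class names below are as best recalled from the Titan 1.0 hadoop-gremlin documentation and may differ between versions (earlier Titan releases name the input format differently, e.g. TitanHBaseInputFormat; newer TinkerPop releases use gremlin.hadoop.graphReader), so treat this as a sketch rather than a working recipe:

```scala
import org.apache.commons.configuration.BaseConfiguration
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory

object SparkGraphComputerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new BaseConfiguration()
    conf.setProperty("gremlin.graph", "org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph")
    // Titan 1.0 class name; check the class shipped with your Titan version.
    conf.setProperty("gremlin.hadoop.graphInputFormat",
      "com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat")
    conf.setProperty("gremlin.hadoop.graphOutputFormat",
      "org.apache.hadoop.mapreduce.lib.output.NullOutputFormat")
    // Storage settings for the source graph, prefixed with titanmr.ioformat.conf.
    conf.setProperty("titanmr.ioformat.conf.storage.backend", "hbase")
    conf.setProperty("titanmr.ioformat.conf.storage.hostname", "zk1,zk2,zk3") // assumption
    conf.setProperty("spark.master", "yarn-client")                            // assumption
    conf.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val graph = GraphFactory.open(conf)
    // OLAP traversal executed on Spark; this is the path the answer found too
    // slow at large scale, because it still goes through HBase scans.
    val g = graph.traversal(GraphTraversalSource.computer(classOf[SparkGraphComputer]))
    println("vertex count: " + g.V().count().next())
    graph.close()
  }
}
```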

So I implemented Mizo - a Spark RDD for Titan on HBase that bypasses HBase's main API and parses HBase's internal data files (called HFiles).
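
Mizo's own API is not reproduced here; the sketch below only illustrates the underlying idea as described: enumerate the Titan table's edgestore HFiles on HDFS, hand each Spark task one file, and read it with HBase's low-level HFile reader instead of going through region-server scans. The paths, the "e" column-family name (Titan's short name for the edgestore), and the HBase 1.x reader signature are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.io.hfile.{CacheConfig, HFile}
import org.apache.spark.{SparkConf, SparkContext}

object HFileCellCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hfile-cell-count"))

    // Assumption: default HBase root dir and a Titan table named "titan";
    // adjacency data lives in the edgestore column family ("e" with short CF names).
    val edgestoreGlob = "/hbase/data/default/titan/*/e/*"

    val fs = FileSystem.get(new Configuration())
    val hfilePaths = fs.globStatus(new Path(edgestoreGlob)).map(_.getPath.toString)

    // One Spark task per HFile: open it with HBase's low-level reader and count
    // cells, bypassing the region servers entirely (the approach Mizo takes, in spirit).
    val cellCount = sc.parallelize(hfilePaths, hfilePaths.length).map { p =>
      val c = new Configuration()
      val localFs = FileSystem.get(c)
      val reader = HFile.createReader(localFs, new Path(p), new CacheConfig(c), c)
      val scanner = reader.getScanner(false, false)
      var n = 0L
      if (scanner.seekTo()) {
        n += 1
        while (scanner.next()) n += 1
      }
      reader.close()
      n
    }.fold(0L)(_ + _)

    println(s"total cells in edgestore HFiles: $cellCount")
    sc.stop()
  }
}
```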

I have tested it on a pretty large scale -- a Titan graph with hundreds of billions of elements, weighing about 25TB.

Because it does not rely on the Scan API that HBase exposes, it is much faster. For example, counting edges in the graph I mentioned takes about 10 hours.
