Reading a large graph from Titan (on HBase) into Spark
Problem Description
I am researching Titan (on HBase) as a candidate for a large, distributed graph database. We require both OLTP access (fast, multi-hop queries over the graph) and OLAP access (loading all -- or at least a large portion -- of the graph into Spark for analytics).

From what I understand, I can use the Gremlin server to handle OLTP-style queries, where my result set will be small. Since my queries will be generated by a UI, I can use an API to interface with the Gremlin server. So far, so good.

The problem concerns the OLAP use case. Since the data in HBase will be co-located with the Spark executors, it would be efficient to read the data into Spark using an HDFSInputFormat. It would be inefficient (impossible, in fact, given the projected graph size) to execute a Gremlin query from the driver and then distribute the data back to the executors.

The best guidance I have found is an un-concluded discussion from the Titan GitHub repo (https://github.com/thinkaurelius/titan/issues/1045), which suggests that (at least for a Cassandra back-end) the standard TitanCassandraInputFormat should work for reading Titan tables. Nothing is claimed about HBase back-ends.

However, upon reading about the underlying Titan data model (http://s3.thinkaurelius.com/docs/titan/current/data-model.html), it appears that parts of the "raw" graph data are serialized, with no explanation of how to reconstruct a property graph from the contents.

And so, I have two questions:

1) Is everything that I have stated above correct, or have I missed / misunderstood anything?

2) Has anyone managed to read a "raw" Titan graph from HBase and reconstruct it in Spark (either in GraphX or as DataFrames, RDDs, etc.)? If so, can you give me any pointers?
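(For context, the "standard" input-format route discussed above is Titan's Hadoop-Gremlin integration, which drives SparkGraphComputer from a properties file. The sketch below assumes Titan 1.0-era packaging; the exact property keys, class paths, and the table name `titan` are assumptions and may differ across versions:)

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# Connection details for the Titan table in HBase (names assumed)
titanmr.ioformat.conf.storage.backend=hbase
titanmr.ioformat.conf.storage.hostname=127.0.0.1
titanmr.ioformat.conf.storage.hbase.table=titan
spark.master=local[4]
spark.serializer=org.apache.spark.serializer.KryoSerializer
```

With a file like this, an OLAP traversal can then be run from the Gremlin console along the lines of `graph = GraphFactory.open('read-hbase.properties'); g = graph.traversal().withComputer(SparkGraphComputer); g.V().count()`. As the answer below notes, this path goes through the HBase Scan API, which is why it can be slow at large scale.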
Answer

About a year ago, I encountered the same challenge you describe -- we had a very large Titan instance, but we could not run any OLAP processes on it.

I researched the subject pretty deeply, but every solution I found (SparkGraphComputer, TitanHBaseInputFormat) was either very slow (a matter of days or weeks at our scale) or just buggy and missed data. The main reason for the slowness was that all of them used HBase's main API, which turned out to be the speed bottleneck.

So I implemented Mizo -- a Spark RDD for Titan on HBase that bypasses HBase's main API and instead parses HBase's internal data files (called HFiles).

I have tested it at a pretty large scale -- on a Titan graph with hundreds of billions of elements, weighing about 25TB. Because it does not rely on the Scan API that HBase exposes, it is much faster. For example, counting the edges in the graph I mentioned takes about 10 hours.
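For Mizo's actual API, its own repository is the reference. A related way to avoid the RegionServer Scan path using only stock HBase APIs is to read from a table snapshot with TableSnapshotInputFormat, which also reads the underlying HFiles directly. The sketch below assumes a snapshot named `titan-snap` of a Titan table has already been taken (e.g. `snapshot 'titan', 'titan-snap'` in the HBase shell); it is an illustration of the technique, not Mizo's implementation:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object TitanSnapshotCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("titan-snapshot-count"))

    // Point the input format at the snapshot; restoreDir is a scratch
    // HDFS path where snapshot references are materialized (name assumed).
    val job = Job.getInstance(HBaseConfiguration.create())
    TableSnapshotInputFormat.setInput(job, "titan-snap", new Path("/tmp/snap-restore"))

    // Each record is one HBase row: (row key, Result with its cells).
    val rows = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // In Titan's data model each row in the edgestore corresponds to one
    // vertex, so the row count approximates the vertex count.
    println(s"rows: ${rows.count()}")
    sc.stop()
  }
}
```

Reconstructing full vertices and edges from those rows still requires decoding Titan's serialized column format, which is exactly the part Mizo implements.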