Gremlin - Giraph - GraphX? On TitanDb


Question

I need some help confirming my choice, and any information you can give me would be welcome. My data is stored in TitanDb on Cassandra. I have a very large graph. My goal is to use MLlib on the graph later.

My first idea was to use Titan with GraphX, but I did not find anything finished or in active development... TinkerPop is not ready yet. So I had a look at Giraph. Titan can communicate with Rexster from TinkerPop.

My question is: what are the benefits of using Giraph? Gremlin seems to do the same thing and is distributed.

Thank you very much for explaining. I think I don't really understand the difference between Gremlin and Giraph (or GraphX).

Have a nice day.

Recommended answer

Interesting question. I am on the same track.

First, your question about MLlib. I assume that you mean Apache Spark MLlib, the machine learning (ML) implementation on top of Apache Spark. So my conclusion is: you want to run ML algorithms for purposes such as clustering and classification using the data in your Titan/Cassandra based graph database. Please note that you could also use graph processing algorithms, like the PageRank mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.
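To make the "no ML needed" point concrete, here is a minimal sketch in plain Python. It groups vertices into clusters using connected components, which is a simpler graph algorithm than PageRank and is swapped in here purely for illustration. The edge list and the function name are made up; nothing below uses Titan, Giraph or Spark.

```python
from collections import deque

# Hypothetical undirected "knows" edges; the data is made up for illustration.
EDGES = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]

def connected_components(edges):
    """Group vertices into clusters: two vertices share a cluster if a path joins them."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen = set()
    components = []
    for start in sorted(adj):
        if start in seen:
            continue
        # Breadth-first search from an unvisited vertex collects one cluster.
        queue = deque([start])
        seen.add(start)
        members = []
        while queue:
            node = queue.popleft()
            members.append(node)
            for nxt in sorted(adj[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(members))
    return components

clusters = connected_components(EDGES)
```

On a very large graph you would run the distributed equivalent of this in a graph processing engine rather than on one machine, but the idea is the same: the clusters come from the graph structure alone.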

Apache Spark MLlib seems to be future-proof and widely supported; the project's most recent announcements were about new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Apache Mahout has also adopted Apache Spark as its data storage layer, which is why I mention it in this post. In addition to in-memory computing, Apache Spark offers the MLlib mentioned above for machine learning, Spark SQL, which is like Hive on Spark, GraphX, which is a graph processing system as explained by spidy, and Spark Streaming for processing streaming data.

I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets) on top of storage layers such as Cassandra, Hadoop/HCatalog and HBase. Apache Spark offers a connector to Cassandra. Note that RDDs are immutable: you cannot alter an RDD in place, you can only process and analyze the data and derive new RDDs from it. Regarding the RDD as Apache Spark's logical storage layer: you could compare an RDD to a view in the good old SQL days. An RDD gives you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers an API for three development environments: Scala, Java and Python.
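The view analogy can be sketched in a few lines of plain Python. This is not the Spark API: the View class, the row schema and the sample rows are all hypothetical, and the only goal is to show that each transformation yields a new read-only view while the stored data stays untouched.

```python
from typing import Callable, Iterable, Tuple

Row = Tuple[str, int]  # (name, age): a made-up schema for illustration

# Stand-in for a table in the storage layer (e.g. Cassandra). Hypothetical data.
STORED_ROWS: Tuple[Row, ...] = (("alice", 34), ("bob", 19), ("carol", 52))

class View:
    """A read-only, lazily evaluated view over a source, loosely RDD-like."""

    def __init__(self, source: Callable[[], Iterable]):
        self._source = source  # re-invoked on every evaluation

    def map(self, fn: Callable) -> "View":
        # Returns a NEW view; this one is left untouched.
        return View(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable) -> "View":
        # Also returns a new view instead of modifying anything in place.
        return View(lambda: (x for x in self._source() if pred(x)))

    def collect(self) -> list:
        # Only here is the chain of transformations actually evaluated.
        return list(self._source())

people = View(lambda: STORED_ROWS)
adults = people.filter(lambda row: row[1] >= 21)
names = adults.map(lambda row: row[0])
```

Calling names.collect() evaluates the whole chain, and people.collect() still returns the original three rows afterwards, which is the behaviour the view comparison is meant to convey.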

Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark GraphX. Apache Giraph uses Hadoop as its data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question about ML using MLlib, and Apache Giraph is not an ML solution.

Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both operate on a graph database. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank (e.g. who has the most followers), whilst Gremlin is meant for traversing, i.e. querying the graph database using the complex relationships (edges) between entities (vertices) to obtain result sets of vertex and edge properties.
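The difference is easier to see side by side. Below is a small plain-Python sketch, not Giraph or Gremlin code: pagerank() stands in for graph processing (an iterative computation over the whole graph), and followers_of_followed() stands in for a traversal (start at one vertex and walk edges). The "follows" graph, the names and both function names are invented for the example.

```python
# Toy "follows" graph: an edge (a, b) means a follows b. Made-up data.
EDGES = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "erin"),
    ("erin", "alice"),
]

def out_neighbors(edges):
    """Build an adjacency map: vertex -> list of vertices it points to."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
        adj.setdefault(dst, [])
    return adj

def pagerank(edges, damping=0.85, iterations=100):
    """Graph processing: every vertex takes part in every iteration."""
    adj = out_neighbors(edges)
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in adj}
        for v, targets in adj.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                for t in adj:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

def followers_of_followed(edges, person):
    """Traversal: from `person`, step out along 'follows' edges, then back in.

    Returns the other people who follow someone that `person` follows.
    """
    adj = out_neighbors(edges)
    result = set()
    for followed in adj[person]:
        for src, dst in edges:
            if dst == followed and src != person:
                result.add(src)
    return sorted(result)

ranks = pagerank(EDGES)
top = max(ranks, key=ranks.get)
co_followers = followers_of_followed(EDGES, "alice")
```

The first function touches every vertex and edge many times to produce a score for the whole graph, which is the kind of job Giraph or GraphX distributes. The second answers a question about one starting vertex by following a few edges, which is what a Gremlin traversal expresses.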
