Setup and configuration of Titan for a Spark cluster and Cassandra


Problem Description


There are already several questions on the aurelius mailing list as well as here on stackoverflow about specific problems with configuring Titan to get it working with Spark. But what is missing in my opinion is a high-level description of a simple setup that uses Titan and Spark.

What I am looking for is a somewhat minimal setup that uses recommended settings. For example for Cassandra, the replication factor should be 3 and a dedicated datacenter should be used for analytics.

From the information I found in the documentation of Spark, Titan, and Cassandra, such a minimal setup could look like this:

  • Real-time processing DC: 3 Nodes with Titan + Cassandra (RF: 3)
  • Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)

Some questions I have about that setup and Titan + Spark in general:

  1. Is that setup correct?
  2. Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
  3. Is there another setup that you would use instead?
  4. Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?

Maybe someone can even share a config file that supports such a setup (or a better one).

Solution

So I just tried it out and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend) and here is what I came up with:

High-Level Overview

I will concentrate only on the analytics side of the cluster here, so I leave out the real-time processing nodes.

Spark consists of one (or more) master and multiple slaves (workers). Since the slaves do the actual processing, they need to access the data they work on. Therefore Cassandra is installed on the workers and holds the graph data from Titan.
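To tie this to the dedicated analytics datacenter from the question: one way to put the worker-side Cassandra nodes into their own DC is the GossipingPropertyFileSnitch. The following is only a sketch; the DC and rack names (Analytics, RealTime, RAC1) are placeholders I chose for illustration, not something mandated by Titan:

# cassandra-rackdc.properties on each Spark worker / analytics Cassandra node
# (requires endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml)
dc=Analytics
rack=RAC1

-- CQL, once the titan keyspace exists: replicate it into both DCs with RF 3
ALTER KEYSPACE titan WITH replication =
  {'class': 'NetworkTopologyStrategy', 'RealTime': 3, 'Analytics': 3};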

Jobs are sent from the Titan nodes to the Spark master, which distributes them to its workers. Therefore, Titan basically only communicates with the Spark master.

HDFS is only needed because TinkerPop stores intermediate results in it. Note that this changed in TinkerPop 3.2.0.

Installation

HDFS

I just followed a tutorial I found here. There are only two things to keep in mind for Titan (a short command sketch follows after the list):

  • Choose a compatible Hadoop version; for Titan 1.0.0, this is 1.2.1.
  • TaskTrackers and JobTrackers from Hadoop are not needed, as we only want the HDFS and not MapReduce.
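The commands below show roughly what that looks like with Hadoop 1.2.1. This is only a sketch: the installation path /opt/hadoop is a placeholder from my environment, and COORDINATOR is the same host that appears in the config files further down.

# on the NameNode host (COORDINATOR), assuming Hadoop 1.2.1 is extracted to /opt/hadoop
cd /opt/hadoop

# format the NameNode once before the first start
bin/hadoop namenode -format

# start only the HDFS daemons (NameNode + DataNodes);
# bin/start-mapred.sh is skipped on purpose, since JobTracker and TaskTrackers are not needed
bin/start-dfs.sh

# check that all DataNodes have registered
bin/hadoop dfsadmin -report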

Spark

Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically means extracting the archive of a pre-built version. In the end, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to the conf directory of Hadoop.
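A sketch of what that looked like for a standalone Spark cluster (again, /opt/spark and /opt/hadoop are placeholder paths, and WORKER1-3 / COORDINATOR are the hostnames used throughout this answer):

# conf/spark-env.sh on the master and on every worker
export HADOOP_CONF_DIR=/opt/hadoop/conf

# conf/slaves on the master: one worker hostname per line
# WORKER1
# WORKER2
# WORKER3

# on COORDINATOR: start the standalone master (listens on spark://COORDINATOR:7077)
/opt/spark/sbin/start-master.sh

# on COORDINATOR: start the workers listed in conf/slaves via SSH
/opt/spark/sbin/start-slaves.sh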

Configuration of Titan

You also need a HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It needs to contain a core-site.xml file that specifies the NameNode:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
     <name>fs.default.name</name>
     <value>hdfs://COORDINATOR:54310</value>
     <description>The name of the default file system.  A URI whose
       scheme and authority determine the FileSystem implementation.  The
       uri's scheme determines the config property (fs.SCHEME.impl) naming
       the FileSystem implementation class.  The uri's authority is used to
       determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

Add the HADOOP_CONF_DIR to your CLASSPATH and TinkerPop should be able to access the HDFS. The TinkerPop documentation contains more information about that and how to check whether HDFS is configured correctly.
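One way to do that check from the Gremlin console is the hdfs helper provided by the Hadoop plugin (only meant as an illustration of the check described in the TinkerPop docs; the exact output differs per installation):

gremlin> :plugin use tinkerpop.hadoop
gremlin> hdfs
// should print a storage[...] reference to the DFS client rather than the local file system
gremlin> hdfs.ls()
// lists your HDFS home directory, e.g. the output directory after a job has run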

Finally, a config file that worked for me:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark://COORDINATOR:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
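To illustrate how this file is actually used (only a sketch; the file name is whatever you saved the properties above as), an OLAP traversal is then started from the Gremlin console on the Titan node roughly like this with Titan 1.0.0 / TinkerPop 3.0.x:

gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
gremlin> g = graph.traversal(computer(SparkGraphComputer))
gremlin> g.V().count()

The count itself is executed by the Spark workers, and the results are written to the output location on HDFS configured above (gremlin.hadoop.outputLocation).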

Answers

This leads to the following answers:

Is that setup correct?

It seems to be. At least it works with this setup.

Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?

Since it isn't required, I wouldn't do it, as I prefer to keep the Spark servers separate from the Titan servers that users can access.

Is there another setup that you would use instead?

I would be happy to hear from someone else who has a different setup.

Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?

Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I am still not sure about the second part. Maybe someone else can provide more insight here?
