Setup and configuration of Titan for a Spark cluster and Cassandra


Question


There are already several questions on the aurelius mailing list as well as here on stackoverflow about specific problems with configuring Titan to get it working with Spark. But what is missing in my opinion is a high-level description of a simple setup that uses Titan and Spark.

What I am looking for is a somewhat minimal setup that uses recommended settings. For example, for Cassandra the replication factor should be 3, and a dedicated datacenter should be used for analytics.

From the information I found in the documentation of Spark, Titan, and Cassandra, such a minimal setup could look like this:

  • Real-time processing DC: 3 Nodes with Titan + Cassandra (RF: 3)
  • Analytics DC: 1 Spark master + 3 Spark slaves with Cassandra (RF: 3)

Some questions I have about that setup and Titan + Spark in general:

  1. Is that setup correct?
  2. Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?
  3. Is there another setup that you would use instead?
  4. Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?

Maybe someone can even share a config file that supports such a setup (or a better one).

Solution

So I just tried it out and set up a simple Spark cluster to work with Titan (and Cassandra as the storage backend) and here is what I came up with:

High-Level Overview

I concentrate only on the analytics side of the cluster here, so I leave out the real-time processing nodes.

Spark consists of one (or more) master and multiple slaves (workers). Since the slaves do the actual processing, they need to access the data they work on. Therefore Cassandra is installed on the workers and holds the graph data from Titan.

Jobs are sent from Titan nodes to the Spark master, which distributes them to its workers. Therefore, Titan basically only communicates with the Spark master.

HDFS is only needed because TinkerPop stores intermediate results in it. Note that this changed in TinkerPop 3.2.0.

Installation

HDFS

I just followed a tutorial I found here. There are only two things to keep in mind here for Titan:

  • Choose a compatible Hadoop version; for Titan 1.0.0, this is Hadoop 1.2.1.
  • TaskTrackers and JobTrackers from Hadoop are not needed, as we only want HDFS and not MapReduce.
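Since only HDFS is wanted, it is enough to start the HDFS daemons and skip the MapReduce ones. As a sketch, a minimal hdfs-site.xml for such an HDFS-only setup could look like this (the replication factor and the data directory are assumptions for a small test cluster, not values from the original setup):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- Number of copies HDFS keeps of each block; 1 can be enough for
         intermediate OLAP results on a small test cluster -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Where the DataNode stores its blocks (placeholder path) -->
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/dfs/data</value>
  </property>
</configuration>
```

With that in place, bin/start-dfs.sh brings up the NameNode and DataNodes, and start-mapred.sh can simply never be run.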

Spark

Again, the version has to be compatible, which is also 1.2.1 for Titan 1.0.0. Installation basically means extracting the archive of a pre-built version. Finally, you can configure Spark to use your HDFS by exporting HADOOP_CONF_DIR, which should point to the conf directory of Hadoop.
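One place to put that export so every Spark daemon picks it up is conf/spark-env.sh. A sketch, with a placeholder install path:

```sh
# conf/spark-env.sh (placeholder path, adjust to your Hadoop install)
# Tell Spark where the Hadoop client configuration lives so it can
# resolve hdfs:// URIs such as the OLAP output location.
export HADOOP_CONF_DIR=/opt/hadoop-1.2.1/conf
```

The master and its workers can then be started with the scripts under sbin/, e.g. start-master.sh on the coordinator and start-slaves.sh once conf/slaves lists the worker hosts.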

Configuration of Titan

You also need a HADOOP_CONF_DIR on the Titan node from which you want to start OLAP jobs. It needs to contain a core-site.xml file that specifies the NameNode:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
     <name>fs.default.name</name>
     <value>hdfs://COORDINATOR:54310</value>
     <description>The name of the default file system.  A URI whose
       scheme and authority determine the FileSystem implementation.  The
       uri's scheme determines the config property (fs.SCHEME.impl) naming
       the FileSystem implementation class.  The uri's authority is used to
       determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>

Add the HADOOP_CONF_DIR to your CLASSPATH and TinkerPop should be able to access the HDFS. The TinkerPop documentation contains more information about that and how to check whether HDFS is configured correctly.
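For reference, a quick sanity check from the Gremlin Console is the hdfs helper object that the Hadoop plugin provides (this sketch assumes the tinkerpop.hadoop plugin is available in the console shipped with Titan):

```
gremlin> :plugin use tinkerpop.hadoop
gremlin> hdfs.ls()
```

If the listing shows your local file system instead of the HDFS contents, the CLASSPATH/HADOOP_CONF_DIR setup is not being picked up.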

Finally, a config file that worked for me:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# Titan Cassandra InputFormat configuration
#
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=WORKER1,WORKER2,WORKER3
titanmr.ioformat.conf.storage.port=9160
titanmr.ioformat.conf.storage.keyspace=titan
titanmr.ioformat.cf-name=edgestore

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647

#
# SparkGraphComputer Configuration
#
spark.master=spark://COORDINATOR:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
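With that properties file in place, an OLAP traversal can be launched from the Gremlin Console roughly like this (the file name is a placeholder; the traversal syntax is the TinkerPop 3.0 style that Titan 1.0.0 ships with):

```
gremlin> graph = GraphFactory.open('conf/hadoop-graph.properties')
gremlin> g = graph.traversal(computer(SparkGraphComputer))
gremlin> g.V().count()
```

If everything is wired up, the count shows up as a job in the Spark master's web UI (port 8080 by default).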

Answers

This leads to the following answers:

Is that setup correct?

It seems to be. At least it works with this setup.

Should Titan also be installed on the 3 Spark slave nodes and / or the Spark master?

Since it isn't required, I wouldn't do that; I prefer to keep the Spark servers separate from the Titan servers that users can access.

Is there another setup that you would use instead?

I would be happy to hear from someone else who has a different setup.

Will the Spark slaves only read data from the analytics DC and ideally even from Cassandra on the same node?

Since the Cassandra nodes (from the analytics DC) are explicitly configured, the Spark slaves shouldn't be able to pull data from completely different nodes. But I am still not sure about the second part. Maybe someone else can provide more insight here?
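Regarding the datacenter split, the graph data has to be replicated into the analytics DC in the first place, which is done with a NetworkTopologyStrategy keyspace. A sketch in CQL (the DC names are placeholders and assume the cluster runs a DC-aware snitch):

```cql
-- Replicate the Titan keyspace into both datacenters (DC names are placeholders)
ALTER KEYSPACE titan WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'Realtime': 3,
  'Analytics': 3
};
```

Whether a Spark worker then actually reads from its co-located replica additionally depends on where the input splits are scheduled, which is the part of the question that remains open.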
