Data motion in Cassandra/HDFS and Spark

Question

When designing a distributed storage and analytics architecture, is it a common usage pattern to run an analytics engine on the same machines as the data nodes? Specifically, would it make sense to run Spark/Storm directly on Cassandra/HDFS nodes?

I know that MapReduce on HDFS has this sort of usage pattern since, according to Hortonworks (http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_using-apache-hadoop/content/yarn_overview.html), YARN minimizes data motion. I have no idea whether this is the case with these other systems, though. I would imagine it is, since they seem to be so pluggable with each other, but I can't seem to find any information about this online.

I'm sort of a newbie on this topic, so any resources or answers would be greatly appreciated.

Thanks

Recommended answer

Yes, it makes sense to run Spark on Cassandra nodes to minimize data movement between machines.

When you create an RDD from a Cassandra table, the RDD partitions will be created from the token ranges that are local to each machine, so a Spark task scheduled on that machine can read its partition without pulling the data across the network.
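
To make the locality argument concrete, here is a minimal toy sketch in plain Python. It is not the Spark Cassandra connector's implementation or API: the names `TokenRange`, `Partition`, `build_partitions`, and `network_bytes`, the node names, and the partition sizes are all hypothetical, and the model assumes one contiguous token range per node with no replication or virtual nodes. It only illustrates why building one partition per locally owned token range, and running each task where its range lives, avoids moving data between machines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    start: int   # inclusive
    end: int     # exclusive
    owner: str   # hypothetical name of the node that stores this range


@dataclass(frozen=True)
class Partition:
    index: int
    token_range: TokenRange
    preferred_location: str  # node where the task would ideally run


# Assumption: a simplified ring where each node owns exactly one range.
RING = [
    TokenRange(0, 100, "node-a"),
    TokenRange(100, 200, "node-b"),
    TokenRange(200, 300, "node-c"),
]


def build_partitions(ring):
    """Create one partition per token range, preferring the owning node."""
    return [Partition(i, tr, tr.owner) for i, tr in enumerate(ring)]


def network_bytes(partitions, placement, partition_bytes):
    """Sum the bytes of every partition whose task runs away from its data."""
    return sum(
        partition_bytes[p.index]
        for p in partitions
        if placement[p.index] != p.token_range.owner
    )


partitions = build_partitions(RING)
partition_bytes = {0: 500, 1: 700, 2: 300}  # made-up sizes for illustration

# Every task runs on the node that owns its token range.
local_placement = {p.index: p.preferred_location for p in partitions}

# Every task runs on the "next" node instead, so all reads are remote.
remote_placement = {
    p.index: partitions[(p.index + 1) % len(partitions)].preferred_location
    for p in partitions
}

print(network_bytes(partitions, local_placement, partition_bytes))   # 0
print(network_bytes(partitions, remote_placement, partition_bytes))  # 1500
```

In this toy model the co-located placement moves nothing over the network, while the shifted placement moves every byte, which is the effect the answer is describing.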

Here's a link to a talk on this subject for the Spark Cassandra connector:

Cassandra and Spark: Optimizing for Data Locality (https://spark-summit.org/2015/events/cassandra-and-spark-optimizing-for-data-locality/)

As it says in the summary: "There are only three things that are important in doing analytics on a distributed database: Locality, locality and locality."
