Cassandra and MapReduce - minimal setup requirements


Problem description

I need to execute MapReduce on my Cassandra cluster, including data locality, i.e. each job queries only the rows that belong to the local Cassandra node where the job runs.

Tutorials exist on how to set up Hadoop for MR on an older Cassandra version (0.7). I cannot find any for the current release.

What has changed since 0.7 in this regard?

What software modules are required for a minimal setup (Hadoop + HDFS + ...)?

Do I need Cassandra Enterprise?

Answer

Cassandra contains a few classes which are sufficient to integrate with Hadoop:


  • ColumnFamilyInputFormat - This is the input for a Map function. It can read all rows from a single CF when using Cassandra's random partitioner, or it can read a row range when used with Cassandra's ordered partitioner. A Cassandra cluster has a ring form, where each part of the ring is responsible for a concrete key range. The main task of an input format is to divide the Map input into data parts that can be processed in parallel - these are called InputSplits. In Cassandra's case this is simple - each ring range has one master node, which means the input format creates one InputSplit for each ring element, and that results in one Map task. Now we would like to execute our Map task on the same host where the data is stored. Each InputSplit remembers the IP address of its ring part - the IP address of the Cassandra node responsible for this particular key range. The JobTracker creates Map tasks from the InputSplits and assigns them to TaskTrackers for execution. The JobTracker tries to find a TaskTracker with the same IP address as the InputSplit - basically, we have to start a TaskTracker on each Cassandra host, and this guarantees data locality.
  • ColumnFamilyOutputFormat - This configures the context for the Reduce function, so that the results can be stored back in Cassandra (a minimal mapper/reducer sketch compatible with both formats follows this list).
  • The results from all Map functions have to be combined together before they can be passed to the Reduce function - this is called the shuffle. It uses the local file system - from the Cassandra perspective nothing has to be done here, we just need to configure the path to a local temp directory. There is also no need to replace this mechanism with something else (like persisting in Cassandra) - this data does not have to be replicated, and Map tasks are idempotent.
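To make the two formats more concrete, here is a minimal sketch of the Mapper/Reducer signatures they expect, assuming the Thrift-based Hadoop support shipped with Cassandra 1.x (in later releases the column type changed); RowMapper, RowReducer and the "count" column are hypothetical names used only for illustration, not part of any official example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map input: one Cassandra row (key + its columns), as delivered by ColumnFamilyInputFormat.
class RowMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {
    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // Emit one count per row; a real job would inspect the column values here.
        context.write(new Text(ByteBufferUtil.string(key)), new IntWritable(1));
    }
}

// Reduce output: (row key, list of mutations), which ColumnFamilyOutputFormat writes back to Cassandra.
class RowReducer extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Build a single column ("count") holding the aggregated value.
        Column column = new Column();
        column.setName(ByteBufferUtil.bytes("count"));
        column.setValue(ByteBufferUtil.bytes(sum));
        column.setTimestamp(System.currentTimeMillis());

        ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
        cosc.setColumn(column);
        Mutation mutation = new Mutation();
        mutation.setColumn_or_supercolumn(cosc);

        List<Mutation> mutations = new ArrayList<Mutation>();
        mutations.add(mutation);
        context.write(ByteBufferUtil.bytes(key.toString()), mutations);
    }
}
```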

Basically, using the provided Hadoop integration gives us the possibility to execute Map jobs on the hosts where the data resides, and the Reduce function can store its results back into Cassandra - which is all that I need.

There are two possibilities to execute Map-Reduce:


  • org.apache.hadoop.mapreduce.Job - this class simulates Hadoop in a single process. It executes the Map-Reduce task and does not require any additional services/dependencies; it only needs access to a temp directory to store the results from the map jobs for the shuffle. Basically, we have to call a few setters on the Job class, covering things like the class names for the Map task, the Reduce task, the input format and the Cassandra connection; once the setup is done, job.waitForCompletion(true) has to be called - it starts the Map-Reduce task and waits for the results (a driver sketch follows this list). This solution can be used to quickly get into the Hadoop world and for testing. It will not scale (single process), and it will fetch data over the network, but still - it is fine for a start.
  • Real Hadoop cluster - I have not set it up yet, but as I understand it, the Map-Reduce jobs from the previous example will work just fine. Additionally, we need HDFS, which will be used to distribute the jars containing the Map-Reduce classes across the Hadoop cluster.
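The first option can be turned into a small driver program. The sketch below assumes Cassandra 1.x's org.apache.cassandra.hadoop.ConfigHelper, ColumnFamilyInputFormat and ColumnFamilyOutputFormat, plus the hypothetical RowMapper/RowReducer classes from the earlier sketch; the host, port, keyspace and column family names are placeholders, and exact ConfigHelper signatures vary slightly between Cassandra releases.

```java
import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CassandraMapReduceJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cassandra-mr-example");
        job.setJarByClass(CassandraMapReduceJob.class);

        // Map/Reduce implementations (the hypothetical classes from the sketch above)
        job.setMapperClass(RowMapper.class);
        job.setReducerClass(RowReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);

        // Read input directly from Cassandra instead of HDFS
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "MyKeyspace", "MyColumnFamily");

        // Which columns each map() call should receive (here: all of them)
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        // Write results back into Cassandra
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "MyKeyspace", "ResultColumnFamily");

        // Runs in-process with the local job runner unless a real cluster is configured
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```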
