Cassandra and MapReduce - minimal setup requirements

Question

I need to execute MapReduce on my Cassandra cluster, including data locality, i.e. each job queries only rows which belong to the local Cassandra node where the job runs.

Tutorials exist on how to set up Hadoop for MR on an older Cassandra version (0.7). I cannot find such a tutorial for the current release.

What has changed since 0.7 in this regard?

What software modules are required for a minimal setup (Hadoop+HDFS+...)?

Do I need Cassandra Enterprise?

Answer

Cassandra contains a few classes which are sufficient to integrate with Hadoop:

  • ColumnFamilyInputFormat - This is the input for a Map function. It can read all rows from a single column family when using Cassandra's random partitioner, or a row range when used with Cassandra's ordered partitioner. A Cassandra cluster forms a ring, where each ring part is responsible for a concrete key range. The main task of an input format is to divide the Map input into chunks that can be processed in parallel - these are called InputSplits. In Cassandra's case this is simple: each ring range has one master node, so the input format creates one InputSplit per ring element, and each split results in one Map task. Now we would like to execute each Map task on the same host where its data is stored. Each InputSplit remembers the IP address of its ring part - that is, the IP address of the Cassandra node responsible for that particular key range. The JobTracker creates Map tasks from the InputSplits and assigns them to TaskTrackers for execution; it will try to find a TaskTracker with the same IP address as the InputSplit. Basically, we have to start a TaskTracker on each Cassandra host, and this will guarantee data locality.
  • ColumnFamilyOutputFormat - This configures the context for the Reduce function so that the results can be stored in Cassandra. A sketch of the Mapper/Reducer shapes these two classes expect follows this list.
  • The results from all Map functions have to be combined together before they can be passed to the Reduce function - this is called the shuffle. It uses the local file system; from the Cassandra perspective nothing has to be done here, we just need to configure the path to a local temp directory. Also, there is no need to replace this mechanism with something else (like persisting in Cassandra) - this data does not have to be replicated, since Map tasks are idempotent.
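To make those shapes concrete, here is a minimal sketch of a Mapper/Reducer pair wired to these two classes, modeled on the word_count example shipped with Cassandra. The class names and the output row key are hypothetical, and the exact types (e.g. IColumn) vary between Cassandra versions, so treat this as an illustration rather than a drop-in implementation:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.Mutation;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// ColumnFamilyInputFormat hands each map() call one row: its key and columns.
public class RowCountMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        // A real job would inspect the columns; here we just count rows.
        context.write(new Text("rows"), ONE);
    }
}

// ColumnFamilyOutputFormat expects (row key, list of mutations) pairs from the Reducer.
class RowCountReducer
        extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values)
            sum += v.get();

        // Build a Thrift mutation: one column holding the count.
        Column c = new Column();
        c.setName(ByteBufferUtil.bytes(key.toString()));
        c.setValue(ByteBufferUtil.bytes(String.valueOf(sum)));
        c.setTimestamp(System.currentTimeMillis());
        Mutation m = new Mutation();
        m.setColumn_or_supercolumn(new ColumnOrSuperColumn());
        m.column_or_supercolumn.setColumn(c);

        // "row_count" is a hypothetical row key for the result.
        context.write(ByteBufferUtil.bytes("row_count"), Collections.singletonList(m));
    }
}
```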

Basically, using the provided Hadoop integration gives us the possibility to execute Map jobs on the hosts where the data resides, and the Reduce function can store results back into Cassandra - which is all that I need.

There are two possibilities to execute Map-Reduce:

  • org.apache.hadoop.mapreduce.Job - This class simulates Hadoop in one process. It executes the Map-Reduce task and does not require any additional services/dependencies; it needs only access to a temp directory to store the results of Map jobs for the shuffle. Basically, we have to call a few setters on the Job class to provide things like the class names for the Map task, the Reduce task, the input format, and the Cassandra connection; when setup is done, job.waitForCompletion(true) has to be called - it starts the Map-Reduce task and waits for the results (see the driver sketch after this list). This solution can be used to get into the Hadoop world quickly, and for testing. It will not scale (single process), and it fetches data over the network, but still - it is fine for a start.
  • Real Hadoop cluster - I have not set it up yet, but as I understand it, the Map-Reduce jobs from the previous example will work just fine. Additionally we need HDFS, which will be used to distribute the jars containing the Map-Reduce classes across the Hadoop cluster.
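Putting the pieces together, a minimal driver under the same assumptions might look like the sketch below. It uses the ConfigHelper class from Cassandra's org.apache.cassandra.hadoop package; the keyspace and column family names ("Keyspace1", "Input", "Output") and the contact address are placeholders, and method names may differ slightly between Cassandra versions. With no JobTracker configured, Hadoop's LocalJobRunner runs the whole job in one process, which matches the first option above:

```java
import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class RowCountJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "row-count");
        job.setJarByClass(RowCountJob.class);

        // Map/Reduce implementations from the earlier sketch.
        job.setMapperClass(RowCountMapper.class);
        job.setReducerClass(RowCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(ByteBuffer.class);
        job.setOutputValueClass(List.class);

        // Read input rows from Cassandra ...
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "Keyspace1", "Input");
        SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(
                ByteBufferUtil.EMPTY_BYTE_BUFFER, ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 100));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        // ... and write the results back to Cassandra.
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "Keyspace1", "Output");

        // Blocks until the job finishes; without a JobTracker this runs
        // everything in the current JVM (LocalJobRunner).
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```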
