Pig & Cassandra & DataStax Splits Control

Question

I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop & Cassandra, and I have to say it is quite impressive. Hats off to those guys!!

I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the mapper takes a very long time to grind through these 53M rows. I started poking around the logs and I can see that the map is spilling repeatedly (I saw 177 spills from the mapper), and I think this is part of the problem.

The combination of CassandraInputFormat and JobConfig creates only a single mapper, so this mapper has to read 100% of the rows from the table. I call this anti-parallel :)

Now, there are a lot of gears at work in this picture, including:

  • 2 physical nodes
  • The Hadoop node is in an "Analytics" DC (default config), but physically in the same rack.
  • I can see the job using LOCAL_QUORUM

Can anybody point me in the direction of how to get Pig to create more input splits so I can run more mappers? I have 23 slots; it seems a pity to only ever use one.

Or am I completely mad and don't understand the problem? I welcome both kinds of answers!

Answer

You should set pig.noSplitCombination = true. You can do this in any one of three places.

When invoking the script:

dse pig -Dpig.noSplitCombination=true /path/to/script.pig

In the Pig script:

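-- Disable split combination before the LOAD so each input split gets its own mapper.
SET pig.noSplitCombination true;
-- CqlStorage reads CQL tables through cql:// URLs (cfs:// addresses the Cassandra File System).
table = LOAD 'cql://ks/cf' USING CqlStorage();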

Or permanently, in /etc/dse/pig/pig.properties. Uncomment:

pig.noSplitCombination=true

Otherwise, Pig may combine all of your input splits into one, and the job log will report "Total input paths (combined) to process : 1".
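To see why split combination can collapse a job to a single mapper, here is a minimal Python sketch of greedy split packing. It is a simplified model, not Pig's actual implementation: the function name combine_splits, the 128 MB threshold (playing the role of a setting such as pig.maxCombinedSplitSize), and the split sizes are all assumptions made up for illustration. The idea is that if every input split reports a small size, all of them fit under the threshold and end up in one group, which means one map task.

```python
def combine_splits(split_sizes, max_combined_size, no_split_combination=False):
    """Greedily pack splits into groups; each group stands for one map task.

    Hypothetical model for illustration only, not Pig's real algorithm.
    """
    if no_split_combination:
        # Combination disabled: every split keeps its own mapper.
        return [[size] for size in split_sizes]

    groups, current, current_size = [], [], 0
    for size in split_sizes:
        # Start a new group once adding this split would exceed the threshold.
        if current and current_size + size > max_combined_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Assumed numbers: 40 splits that each report a tiny size, 128 MB threshold.
splits = [1_000] * 40
threshold = 128 * 1024 * 1024

print(len(combine_splits(splits, threshold)))                             # 1
print(len(combine_splits(splits, threshold, no_split_combination=True)))  # 40
```

With combination enabled, the 40 small splits collapse into a single group; with it disabled, the job keeps 40 separate splits and can use that many map slots, up to what the cluster has available.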
