Pig & Cassandra & DataStax Splits Control
Question
I have been using Pig with my Cassandra data to do all kinds of amazing feats of grouping that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop & Cassandra, and I have to say it is quite impressive. Hats off to those guys!!
I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table that has ~53M rows (about 350 bytes each), and I notice that the Mapper layer takes a very long time to grind through these 53M rows. I started poking around the logs and I can see that the map is spilling repeatedly (I saw 177 spills from the mapper), and I think this is part of the problem.
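For scale, here is a rough back-of-the-envelope estimate of how much raw data that one mapper has to read. It uses only the two figures above (~53M rows, ~350 bytes each) and ignores serialization and storage overhead, so treat it as an order-of-magnitude sketch:

```shell
# Rough estimate of the raw data volume read by the single mapper.
rows=53000000         # ~53M rows, from the question
bytes_per_row=350     # ~350 bytes each, from the question
total_bytes=$((rows * bytes_per_row))
total_mb=$((total_bytes / 1000000))   # decimal megabytes
echo "approx ${total_mb} MB through one mapper"
```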
The combination of CassandraInputFormat and JobConfig only creates a single mapper, so this mapper has to read 100% of the rows from the table. I call this anti-parallel :)
Now, there are a lot of gears at work in this picture, including:
- 2 physical nodes
- The Hadoop node is in the "Analytics" DC (default config), but physically in the same rack.
- I can see the job using LOCAL_QUORUM
Can anybody point me in the direction of how to get Pig to create more input splits so I can run more mappers? I have 23 slots; it seems a pity to only use one all the time.
Or am I completely mad and misunderstanding the problem? I welcome both kinds of answers!
Recommended answer
You should set pig.noSplitCombination = true. You can do this in one of three places.
When invoking the script:
dse pig -Dpig.noSplitCombination=true /path/to/script.pig
In the Pig script:
SET pig.noSplitCombination true;
table = LOAD 'cql://ks/cf' USING CqlStorage();
Or permanently in /etc/dse/pig/pig.properties. Uncomment:
pig.noSplitCombination=true
Otherwise, Pig may combine all of your input splits into one, and the job log will report "Total input paths (combined) to process : 1".
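The third option can be sketched as a shell one-liner. This assumes the property ships commented out as `#pig.noSplitCombination=true`, which is a guess about the file's layout, so check your own copy first. The sketch runs against a scratch file rather than the live config:

```shell
# Work on a scratch copy; the commented-out line below is an assumed layout,
# not copied from a real DSE install.
props=$(mktemp)
printf '%s\n' '# some other setting' '#pig.noSplitCombination=true' > "$props"

# Uncomment the property: strip the leading '#' and any spaces after it.
sed -i.bak 's/^#[[:space:]]*\(pig\.noSplitCombination=true\)/\1/' "$props"

# Confirm that the property is now active (prints the matching line, if any).
grep -n '^pig\.noSplitCombination=true' "$props"
```

On a real node you would point `props` at /etc/dse/pig/pig.properties (probably with elevated privileges) and rerun the job. The "Total input paths" line in the job log is the thing to watch.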