Pig & Cassandra & DataStax Splits Control
Question
I have been using Pig with my Cassandra data to do all kinds of amazing grouping feats that would be almost impossible to write imperatively. I am using DataStax's integration of Hadoop & Cassandra, and I have to say it is quite impressive. Hats off to those guys!
I have a pretty small sandbox cluster (2 nodes) where I am putting this system through some tests. I have a CQL table with ~53M rows (about 350 bytes each), and I notice that the mapper takes a very long time to grind through these 53M rows. I started poking around the logs and can see that the map is spilling repeatedly (I saw 177 spills from the mapper), and I think this is part of the problem.
The combination of CassandraInputFormat and JobConfig creates only a single mapper, so this mapper has to read 100% of the rows from the table. I call this anti-parallel :)
Now, there are a lot of gears at work in this picture, including:
- 2 physical nodes
- the Hadoop node is located in
Can anybody point me in the direction of how to get Pig to create more input splits so I can run more mappers? I have 23 slots; it seems a pity to only ever use one.
Or am I completely mad and misunderstanding the problem? I welcome both kinds of answers!
Recommended answer
You should set pig.noSplitCombination = true. This can be done in one of three places.
When invoking the script:
dse pig -Dpig.noSplitCombination=true /path/to/script.pig
In the Pig script itself:
SET pig.noSplitCombination true
table = LOAD 'cql://ks/cf' USING CqlStorage();
Or permanently in /etc/dse/pig/pig.properties. Uncomment:
pig.noSplitCombination=true
Otherwise, Pig may combine all of the input splits into one, and the job log reports "Total input paths (combined) to process : 1".
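For the permanent option, it can be worth confirming that the property line really is uncommented in the properties file. The following is a minimal sketch; the function name check_no_split_combination is hypothetical and not part of DSE or Pig, and it only inspects the text of a file passed to it.

```shell
# Hypothetical helper (not part of DSE or Pig): prints "enabled" if the given
# properties file contains an uncommented pig.noSplitCombination=true line,
# and "disabled" otherwise (including when the line is still commented out).
check_no_split_combination() {
    props_file="$1"
    if grep -Eq '^[[:space:]]*pig\.noSplitCombination[[:space:]]*=[[:space:]]*true' "$props_file"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Example call, assuming the path mentioned in the answer:
# check_no_split_combination /etc/dse/pig/pig.properties
```

This checks only the file contents; a -D flag on the command line or a SET in the script would not show up here.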