pig skewed join with a big table causes "Split metadata size exceeded 10000000"
Problem description
We have a Pig join between a small (16M-row) distinct table and a big (6B-row) skewed table.
A regular join finishes in 2 hours (after some tweaking). We tried using skewed
and were able to improve the performance to 20 minutes.
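For context, a minimal sketch of such a join in Pig Latin (relation, path, and field names are hypothetical, not from the original script; `USING 'skewed'` is Pig's skewed-join strategy):

```
-- Hypothetical inputs; 'skewed' tells Pig to sample the right-hand
-- relation and partition heavily skewed keys across reducers
small = LOAD 'small_table' AS (key:chararray, val:chararray);
big   = LOAD 'big_table'   AS (key:chararray, other:chararray);
j     = JOIN small BY key, big BY key USING 'skewed';
```

The sampling pass that `'skewed'` triggers is the SAMPLER job mentioned in the error below.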
HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:
Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]
This is reproducible every time we try using skewed, and does not happen when we use the regular join.
We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1, and we can see it's there in the job.xml file, but it doesn't change anything!
What's happening here? Is this a bug in the distribution sample created by using skewed? Why doesn't changing the param to -1 help?
Accepted answer
In newer versions of Hadoop (>=2.4.0 but maybe even earlier) you should be able to set the maximum split size at the job level by using the following configuration property:
mapreduce.job.split.metainfo.maxsize=-1
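In a Pig script, a per-job property like this can be set with Pig's `set` command at the top of the script (a sketch; note this is the `mapreduce.job.*` property, which superseded the older `mapreduce.jobtracker.*` one the question tried):

```
-- Disable the 10 MB split metadata limit for this job (-1 = no limit)
set mapreduce.job.split.metainfo.maxsize -1;
```

Alternatively, it can be passed on the command line when launching the script, e.g. via `pig -Dmapreduce.job.split.metainfo.maxsize=-1`.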