Pig skewed join with a big table causes "Split metadata size exceeded 10000000"


Question

We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried USING 'skewed' and were able to improve the performance to 20 minutes.
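For context, a skewed join in Pig looks roughly like the sketch below; the relation names, paths, and schema are hypothetical, since the original script is not shown in the question.

small  = LOAD 'small_table' AS (key:chararray, val:chararray);    -- 16M distinct rows (hypothetical path/schema)
big    = LOAD 'big_table'   AS (key:chararray, other:chararray);  -- 6B-19B rows, skewed on key
joined = JOIN big BY key, small BY key USING 'skewed';            -- adds the SAMPLER job before the actual join
STORE joined INTO 'joined_out';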

However, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:

Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]

This is reproducible every time we try USING 'skewed', and it does not happen when we use the regular join.

We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1, and we can see it in the job.xml file, but it doesn't change anything!
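For reference, one common way a per-job override like this is attempted from a Pig script is with Pig's SET statement, which is what places the property into job.xml; this is a sketch, not necessarily how the original poster set it:

-- hypothetical way of setting the property from the Pig script;
-- it shows up in job.xml but, as described above, did not change the behavior
set mapreduce.jobtracker.split.metainfo.maxsize -1;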

What's happening here? Is this a bug with the distribution sample created by USING 'skewed'? Why doesn't changing the param to -1 help?

Answer

In newer versions of Hadoop (>=2.4.0, but maybe even earlier) you should be able to set the maximum split metadata size at the job level by using the following configuration property:

mapreduce.job.split.metainfo.maxsize=-1

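For example, this job-level property could be applied from the top of the Pig script with Pig's SET statement (a minimal sketch; only the property itself comes from the answer above, the surrounding usage is an assumption):

-- job-level override of the split metadata limit; -1 removes the 10000000 cap
set mapreduce.job.split.metainfo.maxsize -1;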
