Pig skewed join with a big table causes "Split metadata size exceeded 10000000"


Problem description




We have a Pig join between a small (16M rows) distinct table and a big (6B rows) skewed table. A regular join finishes in 2 hours (after some tweaking). We tried using a skewed join and were able to improve the performance to 20 minutes.
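For context, a minimal sketch of the two join variants being compared, assuming hypothetical relation and field names (the real schemas are not shown in the question); only the USING clause differs:

-- regular hash join (the ~2 hour variant)
big   = LOAD 'big_data'   AS (b1, b2, b3);
small = LOAD 'small_data' AS (s1, s2, s3);
J_reg = JOIN big BY b1, small BY s1;

-- skewed join: Pig first runs a SAMPLER job over the first (skewed) relation,
-- which is where the error below is raised
J_skw = JOIN big BY b1, small BY s1 USING 'skewed';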

HOWEVER, when we try a bigger skewed table (19B rows), we get this message from the SAMPLER job:

Split metadata size exceeded 10000000. Aborting job job_201305151351_21573 [ScriptRunner]
at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:817) [ScriptRunner]

This is reproducible every time we try using skewed, and does not happen when we use the regular join.

We tried setting mapreduce.jobtracker.split.metainfo.maxsize=-1 and we can see it's there in the job.xml file, but it doesn't change anything!
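For reference, a sketch of how such a property can be passed from inside a Pig script (that this was the exact mechanism used is an assumption; whether the value actually takes effect is precisely what the question is about):

-- ask Pig to put the property into the job configuration;
-- it does show up in job.xml, yet the SAMPLER job still aborts
set mapreduce.jobtracker.split.metainfo.maxsize '-1';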

What's happening here? Is this a bug with the distribution sample created by using skewed? Why doesn't changing the param to -1 help?

Solution

A small table of 1 MB is small enough to fit into memory, so try a replicated join. A replicated join is map-only and, unlike the other join types, does not cause a Reduce stage, so it is immune to skew in the join keys. It should be quick.

big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';

The big table must always be the first one in the statement.

UPDATE 1: If the small table in its original form does not fit into memory, then as a workaround you would need to partition the small table into partitions that are small enough to fit into memory, and then apply the same partitioning to the big table. Ideally you could add the same partitioning algorithm to the system that creates the big table, so that you do not waste time repartitioning it. After partitioning you can use the replicated join, but it will require running the Pig script for each partition separately, along the lines of the sketch below.
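A minimal per-partition sketch, assuming a hypothetical integer join key, a simple modulo split into 4 buckets, and a $BUCKET parameter passed on the command line (pig -param BUCKET=0 ...); the real partitioning scheme would be whatever both the big-table producer and this script agree on:

-- load both sides; schemas and paths are placeholders
big   = LOAD 'big_data'   AS (b1:long, b2, b3);
small = LOAD 'small_data' AS (s1:long, s2, s3);

-- keep a single bucket on each side, using the same modulo rule on both keys
big_p   = FILTER big   BY b1 % 4 == $BUCKET;
small_p = FILTER small BY s1 % 4 == $BUCKET;

-- each small partition now fits in memory, so the map-only replicated join applies
C = JOIN big_p BY b1, small_p BY s1 USING 'replicated';
STORE C INTO 'joined/bucket_$BUCKET';

The script is then run once per bucket (BUCKET=0..3), and the per-bucket outputs together form the full join result.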

