Pig Script: Join with multiple files


Question

I am reading a big file (more than a billion records) and joining it with three other files. I was wondering whether the process can be made more efficient, to avoid multiple reads of the big table. The small tables may not fit in memory.

A = join smalltable1 by (f1,f2) RIGHT OUTER, massive by (f1,f2);
B = join smalltable2 by (f3) RIGHT OUTER, A by (f3);
C = join smalltable3 by (f4), B by (f4);

The alternative I was considering is to write a UDF and replace the values in a single read, but I am not sure a UDF would be efficient, since the small files won't fit in memory. The implementation could look like:

A = LOAD 'massive' AS (f1, f2, f3, f4);
B = FOREACH A GENERATE f1, udfToTranslateF1(f1), f2, udfToTranslateF2(f2), f3, udfToTranslateF3(f3);

Appreciate your thoughts...

Answer

Pig 0.10 introduced integration with Bloom filters: http://search-hadoop.com/c/Pig:/src/org/apache/pig/builtin/Bloom.java%7C%7C+%2522done+%2522exec+Tuple%2522

You can train a Bloom filter on the three smaller files and use it to filter the big file; hopefully this will produce a much smaller file. After that, perform standard joins to get 100% precision.
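A sketch of this approach using Pig's built-in `BuildBloom` and `Bloom` UDFs (the file names, schemas, and sizing parameters are assumptions for illustration; the single-column key f3 from smalltable2 is used here for simplicity):

```pig
-- Train a Bloom filter on smalltable2's join key.
-- Args: hash type, expected number of elements, desired false-positive rate
-- (assumed values -- tune for your data).
DEFINE bb BuildBloom('jenkins', '1000000', '0.01');
small2 = LOAD 'smalltable2' AS (f3, v);
grpd   = GROUP small2 ALL;
filt   = FOREACH grpd GENERATE bb(small2.f3);
STORE filt INTO 'mybloom';

-- In a later script, load the trained filter and pre-filter the big file
-- before running the standard joins.
DEFINE bloom Bloom('mybloom');
massive = LOAD 'massive' AS (f1, f2, f3, f4);
reduced = FILTER massive BY bloom(f3);
```

Because a Bloom filter can return false positives but never false negatives, the follow-up standard join on the reduced file still yields exact results.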

UPDATE 1: You would actually need to train 2 Bloom filters, one for each of the small tables, since you join on different keys.

UPDATE 2: It was mentioned in the comments that the outer join is used for augmenting data. In this case Bloom filters might not be the best fit: they are good for filtering, not for adding data in outer joins, because you want to keep the non-matched rows. A better approach would be to partition all the small tables on their respective fields (f1, f2, f3, f4) and store each partition in a separate file small enough to load into memory. Then GROUP the massive table BY (f1, f2, f3, f4) and, in a FOREACH, pass the group key (f1, f2, f3, f4) with its associated bag to a custom function written in Java that loads the respective partitions of the small files into RAM and performs the augmentation.
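The Pig side of that plan could be sketched as follows (the jar name, `AugmentUDF` class, its constructor argument, and the schemas are all hypothetical; the Java UDF itself would implement the load-partition-and-augment logic described above):

```pig
-- Group the massive table on all join keys, then hand each group key
-- and its bag to a custom Java UDF that loads the matching partitions
-- of the pre-partitioned small tables into RAM and augments the rows.
REGISTER 'myudfs.jar';  -- hypothetical jar containing the UDF
DEFINE augment com.example.AugmentUDF('small_table_partitions_dir');

massive = LOAD 'massive' AS (f1, f2, f3, f4);
grpd    = GROUP massive BY (f1, f2, f3, f4);
result  = FOREACH grpd GENERATE FLATTEN(augment(group, massive));
```

This way the massive table is read once, and only the small partitions relevant to each group key need to be resident in memory at a time.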
