Equivalent of linux 'diff' in Apache Pig


Question

I want to be able to do a standard diff on two large files. I've got something that will work, but it's not nearly as quick as diff on the command line.

-- Load each file as a single-column relation of whole lines
A = LOAD 'A' AS (line:chararray);
B = LOAD 'B' AS (line:chararray);
-- Full outer join: a line present in only one file pairs with a null on the other side
JOINED = JOIN A BY line FULL OUTER, B BY line;
-- Keep only the unmatched lines
DIFF = FILTER JOINED BY A::line IS NULL OR B::line IS NULL;
-- Emit the surviving line, tagged by which side it came from
DIFF2 = FOREACH DIFF GENERATE (A::line IS NULL ? B::line : A::line), (A::line IS NULL ? 'REMOVED' : 'ADDED');
STORE DIFF2 INTO 'diff';
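For illustration (hypothetical input, not from the original question): if A holds the lines apple, banana, cherry and B holds banana, cherry, date, the full outer join leaves apple with a null B::line and date with a null A::line, so 'diff' receives:

apple	ADDED
date	REMOVED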

Anyone got any better ways to do this?

Answer

I use the following two approaches. (My JOIN approach is very similar to yours, but it does not replicate diff's behavior for duplicated lines.) Since this was asked some time ago, perhaps you were using only one reducer; Pig gained an algorithm for adjusting the number of reducers in 0.8.
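If reducer count was the issue, it can also be pinned explicitly (a minimal sketch; the value 20 is an arbitrary placeholder, not from the original answer):

-- Default reducer count for every reduce-side operator in the script
SET default_parallel 20;
-- Or override it per operator:
JOINED = JOIN A BY line FULL OUTER, B BY line PARALLEL 20;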

  • Both approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same.
  • The JOIN approach collapses duplicates (so, if one file has more duplicates than the other, this approach will not output the duplicate).
  • The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file.
  • Unlike the Unix diff(1) tool, order is not important (effectively, the JOIN approach performs sort -u <foo.txt> | diff while UNION performs sort <foo> | diff).
  • If you have a huge number (~thousands) of duplicate lines, things will slow down due to the joins (if your use allows, perform a DISTINCT on the raw data first).
  • If your lines are very long (e.g. >1KB in size), it is recommended to use the DataFu MD5 UDF, diff over the hashes only, then JOIN with your original files to get the original rows back before outputting (a sketch follows this list).
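A minimal sketch of that hashing variant (the jar path and input names are hypothetical; datafu.pig.hash.MD5 is the DataFu UDF referenced above):

REGISTER 'datafu.jar';
DEFINE MD5 datafu.pig.hash.MD5();

a = LOAD 'a.csv' AS (line:chararray);
b = LOAD 'b.csv' AS (line:chararray);

-- Diff over short MD5 hashes instead of the long lines themselves
-- (like the JOIN approach above, this collapses duplicates)
a_hashes = FOREACH a GENERATE MD5(line) AS hash;
b_hashes = FOREACH b GENERATE MD5(line) AS hash;
hash_join = JOIN a_hashes BY hash FULL OUTER, b_hashes BY hash;
a_only_raw = FILTER hash_join BY b_hashes::hash IS NULL;
a_only_hashes = FOREACH a_only_raw GENERATE a_hashes::hash AS hash;

-- JOIN back against the original file to recover the full rows
a_keyed = FOREACH a GENERATE MD5(line) AS hash, line;
joined_back = JOIN a_only_hashes BY hash, a_keyed BY hash;
a_only = FOREACH joined_back GENERATE a_keyed::line;
-- (The symmetric steps would produce the lines unique to b.)
STORE a_only INTO 'a_only';

Using JOIN:
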
SET job.name 'Diff(1) Via Join';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (First:chararray);
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Second:chararray);

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
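The two result sets can then be pulled back for inspection, e.g. (output paths depend on where the job writes):

hadoop fs -cat first_only/part-* | head
hadoop fs -cat second_only/part-* | head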

Using UNION:

SET job.name 'Diff(1)';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Row:chararray);
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Row:chararray);

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
-- The placeholder tuple's File value of 0 matches neither SPLIT branch below,
-- so rows whose counts are equal are silently dropped.
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             -- Separate each line's occurrences by source file
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             -- Equal counts: emit the placeholder (dropped by the SPLIT).
             -- Otherwise emit one copy per surplus occurrence, taken via TOP
             -- from whichever side has the extras.
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

Performance

  • It takes roughly 10 minutes to difference over 200GB (1,055,687,930 rows) using LZO-compressed input with 18 nodes.
  • Each approach only takes one Map/Reduce cycle.
  • This works out to roughly 1.8GB diffed per node, per minute (not a great throughput, but on my system it seems diff(1) operates only in memory, while Hadoop leverages streaming disks).

