Equivalent of linux 'diff' in Apache Pig


Problem Description


I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.

-- Load both files as single-column relations of lines
A = load 'A' as (line);
B = load 'B' as (line);
-- A full outer join pairs up identical lines; lines missing from one side come back null
JOINED = join A by line full outer, B by line;
-- Keep only the lines that appear in exactly one file
DIFF = FILTER JOINED by A::line is null or B::line is null;
-- Tag each surviving line with the side it came from
DIFF2 = FOREACH DIFF GENERATE (A::line is null ? B::line : A::line), (A::line is null ? 'REMOVED' : 'ADDED');
STORE DIFF2 into 'diff';


Anyone got any better ways to do this?

Recommended Answer

I use the following approaches. (My JOIN approach is very similar, but this method does not replicate the behavior of diff with duplicated lines.) As this was asked some time ago, perhaps you were using only one reducer, as Pig got an algorithm to adjust the number of reducers in 0.8 (see https://cwiki.apache.org/PIG/faq.html#FAQ-Q%253AHowdoImakemyPigjobsrunonaspecifiednumberofreducers%253F)?
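For reference, a minimal sketch of raising the reducer count in Pig 0.8+, either script-wide or for a single operation; the relation and file names are placeholders, not from the original question:

-- Script-wide default for every reduce phase (Pig 0.8+)
SET default_parallel 18;

A = LOAD 'A' AS (line: chararray);
B = LOAD 'B' AS (line: chararray);

-- Or override the reducer count for one operation via the PARALLEL clause
JOINED = JOIN A BY line FULL OUTER, B BY line PARALLEL 18;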


  • Both approaches I use are within a few percent of each other in performance but do not treat duplicates the same
  • The JOIN approach collapses duplicates (so, if one file has more duplicates than the other, this approach will not output the duplicates)
  • The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file
  • Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff while UNION performs sort <foo> | diff)
  • If you have an incredible (~thousands) number of duplicate lines, then things will slow down due to the joins (if your use allows, perform a DISTINCT on the raw data first)
  • If your lines are very long (e.g. >1KB in size), then it would be recommended to use the DataFu MD5 UDF and only difference over hashes, then JOIN with your original files to get the original rows back before outputting (a sketch follows this list)
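To illustrate that last bullet, here is a rough sketch of the hash-first idea using DataFu's MD5 UDF; the jar path, file names, and relation names are my assumptions rather than part of the original answer, and only one output side is shown (the other is symmetric):

-- Hypothetical jar path; adjust to your DataFu version
REGISTER 'datafu-1.2.0.jar';
DEFINE MD5 datafu.pig.hash.MD5();

a = LOAD 'a.csv' AS (Row: chararray);
b = LOAD 'b.csv' AS (Row: chararray);

-- Difference over the short hashes instead of the long lines
a_hash = FOREACH a GENERATE MD5(Row) AS Hash;
b_hash = FOREACH b GENERATE MD5(Row) AS Hash;
hash_joined = JOIN a_hash BY Hash FULL OUTER, b_hash BY Hash;
first_only_raw = FILTER hash_joined BY b_hash::Hash IS NULL;
first_only_hash = FOREACH first_only_raw GENERATE a_hash::Hash AS Hash;

-- JOIN back against the original file to recover the full lines
a_keyed = FOREACH a GENERATE MD5(Row) AS Hash, Row;
joined_back = JOIN first_only_hash BY Hash, a_keyed BY Hash;
first_only = FOREACH joined_back GENERATE a_keyed::Row AS Row;
STORE first_only INTO 'first_only' USING PigStorage();

This trades one extra join for a much smaller comparison key, which only pays off when the lines are long relative to the 32-character MD5 hex digest.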
Using JOIN:

SET job.name 'Diff(1) Via Join';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (First: chararray);
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Second: chararray);

-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;

-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
                    second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();

Using UNION:

SET job.name 'Diff(1)';

-- Erase Outputs
rmf first_only
rmf second_only

-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Row: chararray);
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS (Row: chararray);

a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;

-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;

-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'

counts = FOREACH c_group {
             firsts = FILTER combined BY File == 1;
             seconds = FILTER combined BY File == 2;
             GENERATE
                FLATTEN(
                        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
                            (COUNT(firsts) - COUNT(seconds) > 0 ?
                                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds))
                        )
                ) AS (Row, File); };

-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
                  second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
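To make the COUNT/TOP trick above concrete, a short walkthrough (my own illustration, not from the original answer) of how a duplicated line moves through the GROUP:

-- Suppose the line "foo" occurs 3 times in file 1 and once in file 2.
-- In foo's group: COUNT(firsts) = 3 and COUNT(seconds) = 1.
-- The difference 3 - 1 = 2 is positive, so the bincond evaluates
-- TOP(2, 0, firsts), which emits two ("foo", 1) tuples.
-- After FLATTEN and the SPLIT on File == 1, first_only contains
-- exactly the 2 surplus copies of "foo", matching diff(1)'s output.
-- Lines with equal counts hit the NULL_BAG placeholder (File == 0),
-- which neither branch of the SPLIT keeps.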

Performance


  • It takes roughly 10 minutes to difference over 200GB (1,055,687,930 rows) using LZO compressed input with 18 nodes.
  • Each approach only takes one Map/Reduce cycle.
  • This results in roughly 1.8GB diffed per node, per minute (not a great throughput, but on my system it seems diff(1) only operates in-memory, while Hadoop leverages streaming disks).
