Split relation into different output files according to data in Pig

Question

Currently, my data looks like this:

1 A a
1 A b
2 B b
2 B c
3 A a
3 B b
3 C c

I want to store these rows in different files depending on the value in the first column. So, I would like my output to be similar to this:

1.out contains:

A a
A b

2.out contains:

B b
B c

3.out contains:

A a
B b
C c

Is there any way to achieve this using Pig, with or without UDFs?

Thanks a lot.

Recommended answer

I'm away from the cluster I use right now, so I can't be 100% sure, but this should be on the right path:

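-- Assuming myData.txt is formatted like:
-- 1 A b
-- 2 B c
-- etc.
-- MultiStorage lives in piggybank, so the jar has to be registered first
-- (the path below is an assumption; point it at your own piggybank.jar).
REGISTER piggybank.jar ;
A = LOAD 'myData.txt' USING PigStorage(' ')
                      AS (number: int, val1: chararray, val2: chararray) ;
-- Splits on field 0; the output is written with \t as the field separator
STORE A INTO 'myOutputDir'
        USING org.apache.pig.piggybank.storage.MultiStorage('myOutputDir', '0') ;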

If you do it this way, 3 directories will be created (1, 2, and 3), and each directory will contain only the rows whose first field matches the directory name. However, each of these directories can hold many different files (one for each mapper/reducer). Additionally, field 0 will also be stored in the output. So, the output could look something like this:

--myOutputDir
|
|-->1
| |-->1-00000 #Contains 1 A a
| |-->1-00001 #Contains 1 A b
|
|-->2
| |-->2-00000 #Contains 2 B b
| |-->2-00001 #Contains 2 B c
|
|-->3
| |-->3-00000 #Contains 3 A a, 3 B b
| |-->3-00001 #Contains 3 C c
|

Contents of 3-00000:

3   A   a
3   B   b

However, because you know the names of the output directories, you can load each one you created and format it as you wish:

-- Repeat this for all the numbers
A3 = LOAD 'myOutputDir/3' AS (number: int, val1: chararray, val2: chararray) ;
B3 = FOREACH A3 GENERATE val1, val2 ; 
STORE B3 INTO 'myOutputDir/stripped3' ;

So now the output will look like:

A    a
B    b
C    c
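Repeating those three statements by hand for every number gets tedious. One option is to generate them; the sketch below is only an illustration, and the helper name `build_strip_script` and its `out_dir` parameter are made up here rather than part of Pig or piggybank. It assumes the keys are values that are valid inside a Pig alias, such as the integers in this example.

```python
def build_strip_script(keys, out_dir="myOutputDir"):
    """Return Pig Latin that drops field 0 from each per-key directory.

    Hypothetical helper: it only builds text, it does not run Pig.
    """
    statements = []
    for key in keys:
        # Same three statements as the hand-written A3/B3 example above
        statements.append(
            f"A{key} = LOAD '{out_dir}/{key}' "
            "AS (number: int, val1: chararray, val2: chararray) ;"
        )
        statements.append(f"B{key} = FOREACH A{key} GENERATE val1, val2 ;")
        statements.append(f"STORE B{key} INTO '{out_dir}/stripped{key}' ;")
    return "\n".join(statements)
```

Writing the result of `build_strip_script([1, 2, 3])` to a file gives a script that stores the stripped rows under myOutputDir/stripped1, myOutputDir/stripped2 and myOutputDir/stripped3.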

But depending on the number of mapper jobs, the data can still be split among several files. If it all needs to be in the same file, I'd just recommend writing a script that merges the parts together. I use something like this (but obviously more general):

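import os
import glob

# Match every map-output part file, e.g. part-m-00000, part-m-00001, ...
partfiles = os.path.join('myOutputDir', 'stripped3', 'part-m-[0-9]*')
with open('part-m-COMPLETE-3', 'w') as outfile:
    # sorted() keeps the parts in a stable order; glob's own order is arbitrary
    for myfile in sorted(glob.glob(partfiles)):
        with open(myfile, 'r') as infile:
            for line in infile:
                outfile.write(line)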
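A more general variant could loop over all the keys and write one merged file per key, which gets close to the 1.out, 2.out, 3.out layout asked for in the question. This is only a sketch under some assumptions: the function names `merge_parts` and `merge_all` and the `<key>.out` naming are invented for illustration, and the output directory is assumed to sit on the local filesystem (for example after copying it out of HDFS).

```python
import glob
import os
import shutil


def merge_parts(part_dir, merged_path, pattern="part-*"):
    """Concatenate every part file in part_dir into merged_path.

    Returns the list of part files in the order they were merged.
    """
    # "part-*" matches both part-m-* and part-r-* files but skips
    # _SUCCESS and hidden .crc files; sorted() keeps the order stable.
    parts = sorted(glob.glob(os.path.join(part_dir, pattern)))
    with open(merged_path, "wb") as outfile:
        for part in parts:
            with open(part, "rb") as infile:
                shutil.copyfileobj(infile, outfile)
    return parts


def merge_all(out_dir, keys):
    """Merge out_dir/stripped<key>/part-* into out_dir/<key>.out per key."""
    for key in keys:
        merge_parts(
            os.path.join(out_dir, f"stripped{key}"),
            os.path.join(out_dir, f"{key}.out"),
        )
```

Calling `merge_all('myOutputDir', [1, 2, 3])` would then leave myOutputDir/1.out, myOutputDir/2.out and myOutputDir/3.out next to the stripped directories.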
