Split relation into different output files according to data in Pig

Problem description

Currently, my data looks like this:

1 A a
1 A b
2 B b
2 B c
3 A a
3 B b
3 C c

I want to store these records in different files depending on the value in the first column, so that the output looks like this:

1.out contains

A a
A b

2.out contains

B b
B c

3.out contains

A a
B b
C c

Is there any way to achieve this using Pig, with or without UDFs?

Thank you very much.

Recommended answer

I'm away from the cluster I use right now, so I can't be 100% sure, but this should be on the right track:

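-- Assuming myData.txt is formatted like:
-- 1 A b
-- 2 B c
-- etc.
-- MultiStorage lives in piggybank, so the jar must be registered first
-- (the path below is a placeholder; adjust it for your installation)
REGISTER piggybank.jar ;
A = LOAD 'myData.txt' USING PigStorage(' ') 
                      AS (number: int, val1: chararray, val2: chararray) ;
STORE A INTO 'myOutputDir'
        -- Splits on field 0; output fields are tab-separated by default
        USING org.apache.pig.piggybank.storage.MultiStorage('myOutputDir', '0') ;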

Done this way, three directories are created (1, 2, and 3), and each directory holds only the records whose first field matches the directory name. Each directory can still contain several files, one per mapper or reducer that wrote records for that key. MultiStorage also keeps field 0 in the stored records. So the output could look something like this:

--myOutputDir
|
|-->1
| |-->1-00000 #Contains 1 A a
| |-->1-00001 #Contains 1 A b
|
|-->2
| |-->2-00000 #Contains 2 B b
| |-->2-00001 #Contains 2 B c
|
|-->3
| |-->3-00000 #Contains 3 A a, 3 B b
| |-->3-00001 #Contains 3 C c
|

Contents of 3-00000:

3   A   a
3   B   b
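To make the grouping concrete, here is a rough pure-Python sketch of the same idea on the sample data. It is only an illustration, not how MultiStorage is implemented: `partition_by_first_field` is a hypothetical helper, and it assumes space-separated input that fits in memory. Like the Pig output above, each record keeps its first field.

```python
from collections import defaultdict


def partition_by_first_field(lines, sep=" "):
    """Group records by their first field, keeping that field in each record."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        groups[fields[0]].append(fields)
    return dict(groups)


# Sample data from the question
sample = ["1 A a", "1 A b", "2 B b", "2 B c", "3 A a", "3 B b", "3 C c"]
partitions = partition_by_first_field(sample)

for key in sorted(partitions):
    print("group", key)
    for fields in partitions[key]:
        # Tab-separated, first field still present
        print("\t".join(fields))
```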

Because you know the names of the output directories, you can load each one and drop the first field (or reformat the records however you wish):

-- Repeat this for all the numbers
A3 = LOAD 'myOutputDir/3' AS (number: int, val1: chararray, val2: chararray) ;
B3 = FOREACH A3 GENERATE val1, val2 ; 
STORE B3 INTO 'myOutputDir/stripped3' ;

So now the output will look like:

A    a
B    b
C    c
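Since the load/strip/store statements have to be repeated for every number, one option is to generate them. The sketch below is an assumption about how that might be done, not part of the original answer: `build_strip_script` and `PIG_TEMPLATE` are hypothetical names, and the keys are assumed to be plain alphanumeric values so that `A1`, `B1`, and so on are valid Pig aliases.

```python
# Template mirroring the three Pig statements shown above, with the key filled in
PIG_TEMPLATE = """\
A{key} = LOAD '{out_dir}/{key}' AS (number: int, val1: chararray, val2: chararray) ;
B{key} = FOREACH A{key} GENERATE val1, val2 ;
STORE B{key} INTO '{out_dir}/stripped{key}' ;
"""


def build_strip_script(keys, out_dir="myOutputDir"):
    """Return Pig Latin text that strips field 0 from each per-key directory."""
    return "\n".join(PIG_TEMPLATE.format(key=k, out_dir=out_dir) for k in keys)


script = build_strip_script([1, 2, 3])
print(script)
```

The printed text can be saved to a file and run with Pig as a second script once the MultiStorage job has finished.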

Note that, depending on the number of map tasks, the data for one key can still be split across several part files. If it all needs to be in a single file, I'd just recommend writing a script that merges the parts together. I use something like this (but obviously more general):

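import glob
import os

# Match every map-output part file for key 3 (part-m-00000, part-m-00001, ...)
partfiles = os.path.join('myOutputDir', 'stripped3', 'part-m-[0-9]*')
with open('part-m-COMPLETE-3', 'w') as outfile:
    # Sort so the parts are concatenated in a stable order
    for myfile in sorted(glob.glob(partfiles)):
        with open(myfile, 'r') as infile:
            for line in infile:
                outfile.write(line)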
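A more general version might look like the sketch below. It assumes the output has already been copied to the local filesystem (or that Pig ran in local mode) and that the per-key directories are named `stripped<key>` as above. `merge_parts`, `merge_all`, and the `merged-<key>.out` naming are hypothetical choices, not something the original answer specifies.

```python
import glob
import os
import shutil


def merge_parts(part_dir, merged_path, pattern="part-*"):
    """Concatenate every part file in part_dir into merged_path."""
    parts = sorted(glob.glob(os.path.join(part_dir, pattern)))
    with open(merged_path, "wb") as outfile:
        for part in parts:
            with open(part, "rb") as infile:
                shutil.copyfileobj(infile, outfile)
    return parts


def merge_all(output_dir="myOutputDir", prefix="stripped"):
    """Merge each '<prefix><key>' directory into '<output_dir>/merged-<key>.out'."""
    merged = {}
    for part_dir in sorted(glob.glob(os.path.join(output_dir, prefix + "*"))):
        if not os.path.isdir(part_dir):
            continue
        key = os.path.basename(part_dir)[len(prefix):]
        merged_path = os.path.join(output_dir, "merged-" + key + ".out")
        merge_parts(part_dir, merged_path)
        merged[key] = merged_path
    return merged


if __name__ == "__main__":
    for key, path in merge_all().items():
        print(key, "->", path)
```

The `part-*` pattern picks up both map output (`part-m-...`) and reduce output (`part-r-...`) while skipping marker files such as `_SUCCESS`.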
