Pass a relation to a PIG UDF when using FOREACH on another relation?


Question



We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids using another file that contains 2 columns of mappings (so column 1 is our data's id, column 2 is a 3rd party's id):

35 6009
521 21599
225 51991
12 6129

We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. We would then split the column value, iterate over each id, and return the first mapped value from the passed-in mappings (thinking that is how it would logically work).

We are loading the data in PIG like this:

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);

Then our generate is:

output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);

However, the error we get is:

'there is an error during parsing: Invalid alias mappings in [data::title: chararray, data::category: chararray]'

It seems that Pig is trying to find a column called "mappings" on our original data, which of course isn't there. Is there any way to pass a relation that is loaded into a UDF?

Is there any way the "Map" type in PIG will help us here? Or do we need to somehow join the values?

EDIT: To be more specific - we don't want to map ALL of the category ids to the 3rd party ids. We just want to map the first one. The UDF will iterate over the list of our category ids and return when it finds the first mapped value. So if the input looked like:

someProduct\t35 521 225

the output would be:
someProduct\t6009

Solution

I don't think you can do it this way in Pig.

A solution similar to what you want to do would be to load the mapping file in the UDF and then process each record in a FOREACH. An example is available in PiggyBank's LookupInFiles. It is recommended to use the DistributedCache rather than copying the file directly from the DFS.

DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

output = FOREACH data GENERATE title, MAP_PRODUCT(category);
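For reference, here is a minimal sketch of what such a Mapper UDF could look like. This is an assumption-laden illustration, not the PiggyBank code: it loads the whole mapping file straight from HDFS on first use (the DistributedCache route mentioned above would be preferable in production), and the package/class name simply mirrors the com.example.ourudf.Mapper from the question:

package com.example.ourudf;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Mapper extends EvalFunc<String> {
    private final String mappingFile;      // path passed in via DEFINE
    private Map<String, String> mappings;  // ourId -> theirId, loaded lazily

    public Mapper(String mappingFile) {
        this.mappingFile = mappingFile;
    }

    // Load the two-column mapping file once per task, not once per record.
    private void loadMappings() throws IOException {
        mappings = new HashMap<String, String>();
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(mappingFile))));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("\\s+");  // e.g. "35  6009"
                if (cols.length == 2) {
                    mappings.put(cols[0], cols[1]);
                }
            }
        } finally {
            in.close();
        }
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        if (mappings == null) {
            loadMappings();
        }
        // "35 521 225" -> the mapped value of the first id that has one
        for (String id : ((String) input.get(0)).split("\\s+")) {
            String theirId = mappings.get(id);
            if (theirId != null) {
                return theirId;
            }
        }
        return null;  // no id in the list had a mapping
    }
}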

This will work if your mapping file is not too big. If it does not fit in memory you will have to partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native join plus a nested FOREACH ORDER BY/LIMIT 1 for each product, as sketched below.
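To make that fallback concrete, here is a rough Pig Latin sketch. It assumes the mapping file has been rewritten with a leading line number column (e.g. "1  35  6009"). Note the semantics differ slightly from the UDF: ordering by lineno keeps the mapping that appears first in the mapping file, not the mapping of the first id in the category list:

mappings = LOAD 'mappings.txt' USING PigStorage()
    AS (lineno:int, ourId:chararray, theirId:chararray);
data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

-- one row per (title, id); TOKENIZE splits on whitespace, FLATTEN unnests the bag
exploded = FOREACH data GENERATE title, FLATTEN(TOKENIZE(category)) AS ourId;

joined = JOIN exploded BY ourId, mappings BY ourId;
slim = FOREACH joined GENERATE exploded::title AS title,
                               mappings::lineno AS lineno,
                               mappings::theirId AS theirId;

-- keep a single mapping per product
grouped = GROUP slim BY title;
result = FOREACH grouped {
    ordered = ORDER slim BY lineno;
    first   = LIMIT ordered 1;
    GENERATE group AS title, FLATTEN(first.theirId);
};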
