Pass a relation to a PIG UDF when using FOREACH on another relation?

Question

We are using Pig 0.6 to process some data. One of the columns of our data is a space-separated list of ids (such as: 35 521 225). We are trying to map one of those ids using another file that contains two columns of mappings, like the following (column 1 is our id, column 2 is a third party's id):

35 6009
521 21599
225 51991
12 6129


We wrote a UDF that takes in the column value (so: "35 521 225") and the mappings from the file. It would then split the column value, iterate over each id, and return the first mapped value found in the passed-in mappings (we assumed that is how it would logically work).

We are loading the data in Pig like this:

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

mappings = LOAD 'mappings.txt' USING PigStorage() AS (ourId:chararray, theirId:chararray);

Then our GENERATE statement is:

output = FOREACH data GENERATE title, com.example.ourudf.Mapper(category, mappings);

However, the error we get is:

'there is an error during parsing: Invalid alias mappings in [data::title: chararray,data::category, chararray]'

It seems that Pig is trying to find a column called "mappings" in our original data, which of course isn't there. Is there any way to pass a loaded relation into a UDF?

Is there any way the "Map" type in Pig will help us here? Or do we need to somehow join the values?

To be more specific - we don't want to map ALL of the category ids to the third-party ids. We just want to map the first one. The UDF will iterate over the list of our category ids and return as soon as it finds the first mapped value. So if the input looked like:

someProduct\t35 521 225


the output would be:

someProduct\t6009
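
The "first mapped value" rule described above can be sketched in plain Java, independent of Pig. This is only an illustration under assumptions: the class name `FirstMappedId` and the method `firstMapped` are hypothetical, the mappings are held in an ordinary in-memory `Map`, and ids are assumed to be separated by whitespace.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Pig): illustrates the "first mapped id" rule only.
class FirstMappedId {

    // Returns the third-party id of the first of our ids that has a mapping,
    // or null when none of the ids is mapped.
    static String firstMapped(String categories, Map<String, String> mappings) {
        if (categories == null) {
            return null;
        }
        // Assumption: ids are separated by one or more whitespace characters.
        for (String id : categories.trim().split("\\s+")) {
            String mapped = mappings.get(id);
            if (mapped != null) {
                return mapped; // stop at the first hit
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> mappings = new HashMap<>();
        mappings.put("35", "6009");
        mappings.put("521", "21599");
        mappings.put("225", "51991");
        mappings.put("12", "6129");

        System.out.println(firstMapped("35 521 225", mappings)); // prints 6009
    }
}
```

With the sample mappings from the question, the category list "35 521 225" yields 6009 because 35 is the first id that has a mapping.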

Recommended answer

I don't think you can do it this way in Pig.

A solution similar to what you wanted to do would be to load the mapping file inside the UDF and then process each record in a FOREACH. An example is available in PiggyBank's LookupInFiles. It is recommended to use the DistributedCache instead of copying the file directly from the DFS.

DEFINE MAP_PRODUCT com.example.ourudf.Mapper('hdfs://../mappings.txt');

data = LOAD 'input.txt' USING PigStorage() AS (title:chararray, category:chararray);

output = FOREACH data GENERATE title, MAP_PRODUCT(category);
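
As a rough sketch of the "load the mapping file in the UDF" idea, the plain-Java class below reads the two-column mapping once into a `HashMap` and then answers lookups from memory. Everything here is an assumption made for illustration: the class `MappingLookup` and its methods are hypothetical, columns are assumed to be whitespace-separated, and the data comes from a generic `Reader` rather than from HDFS or the DistributedCache. In an actual UDF this logic would sit inside a class extending Pig's `EvalFunc`, with the file path received through the constructor argument given in DEFINE.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup table a UDF would build once and reuse.
class MappingLookup {
    private final Map<String, String> mappings = new HashMap<>();

    // Reads "ourId theirId" pairs, one per line; lines with fewer than two columns are skipped.
    static MappingLookup fromReader(Reader source) {
        MappingLookup table = new MappingLookup();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: columns are separated by tabs or spaces.
                String[] cols = line.trim().split("\\s+");
                if (cols.length >= 2) {
                    table.mappings.put(cols[0], cols[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return table;
    }

    // Returns the third-party id for one of our ids, or null if there is no mapping.
    String theirIdFor(String ourId) {
        return mappings.get(ourId);
    }

    public static void main(String[] args) {
        MappingLookup table = MappingLookup.fromReader(
                new StringReader("35\t6009\n521\t21599\n225\t51991\n12\t6129\n"));
        System.out.println(table.theirIdFor("521")); // prints 21599
    }
}
```

Building the table once and reusing it for every record is what keeps the per-record cost to a hash lookup; it also means the whole mapping has to fit in the memory of each task.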

This will work if your mapping file is not too big. If it does not fit in memory, you will have to either partition the mapping file and run the script several times, or tweak the mapping file's schema by adding a line number and use a native JOIN with a nested FOREACH ORDER BY/LIMIT 1 for each product.
