Pig - passing Databag to UDF constructor

Question

I have a script that loads some data about venues:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);

Then I want to create a UDF whose constructor accepts the venues relation.

So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);

And here is the actual UDF:

public class GenerateVenues extends EvalFunc<Tuple> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()){
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue: venues)
            {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS))
                {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When executed, the script raises an error at the DEFINE statement, just before (venues);:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60>  mismatched input 'venues' expecting RIGHT_PAREN

Obviously I'm doing something wrong. Can you help me figure out what it is? Is it that a UDF cannot accept the venues relation as a parameter? Or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!

PS I'm using Pig version 0.11.1.1.3.0.0-107.

Recommended answer

As @WinnieNicklaus already said, you can only pass strings to UDF constructors.

Having said that, the solution to your problem is to use the distributed cache. Override public List<String> getCacheFiles() to return a list of filenames that will be made available via the distributed cache. You can then read the file as a local file and build your table.

The downside is that Pig has no initialization function, so you have to implement something like

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // make sure the table is only read once
    }
}

and then call that as the first thing from exec.
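
To make the lazy-initialization idea concrete, here is a minimal sketch in plain Java with no Pig dependency. The class VenueMatcher, its field names and the one-venue-per-line file layout are all assumptions made for illustration; they are not part of Pig or of the code in the question. In a real UDF this logic would live in the class extending EvalFunc: the constructor would take the file name as a String (the only kind of argument DEFINE can pass), getCacheFiles() would return the file to ship, and exec would call the matching method. The sketch also swaps the regex-based matches call from the question for a literal String.contains check, so venue names are not interpreted as regular expressions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: it holds the lazy-loading logic that a
// real UDF (a class extending EvalFunc) would call from exec().
class VenueMatcher {
    // A string is the only kind of constructor argument DEFINE can pass.
    private final String localFileName;
    private final List<String> venues = new ArrayList<>();
    private boolean initialized = false;

    VenueMatcher(String localFileName) {
        this.localFileName = localFileName;
    }

    // Reads the venue names once, on first use (assumed layout: one name per line).
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localFileName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Returns the first venue whose name occurs in the text, or null if none does.
    String match(String text) throws IOException {
        init(); // first thing, as the answer suggests for exec
        if (text == null) {
            return null;
        }
        for (String venue : venues) {
            if (text.contains(venue)) {
                return venue;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file that the distributed cache would place locally.
        Path file = Files.createTempFile("venues", ".txt");
        Files.write(file, Arrays.asList("Hydro", "Barrowland"), StandardCharsets.UTF_8);
        VenueMatcher matcher = new VenueMatcher(file.toString());
        System.out.println(matcher.match("Great gig at the Hydro tonight"));
        Files.delete(file);
    }
}
```

Because the file is opened on the first call to match rather than in the constructor, constructing the object stays cheap and works even where the cached file is not yet present; that is the reason for deferring the read to exec in a UDF.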
