Pig - passing Databag to UDF constructor


Question

I have a script which is loading some data about venues:

venues = LOAD 'venues_extended_2.csv' USING org.apache.pig.piggybank.storage.CSVLoader() AS (Name:chararray, Type:chararray, Latitude:double, Longitude:double, City:chararray, Country:chararray);

Then I want to create a UDF whose constructor accepts the venues relation.

So I tried to define the UDF like this:

DEFINE GenerateVenues org.gla.anton.udf.main.GenerateVenues(venues);

This is the actual UDF:

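import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class GenerateVenues extends EvalFunc<Tuple> {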

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    private static final String ALLCHARS = "(.*)";
    private ArrayList<String> venues;

    private String regex;

    public GenerateVenues(DataBag venuesBag) {
        Iterator<Tuple> it = venuesBag.iterator();
        venues = new ArrayList<String>((int) (venuesBag.size() + 1)); // possible fails!!!
        String current = "";
        regex = "";
        while (it.hasNext()){
            Tuple t = it.next();
            try {
                current = "(" + ALLCHARS + t.get(0) + ALLCHARS + ")";
                venues.add((String) t.get(0));
            } catch (ExecException e) {
                throw new IllegalArgumentException("VenuesRegex: requires tuple with at least one value");
            }
            regex += current + (it.hasNext() ? "|" : "");
        }
    }

    @Override
    public Tuple exec(Tuple tuple) throws IOException {
        // expect one string
        if (tuple == null || tuple.size() != 2) {
            throw new IllegalArgumentException(
                    "BagTupleExampleUDF: requires two input parameters.");
        }
        try {
            String tweet = (String) tuple.get(0);
            for (String venue: venues)
            {
                if (tweet.matches(ALLCHARS + venue + ALLCHARS))
                {
                    Tuple output = mTupleFactory.newTuple(Collections.singletonList(venue));
                    return output;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IOException(
                    "BagTupleExampleUDF: caught exception processing input.", e);
        }
    }
}

When executed, the script raises an error at the DEFINE statement, just before (venues);:

2013-12-19 04:28:06,072 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 6, column 60>  mismatched input 'venues' expecting RIGHT_PAREN

Obviously I'm doing something wrong. Can you help me figure out what it is? Is it that a UDF cannot accept the venues relation as a parameter, or is the relation not represented by a DataBag, as in public GenerateVenues(DataBag venuesBag)? Thanks!

PS: I'm using Pig version 0.11.1.1.3.0.0-107.

Recommended answer

As @WinnieNicklaus already said, you can only pass strings to UDF constructors.

That said, the solution to your problem is to use the distributed cache. Override public List<String> getCacheFiles() to return the list of filenames that should be made available through the distributed cache. You can then read the file as a local file and build your table.

The downside is that Pig has no initialization function for UDFs, so you have to implement something like

private void init() {
    if (!this.initialized) {
        // read table
        this.initialized = true; // so the table is only read once
    }
}

and then call it as the first thing in exec.
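To make the pattern concrete, here is a minimal plain-Java sketch of the lazy initialization described above. It is deliberately not a Pig UDF, so it can be compiled and run without the Pig jars. The class name VenueMatcher, the method find, and the one-venue-name-per-line file layout are assumptions made for illustration. In a real UDF the class would extend EvalFunc<Tuple>, the body of find would live in exec, and getCacheFiles() would be overridden so that the file is shipped through the distributed cache. The sketch also uses a plain substring check rather than the regex matching from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the lazy-initialization pattern; NOT a Pig UDF.
// Hypothetical names: VenueMatcher, find, venuesPath.
class VenueMatcher {
    // A String is the only kind of argument a UDF constructor can receive.
    private final String venuesPath;
    private final List<String> venues = new ArrayList<>();
    private boolean initialized = false;

    VenueMatcher(String venuesPath) {
        this.venuesPath = venuesPath;
    }

    // Reads the venue names once, on first use.
    // Assumption: the file holds one venue name per line.
    private void init() throws IOException {
        if (initialized) {
            return;
        }
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(venuesPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty()) {
                    venues.add(name);
                }
            }
        }
        initialized = true;
    }

    // Returns the first venue whose name occurs in the tweet, or null if none does.
    String find(String tweet) throws IOException {
        init(); // first thing, mirroring the advice for exec
        if (tweet == null) {
            return null;
        }
        for (String venue : venues) {
            if (tweet.contains(venue)) {
                return venue;
            }
        }
        return null;
    }
}
```

With this shape, the DEFINE statement would pass a file path as a quoted string instead of the venues relation, and the table would be built the first time a tuple is processed.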
