Create multiple columns from single Hive UDF


Problem Description



I am using Amazon EMR and Hive 0.11. I am trying to create a Hive UDF that will return multiple columns from one UDF call.

For example, I would like to call a UDF like the one below and be returned several (named) columns.

SELECT get_data(columnname) FROM table;

I am having trouble finding documentation of this being done, but have heard it is possible if using a Generic UDF. Does anyone know what needs to be returned from the evaluate() method for this to work?

Solution

I just use GenericUDTF. After you write a UDF that extends GenericUDTF, your UDTF should implement two important methods: initialize and process.

  • In initialize, you can check the argument types and set the return object type. For example, with ObjectInspectorFactory.getStandardStructObjectInspector you specify the output columns: the names come from the structFieldNames argument and the column value types from structFieldObjectInspectors. The number of output columns is the size of the structFieldNames list. There are two type systems, Java and Hadoop: Java ObjectInspectors begin with javaXXObjectInspector, while the Hadoop (writable) ones begin with writableXXObjectInspector.
  • process plays the role of evaluate in a common UDF, except that you should use the ObjectInspectors saved from initialize() to convert each incoming Object to a concrete value such as a String, an Integer, etc. Call the forward function to output a row; the row object forwardColObj holds the column values.

The following is a simple example:


import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

public class UDFExtractDomainMethod extends GenericUDTF {

    // the output column count
    private static final Integer OUT_COLS = 2;
    // reused row buffer passed to forward() for each output row
    private transient Object[] forwardColObj = new Object[OUT_COLS];

    // ObjectInspectors for the input arguments, saved in initialize()
    private transient ObjectInspector[] inputOIs;

    /**
     * Checks that the arguments are valid and defines the output schema.
     *
     * @param argOIs ObjectInspectors for the call's arguments.
     * @return the output column structure (two string columns: host, method).
     * @throws UDFArgumentException if the argument is not a single string.
     */
    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        if (argOIs.length != 1 || argOIs[0].getCategory() != ObjectInspector.Category.PRIMITIVE
                || !argOIs[0].getTypeName().equals(serdeConstants.STRING_TYPE_NAME)) {
            throw new UDFArgumentException("split_url only takes one argument with type of string");
        }

        inputOIs = argOIs;
        List<String> outFieldNames = new ArrayList<String>();
        List<ObjectInspector> outFieldOIs = new ArrayList<ObjectInspector>();
        outFieldNames.add("host");
        outFieldNames.add("method");
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        // writableStringObjectInspector corresponds to hadoop.io.Text
        outFieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(outFieldNames, outFieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        try {
            // need the OI saved in initialize() to convert the raw Object to a Java String
            String inUrl = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(objects[0]);
            URI uri = new URI(inUrl);
            forwardColObj[0] = uri.getHost();
            forwardColObj[1] = uri.getRawPath();
            // output one row with two columns
            forward(forwardColObj);
        } catch (URISyntaxException e) {
            // malformed URLs are logged and skipped; no row is forwarded for them
            e.printStackTrace();
        }
    }

    @Override
    public void close() throws HiveException {
        // nothing to clean up
    }
}
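
For completeness, here is a sketch of how the UDTF above could be registered and called from Hive. The jar path, function name (split_url), and table/column names (access_log, url) are placeholders for illustration:

-- placeholder jar path; point this at your built artifact
ADD JAR /path/to/your-udtf.jar;
CREATE TEMPORARY FUNCTION split_url AS 'UDFExtractDomainMethod';

-- a UDTF can be selected on its own, naming its output columns
SELECT split_url(url) AS (host, method) FROM access_log;

-- or combined with other columns via LATERAL VIEW
SELECT t.url, u.host, u.method
FROM access_log t
LATERAL VIEW split_url(t.url) u AS host, method;

Note that Hive does not allow a UDTF call to be mixed with other select expressions directly, which is why the LATERAL VIEW form is needed when you want the generated columns alongside existing ones.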
