Hive UDF Text to array

Question

I'm trying to create a UDF for Hive that gives me more functionality than the built-in split() function.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerCase extends UDF {

  public Text evaluate(final Text text) {
    return new Text(stemWord(text.toString()));
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

This works in Hive. I export this class into a jar file and then load it into Hive with

add jar /path/to/myJar.jar;

and register it using

create temporary function lower_case as 'LowerCase';

I've got a table with a String field in it. The statement is then:

select lower_case(text) from documents;

But now I want to create a function that returns an array (as split does, for example).

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class WordSplit extends UDF {

  public Text[] evaluate(final Text text) {
    List<Text> splitList = new ArrayList<>();

    StringTokenizer tokenizer = new StringTokenizer(text.toString());

    while (tokenizer.hasMoreElements()) {
      Text word = new Text(stemWord((String) tokenizer.nextElement()));

      splitList.add(word);
    }

    return splitList.toArray(new Text[splitList.size()]);
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

Unfortunately, this function does not work when I follow exactly the same loading procedure as above. I get the following error:

FAILED: SemanticException java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<>' but '>' is found.

As I haven't found any documentation that mentions this kind of transformation, I'm hoping you will have some advice for me!

Recommended answer

I don't think the simple 'UDF' class will provide what you want. You want to use GenericUDF. I would use the source of the split UDF as a guide:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-exec/0.7.1-cdh3u1/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSplit.java
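
As a rough sketch of how the pieces could fit together (an illustration, not the code of GenericUDFSplit): a GenericUDF declares its return type in initialize() by returning an ObjectInspector, and evaluate() then returns a java.util.List for an array result. The Hive-independent part that evaluate() would call for each row might look like the following. The class name WordSplitCore and the method splitAndStem are hypothetical names chosen for this example, and stemWord only lowercases, as in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hive. In a GenericUDF, initialize() could
// return ObjectInspectorFactory.getStandardListObjectInspector(
//     PrimitiveObjectInspectorFactory.javaStringObjectInspector)
// to declare array<string>, and evaluate() could return the list built here.
class WordSplitCore {

  // Splits the text on whitespace and normalizes each token.
  // Assumption: null input yields an empty list; a real UDF would more
  // likely return null so that Hive produces SQL NULL.
  static List<String> splitAndStem(String text) {
    List<String> words = new ArrayList<>();
    if (text == null) {
      return words;
    }
    StringTokenizer tokenizer = new StringTokenizer(text);
    while (tokenizer.hasMoreTokens()) {
      words.add(stemWord(tokenizer.nextToken()));
    }
    return words;
  }

  // Placeholder for the question's stemming step; it only lowercases here.
  static String stemWord(String word) {
    return word.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(splitAndStem("Hello Hive UDF")); // prints [hello, hive, udf]
  }
}
```

Keeping the tokenizing logic separate from the Hive wiring means it can be tested without a Hive installation. The GenericUDF subclass itself then only needs to check its argument in initialize() and convert the input value to a String in evaluate().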
