Hive UDF Text to array


Problem Description

I'm trying to create some UDFs for Hive that give me more functionality than the already provided split() function.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LowerCase extends UDF {

  public Text evaluate(final Text text) {
    return new Text(stemWord(text.toString()));
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

This is working in Hive. I export this class into a jar file. Then I load it into Hive with

add jar /path/to/myJar.jar;

and create a function using

create temporary function lower_case as 'LowerCase';

I've got a table with a String field in it. The statement is then:

select lower_case(text) from documents;

But now I want to create a function returning an array (as e.g. split does).

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class WordSplit extends UDF {

  public Text[] evaluate(final Text text) {
    List<Text> splitList = new ArrayList<>();

    StringTokenizer tokenizer = new StringTokenizer(text.toString());

    while (tokenizer.hasMoreElements()) {
      Text word = new Text(stemWord((String) tokenizer.nextElement()));

      splitList.add(word);
    }

    return splitList.toArray(new Text[splitList.size()]);
  }

  /**
   * Stems words to normal form.
   * 
   * @param word
   * @return Stemmed word.
   */
  private String stemWord(String word) {
    word = word.toLowerCase();
    // Remove special characters
    // Porter stemmer
    // ...
    return word;
  }
}

Unfortunately this function does not work if I do the exact same loading procedure mentioned above. I'm getting the following error:

FAILED: SemanticException java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<>' but '>' is found.

As I haven't found any documentation mentioning this kind of transformation, I'm hoping that you will have some advice for me!

Solution

I don't think the 'UDF' interface will provide what you want. You want to use GenericUDF. I would use the source of the split UDF as a guide.

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop.hive/hive-exec/0.7.1-cdh3u1/org/apache/hadoop/hive/ql/udf/generic/GenericUDFSplit.java
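
For reference, here is a minimal sketch of what such a GenericUDF could look like, modelled loosely on GenericUDFSplit. The class name WordSplit and the stemWord stub are just placeholders carried over from the question, and the argument handling assumes the simplest case of a single string column. The key difference from a plain UDF is that the return type is not inferred by reflection from the Java signature but declared explicitly in initialize() as a standard list of strings, which Hive exposes as array<string>.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

public class WordSplit extends GenericUDF {

  private StringObjectInspector inputOI;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("word_split() takes exactly one argument");
    }
    // Assumes the argument is a string column; a robust version should verify this
    // instead of casting blindly.
    inputOI = (StringObjectInspector) arguments[0];
    // Declare the return type explicitly: array<string>
    return ObjectInspectorFactory.getStandardListObjectInspector(
        PrimitiveObjectInspectorFactory.writableStringObjectInspector);
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    String text = inputOI.getPrimitiveJavaObject(arguments[0].get());
    if (text == null) {
      return null;
    }
    List<Text> result = new ArrayList<Text>();
    StringTokenizer tokenizer = new StringTokenizer(text);
    while (tokenizer.hasMoreElements()) {
      result.add(new Text(stemWord((String) tokenizer.nextElement())));
    }
    return result;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "word_split(" + children[0] + ")";
  }

  /**
   * Stems words to normal form (placeholder: lower-casing only).
   */
  private String stemWord(String word) {
    return word.toLowerCase();
  }
}

The add jar and create temporary function steps stay exactly the same as before; the function should then return array<string>, so a call like select word_split(text) from documents; behaves like the built-in split() and can be fed into constructs such as LATERAL VIEW explode().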
