Encoding in Pig

Question

When I load data that contains certain characters (for example À, ° and others) using Pig Latin and store it in a .txt file, those symbols are displayed in the output file as � and ï characters. I believe this happens because they are replaced by the UTF-8 substitution character. Is it possible to avoid this somehow, maybe with some Pig commands, so that the resulting txt file contains, for example, À instead of �?
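The symptom can be reproduced outside Pig. The sketch below is only an illustration under an assumption the question does not confirm: that the input file is encoded as ISO-8859-1 but is read as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ReplacementCharDemo {

    // Decode raw bytes as UTF-8; Java substitutes U+FFFD for malformed input.
    static String readAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Show what the UTF-8 bytes of a string look like in an ISO-8859-1 viewer.
    static String viewAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A-grave stored as ISO-8859-1 is the single byte 0xC0 (assumed input encoding).
        byte[] latin1Bytes = "\u00C0".getBytes(StandardCharsets.ISO_8859_1);

        // 0xC0 on its own is not valid UTF-8, so it is replaced on decoding.
        String misread = readAsUtf8(latin1Bytes);
        System.out.println("Became replacement char: " + misread.equals("\uFFFD"));

        // The replacement character is written as bytes EF BF BD, and the first
        // of those bytes is i-diaeresis when the file is viewed as ISO-8859-1.
        String shown = viewAsLatin1(misread);
        System.out.println("Starts with i-diaeresis: " + shown.startsWith("\u00EF"));
    }
}
```

If this is the cause, the original character is already lost by the time the data is stored, so the fix would have to happen when the data is read.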

Solution

Pig has built-in dynamic invokers that allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. I guess this should work for the first part:

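    -- Bind the static Java method URLDecoder.decode(String, String) through a dynamic invoker.
    DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
    -- Load one encoded string per line.
    encoded_strings = LOAD 'encoded_strings.txt' AS (encoded:chararray);
    -- Decode every record using the UTF-8 charset.
    decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');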
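To see what the invoker call does to each record, here is a minimal plain-Java sketch that calls the same `java.net.URLDecoder.decode(String, String)` method directly. The class name `UrlDecodeDemo` and the sample inputs are made up for illustration; note that this method reverses percent-encoding (`%C3%80` style input), so it only helps if the data was stored in that form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UrlDecodeDemo {

    // Calls the same static method that the InvokeForString binding refers to.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // %C3%80 is the percent-encoded UTF-8 form of A-grave (U+00C0).
        System.out.println(decode("%C3%80").equals("\u00C0"));
        // %C2%B0 is the percent-encoded UTF-8 form of the degree sign (U+00B0).
        System.out.println(decode("25%C2%B0C").equals("25\u00B0C"));
        // A plus sign is decoded to a space, which matters for ordinary text.
        System.out.println(decode("a+b").equals("a b"));
    }
}
```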

The Java code that does the same thing as a custom UDF is:

    import java.io.IOException;
    import java.net.URLDecoder;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class UrlDecode extends EvalFunc<String> {

        @Override
        public String exec(Tuple input) throws IOException {
            String encoded = (String) input.get(0);
            String encoding = (String) input.get(1);
            return URLDecoder.decode(encoded, encoding);
        }
    }

Now modify this code to return UTF-8 encoded strings from normal strings, and store the result in your text file. Hope it works.
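One possible reading of that last step is to use the counterpart method `java.net.URLEncoder.encode(String, String)`, which by analogy could presumably be bound in Pig with `InvokeForString('java.net.URLEncoder.encode', 'String String')`. The plain-Java sketch below only shows what that method returns. The class name `Utf8UrlCodec` is hypothetical, and the result is percent-escaped ASCII text, not raw UTF-8 characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class Utf8UrlCodec {

    // Percent-encode a normal string using its UTF-8 bytes.
    static String encode(String plain) {
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode, using the same method as the UrlDecode examples.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "25\u00B0C \u00C0"; // degree sign, space, A-grave
        String encoded = encode(original);
        System.out.println(encoded);
        // Decoding the encoded form gives back the original string.
        System.out.println(decode(encoded).equals(original));
    }
}
```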
