Encoding in Pig
Problem description
When I load data containing certain characters (for example À, ° and others) with Pig Latin and store the result in a .txt file, those symbols show up in the file as � and ï characters. This happens because of the UTF-8 replacement character. Is there a way to avoid it, perhaps with some Pig commands, so that the output file contains, for example, À instead of �?
Recommended answer
Pig has built-in dynamic invokers that allow a Pig programmer to call Java functions without wrapping them in custom Pig UDFs. So you can load the data as UTF-8 encoded strings, decode it, perform all your operations on it, and then store it back as UTF-8. I guess this should work for the first part:
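-- Bind the static Java method URLDecoder.decode(String, String) to a Pig function
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
-- Load one encoded string per line
encoded_strings = LOAD 'encoded_strings.txt' AS (encoded:chararray);
-- Decode each value, interpreting the escaped bytes as UTF-8
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');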
The equivalent custom UDF in Java, which the dynamic invoker saves you from writing, is:
import java.io.IOException;
import java.net.URLDecoder;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UrlDecode extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String encoded = (String) input.get(0);
        String encoding = (String) input.get(1);
        return URLDecoder.decode(encoded, encoding);
    }
}
Now modify this code to return UTF-8 encoded strings from normal strings, and store the result in your text file. Hope it works.
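As a rough sketch of that reverse step, here is a plain Java class (no Pig dependency) that encodes a normal string with java.net.URLEncoder and checks that decoding gives the original back. The class name UrlCodecSketch and its helper methods are made up for illustration and are not part of Pig; the sample characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only (not part of Pig).
public class UrlCodecSketch {
    private static final String UTF8 = "UTF-8";

    // Percent-encode a plain string, writing its characters as UTF-8 bytes.
    static String encode(String plain) {
        try {
            return URLEncoder.encode(plain, UTF8);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(): turn percent-escapes back into characters.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, UTF8);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "\u00C0 \u00B0"; // "À °"
        String encoded = encode(plain);
        System.out.println(encoded);                       // %C3%80+%C2%B0
        System.out.println(decode(encoded).equals(plain)); // true
    }
}
```

Note that URLEncoder writes a space as "+". In a Pig script the same static method could presumably be bound the same way as the decoder, with InvokeForString('java.net.URLEncoder.encode', 'String String'), though I have not tested that here.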