How can I guess the encoding of a string in Perl?
Question
I have a Unicode string and don't know what its encoding is. When a Perl program reads this string, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters in the input. I found this snippet on a forum, and it is supposed to do that:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How can the above work when no input encoding is specified? Should the encoding be specified, as in the following?
my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});
Recommended answer
To find out which encoding some unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
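Encode::Guess, which ships with Perl, works differently from Encode::Detect: it does not analyse the data statistically but tests it against a list of candidate encodings. ASCII and UTF-8 are always tried; anything else has to be named as a "suspect". The following is a minimal sketch, assuming the sample sentence is known to be ASCII, UTF-8, or EUC-CN (the 'euc-cn' suspect is an assumption made for this sample, not something the module finds out by itself).

```perl
use strict;
use warnings;
use Encode;          # loads the encoding machinery
use Encode::Guess;   # exports guess_encoding()

# The sample sentence as raw octets; the two Chinese characters are GB2312 bytes.
my $octets = "This year I went to \x{b1}\x{b1}\x{be}\x{a9} Perl workshop.";

# ASCII and UTF-8 are default suspects; 'euc-cn' is added for this sketch.
my $guess = guess_encoding($octets, 'euc-cn');

# On failure guess_encoding() returns an error string instead of an object.
die "Cannot guess encoding: $guess\n" unless ref $guess;

# Turn the octets into a Perl character string.
my $text = $guess->decode($octets);

binmode STDOUT, ':encoding(UTF-8)';
printf "%s: %s\n", $guess->name, $text;
```

Note that guess_encoding() gives back a plain string, not an object, when no suspect fits or when several fit equally well, so the ref check should not be skipped.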
I find that encode('ascii', ...) is a lame solution for getting rid of non-ASCII characters. Everything is replaced with question marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as the plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
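If the goal really is to drop non-ASCII characters rather than transliterate them, the input has to be decoded first; Perl does not pick an encoding for you, and data read without an encoding layer is just octets. A minimal sketch, assuming the input encoding is already known (here UTF-8); strip_non_ascii is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: takes raw octets and the name of their encoding,
# returns a character string that contains ASCII only.
sub strip_non_ascii {
    my ($octets, $encoding) = @_;
    my $text = decode($encoding, $octets);   # octets -> characters
    $text = NFKD($text);                     # e.g. e-acute -> "e" + combining accent
    $text =~ s/[^\x00-\x7F]//g;              # drop everything outside ASCII
    return $text;
}

# "Café in 北京!" given as UTF-8 octets.
my $clean = strip_non_ascii("Caf\xc3\xa9 in \xe5\x8c\x97\xe4\xba\xac!", 'UTF-8');
print "$clean\n";   # prints "Cafe in !"
```

The decomposition step keeps the base letter of accented characters, while characters with no ASCII decomposition, such as the Chinese ones, disappear entirely.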
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either PERLQQ or XMLCREF.
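use utf8;
use Encode qw(encode PERLQQ XMLCREF LEAVE_SRC);
my $string = 'This year I went to 北京 Perl workshop.';
# LEAVE_SRC keeps encode() from modifying $string in place when a CHECK value is given.
print encode('ascii', $string, PERLQQ | LEAVE_SRC); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF | LEAVE_SRC); # This year I went to &#x5317;&#x4eac; Perl workshop.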
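One possible way to reverse the PERLQQ form later is a plain substitution over the escapes. This is only a sketch: unescape_perlqq is a hypothetical helper, not part of Encode, and it assumes the original text contained no literal \x{...} sequences of its own, since PERLQQ does not escape backslashes.

```perl
use strict;
use warnings;
use Encode qw(encode PERLQQ LEAVE_SRC);

# The sample sentence; \x{5317}\x{4eac} are the two Chinese characters.
my $string = "This year I went to \x{5317}\x{4eac} Perl workshop.";

# LEAVE_SRC keeps encode() from emptying $string in place.
my $escaped = encode('ascii', $string, PERLQQ | LEAVE_SRC);

# Hypothetical helper: turn every \x{HHHH} escape back into its character.
sub unescape_perlqq {
    my ($ascii) = @_;
    $ascii =~ s/\\x\{([0-9A-Fa-f]+)\}/chr hex $1/ge;
    return $ascii;
}

my $restored = unescape_perlqq($escaped);
print $restored eq $string ? "round trip ok\n" : "round trip failed\n";
```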