How can I guess the encoding of a string in Perl?

Question

I have a Unicode string and don't know what its encoding is. When a Perl program reads this string, is there a default encoding that Perl will use? If so, how can I find out what it is?

I am trying to get rid of non-ASCII characters in the input. I found this on a forum, and it is supposed to do that:

my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});

How will the above work when no input encoding is specified? Should it be specified like the following?

my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});

Recommended answer

To find out which encoding some unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)

use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
              "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
              "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
              "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030

use Encode;
my $string = decode($encoding_name, $unknown);

I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Every character that ASCII cannot represent is replaced with a question mark; this is too lossy to be useful.

# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
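
If silently dropping characters really is what you want, as in the question, the input has to be decoded before it is normalized. Below is a minimal sketch of that pipeline. It assumes the input bytes are UTF-8 (an assumption; substitute the detected encoding name) and uses made-up sample data. Note that decode takes the encoding name first and the data second, and that NFKD is the function form of normalize('KD', ...).

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFKD);

# Hypothetical input: the UTF-8 bytes for "café déjà vu".
my $bytes = "caf\xc3\xa9 d\xc3\xa9j\xc3\xa0 vu";

# 1. Bytes to characters; the encoding name comes first, the data second.
my $text = decode('UTF-8', $bytes);

# 2. Compatibility decomposition splits "é" into "e" plus a combining accent.
my $decomposed = NFKD($text);

# 3. Encode to ASCII; the callback returns the replacement for each
#    character ASCII cannot represent, here the empty string.
my $line = encode('ascii', $decomposed, sub { '' });

print "$line\n";   # cafe deja vu
```

This keeps the base letters of accented Latin characters, but characters without an ASCII decomposition, such as 北京, simply vanish.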

If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.

use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing  Perl workshop.

However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either PERLQQ or XMLCREF.

use utf8;
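use Encode qw(encode PERLQQ XMLCREF LEAVE_SRC);
my $string = 'This year I went to 北京 Perl workshop.';
# Without LEAVE_SRC, a non-zero CHECK value may overwrite $string in place.
print encode('ascii', $string, PERLQQ | LEAVE_SRC), "\n";  # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF | LEAVE_SRC), "\n"; # This year I went to &#x5317;&#x4eac; Perl workshop.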
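
Reversing the XMLCREF form later can be sketched as follows. The substitution regex is my own illustration rather than an Encode feature, and it assumes the original text contained no literal "&#x...;" sequences, which would be expanded as well.

```perl
use strict;
use warnings;
use Encode qw(encode decode XMLCREF LEAVE_SRC);

# "\x{5317}\x{4eac}" is the same two-character word (北京) used above.
my $string = "This year I went to \x{5317}\x{4eac} Perl workshop.";

# Forward: characters outside ASCII become hexadecimal character references.
# LEAVE_SRC keeps encode() from overwriting $string.
my $ascii = encode('ascii', $string, XMLCREF | LEAVE_SRC);

# Reverse: decode the ASCII bytes, then expand each reference again.
my $restored = decode('ascii', $ascii);
$restored =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;

print $restored eq $string ? "round trip ok\n" : "round trip failed\n";
```

For real XML or HTML input, a proper parser is a safer way to expand character references than a regex.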
