How can I guess the encoding of a string in Perl?


Question

I have a Unicode string and don't know what its encoding is. When a Perl program reads this string, is there a default encoding that Perl will use? If so, how can I find out what it is?

I am trying to get rid of non-ASCII characters in the input. I found this snippet on a forum, which is supposed to do it:

my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});

How can the above work when no input encoding is specified? Should the encoding be specified, as in the following?

my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});

Recommended answer

To find out which encoding some unknown data uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)

use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
              "\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
              "\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
              "\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030

use Encode;
my $string = decode($encoding_name, $unknown);
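
Encode::Guess, which ships with Perl, takes a different approach: it only tests candidate encodings that you name, on top of its built-in defaults (ASCII, UTF-8, and BOM-marked UTF-16/32). Below is a minimal sketch using the same bytes as the example above. The suspect list (`euc-cn`) and the variable names are my own assumptions for illustration, not part of the original answer.

```perl
use strict;
use warnings;
use Encode::Guess;

# Same bytes as in the Encode::Detect example, written more compactly.
my $unknown = "This year I went to \xb1\xb1\xbe\xa9 Perl workshop.";

# guess_encoding() returns an encoding object on success, or a plain
# string describing the failure (no match, or several candidates match).
my $guess = guess_encoding($unknown, qw(euc-cn));
die "Cannot guess encoding: $guess\n" unless ref $guess;

my $guessed_name = $guess->name;
my $decoded      = $guess->decode($unknown);   # now a character string
print "$guessed_name\n";
```

If more than one suspect can decode the data cleanly, the return value is an error string rather than an object, so it pays to keep the suspect list short and always check `ref` before using the result.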

I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Every such character is replaced with a question mark; this is too lossy to be useful.

# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
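
For comparison, the snippet from the question avoids the question marks by decomposing characters first (NFKD) and then passing a callback as the third argument to `encode`, which decides what to substitute for each unencodable character. A minimal sketch of that variant follows; the sample string is my own, and the result is still lossy, because anything without an ASCII base letter simply disappears.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFKD);

# "Crème brûlée in 北京", written with escapes so the source file stays ASCII.
my $text = "Cr\x{e8}me br\x{fb}l\x{e9}e in \x{5317}\x{4eac}";

# NFKD splits accented letters into base letter + combining mark.
# The callback receives the code point of each unencodable character
# and returns its replacement; here, the empty string.
my $ascii = encode('ascii', NFKD($text), sub { '' });

print "$ascii\n";
```

The accents are stripped while the base letters survive, but the two Chinese characters vanish without a trace, which is why the approaches below are usually preferable.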

If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.

use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing  Perl workshop.

However, avoid those lossy encodings if you can help it. If you want to be able to reverse the operation later, pick either PERLQQ or XMLCREF.

use utf8;
use Encode qw(encode PERLQQ XMLCREF LEAVE_SRC);
my $string = 'This year I went to 北京 Perl workshop.';
# LEAVE_SRC stops encode() from overwriting $string in place when a CHECK flag is set.
print encode('ascii', $string, PERLQQ | LEAVE_SRC);  # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF | LEAVE_SRC); # This year I went to &#21271;&#20140; Perl workshop.
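
To illustrate what "reversing the operation" could look like, here is a minimal sketch that turns both escaped forms back into the original characters with a regex substitution. The helper names `undo_xmlcref` and `undo_perlqq` are hypothetical, written for this example; they are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode PERLQQ XMLCREF LEAVE_SRC);

my $original = "This year I went to \x{5317}\x{4eac} Perl workshop.";

# LEAVE_SRC keeps encode() from modifying $original in place.
my $xml_ascii = encode('ascii', $original, XMLCREF | LEAVE_SRC);
my $qq_ascii  = encode('ascii', $original, PERLQQ  | LEAVE_SRC);

# Hypothetical helper: turn decimal references like &#21271; back into characters.
sub undo_xmlcref {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# Hypothetical helper: turn \x{5317}-style escapes back into characters.
sub undo_perlqq {
    my ($text) = @_;
    $text =~ s/\\x\{([0-9A-Fa-f]+)\}/chr(hex $1)/ge;
    return $text;
}

my $round_trip_ok = undo_xmlcref($xml_ascii) eq $original
                 && undo_perlqq($qq_ascii)   eq $original;
print $round_trip_ok ? "round trip ok\n" : "round trip failed\n";
```

One caveat: if the original text already contained a literal `&#...;` or `\x{...}` sequence, these helpers would wrongly convert it too, so the round trip is only exact for input that is free of such sequences.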
