Why is my Perl program failing with Tie::File and Unicode/UTF-8 encoding?

Problem description

I am working on a project which deals with data in foreign languages. My Perl scripts were running fine.

I then wanted to use Tie::File, since this is a neat concept (and saves time and coding).

It seems that Tie::File is failing under Unicode/UTF-8 (unless I am missing something).

Here is a program which depicts the problem (the data is a mix of English, Greek and Hebrew):

use strict;
use warnings;
use 5.014;
use Win32::Console;
use autodie;
use warnings qw< FATAL utf8 >;
use Carp;
use Carp::Always;
use utf8;
use feature        qw< unicode_strings>;
use charnames      qw< :full>;
use Tie::File;

my ($i);
my ( $FileName);
my (@Tied);
binmode STDOUT, ':unix:utf8';
binmode STDERR, ':unix:utf8';
binmode $DB::OUT, ':unix:utf8' if $DB::OUT; # for the debugger
Win32::Console::OutputCP(65001);         # Set the console code page to UTF8

$FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\Work\\'.
        'Tie File test res.txt';
tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
            or confess 'tie @Tied failed';
$i =0;
while (<DATA>) {
    chomp;
    $Tied[$i] = $_;
    ++$i;
} # end while (<DATA>) 
$i =0;
foreach (@Tied) {
    say "$i $Tied[$i]";
    ++$i;
} # end foreach (@Tied)
untie @Tied;
__DATA__
τι κάνετε;
πάρτε το ή αφήστε το
שלום חברים
abc לא כןכן efg
מתי ולאן This is it
מעכשיו לעכשיו 
Σήμερα είναι Τρίτη
Θέλω να φάω
τι κάνετε;
שורה מס' 5

This produces a huge cascade of warnings; here are some:

utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31

Then it prints this on STDOUT:

0 τι κάνετε;
1 πάρτε το ή αφήστε το
2 שלום חברים
3 abc לא כןכן efg
4 מתי ולאן This is it
5 מעכשיו לעכשיו
6 Σήμερα είναι Τρίτη
7 Θέλω να φάω
8 τι κάνετε;
9 שורה מס' 5
10
11
12
13
14 \xA4\xΘέλω\xA8\x

15
16
17
18

19

Note that the first 10 lines (0 through 9) are OK, but lines 10 through 19 came from nowhere! In addition, the tied file itself contains corrupted data:

 τι κάνϏN͏Ŏՠτήστε של חברءbc לؗܗࠗܗߠeמתולאן This is מעיו לעכ؎Ďώݎ֏ναι ΤρΘέώގѠφϏŎ٠κτε;שרה מס'



\xA4\xΘέλω\xA8\x

Something is very wrong here. Either I am missing something, or Tie::File can't cope with Unicode/UTF-8. I am running Strawberry Perl 5.14 on a Windows 7 system.

Many TIA, Helen

Note: also posted on http://perlmonks.org/?node_id=1002104

Recommended answer

The suggestion I would make depends very much on the actual problem you're trying to solve. Looking at this question in isolation, I would drop most of the encoding/decoding 'magic' and simply use the raw bytes, since the script doesn't need to know anything about the characters themselves for this task. The code below produces the expected result given the input and output you described.

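use v5.014;
use warnings;
use autodie;

use Carp::Always;
use Tie::File;

my $file_in = 'test_in.txt';
my $file_out = 'test_tie.txt';

# Start from an empty output file. The existence check matters because,
# under autodie, unlink dies when there is nothing to delete.
unlink $file_out if -e $file_out;

# No discipline option: Tie::File reads and writes raw bytes.
tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A" or die 'tie failed';

# Read the input as raw bytes too, so nothing is ever decoded.
open my $fh, '<', $file_in;
while (my $line = <$fh>) {
    chomp $line;
    push @tied, $line;
}
close $fh;

# Print each record with its index; the bytes pass through unchanged.
my $i = 0;
say $i++ . ' ' . $_ foreach @tied;

untie @tied;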

However, you probably do want to do some processing on that text in the middle, in which case you want decoded characters. As I see it, there are two options:

  1. Encode manually before handing the data over to the tied array
  2. Find out what the problem is with Tie::File

Number 2 is probably non-trivial. From a quick scan of the Tie::File source, it looks like it assumes it will always be given bytes. The only part that you can seemingly affect is the binmode at https://metacpan.org/source/TODDR/Tie-File-0.98/lib/Tie/File.pm#L111, which you are already doing through the discipline option.

Tie::File does a lot of seek calls, and perldoc has this to say on seek ( http://perldoc.perl.org/functions/seek.html ):

Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow).

So it appears that Tie::File is using character lengths to determine the byte offsets of its records. It can therefore end up seeking into the middle of a UTF-8 byte sequence, which seems a likely cause of your errors.
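The mismatch between character counts and byte offsets can be illustrated with a minimal sketch. The sample string and variable names here are illustrative assumptions and are not taken from Tie::File itself; the point is only that a decoded string is shorter in characters than its UTF-8 encoding is in bytes, and that a byte offset derived from a character count can split a multi-byte sequence.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative sample: the Greek letters tau and iota, a space, then "abc".
my $text  = "\x{3C4}\x{3B9} abc";
my $bytes = encode('UTF-8', $text);

my $char_len = length $text;     # counts characters
my $byte_len = length $bytes;    # counts bytes; each Greek letter takes two

# Cut the byte string at an offset that falls inside the second letter.
my $chunk = substr $bytes, 0, 3;

# With FB_CROAK, decode dies on malformed input instead of substituting.
my $decoded_ok = eval { decode('UTF-8', $chunk, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "characters: %d, bytes: %d, truncated chunk decodes: %s\n",
    $char_len, $byte_len, $decoded_ok ? 'yes' : 'no';
```

The lead bytes of the Greek and Hebrew letters in the question's data are the same \xCE, \xCF and \xD7 values that appear in the warnings, which is consistent with reads that start or stop partway through a character.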

In general, I stay away from binmode when relying on an external module to read from or write to a file handle. In this case I would have a simple sub calling Encode::encode('UTF-8', ...) on the data before pushing it onto @tied.

The exception is where the module's documentation clearly states the behaviour for decoded data, or where the source is simple enough for me to verify the behaviour.
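As a rough sketch of option 1, the following ties the file without any discipline option and converts between characters and bytes at the boundary. The helper names to_bytes and from_bytes, the sample lines, and the use of a temporary file are assumptions made for illustration; they are not part of Tie::File or of the original program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempfile);
use Tie::File;

# Hypothetical helpers: encode on the way in, decode on the way out.
sub to_bytes   { encode('UTF-8', $_[0]) }
sub from_bytes { decode('UTF-8', $_[0]) }

# A throwaway file so the sketch does not touch real data.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No discipline option: Tie::File only ever sees bytes.
tie my @tied, 'Tie::File', $path, recsep => "\x0D\x0A"
    or die "tie failed: $!";

# Sample decoded lines (Greek and Hebrew), written with escapes so the
# sketch does not depend on the encoding of the source file.
my @lines = (
    "\x{3C4}\x{3B9} \x{3BA}\x{3AC}\x{3BD}\x{3B5}\x{3C4}\x{3B5};",
    "\x{5E9}\x{5DC}\x{5D5}\x{5DD} abc",
);

push @tied, to_bytes($_) for @lines;

# Reading back yields bytes, so decode before doing any text processing.
my @round_trip = map { from_bytes($_) } @tied;

untie @tied;
```

Any processing that needs real characters (length, regexes, case folding) would then operate on @round_trip, while all offsets that Tie::File computes stay in bytes.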
