Trying to improve Encode::decode warning message: Segfault in $SIG{__WARN__} handler

Problem description

I am trying to improve the warning message issued by Encode::decode(). Instead of printing the name of the module and the line number in the module, I would like it to print the name of the file being read and the line number in that file where the malformed data was found. To a developer, the original message can be useful, but to an end user not familiar with Perl, it is probably quite meaningless. The end user would probably rather know which file is causing the problem.

I first tried to solve this using a $SIG{__WARN__} handler (which is probably not a good idea), but I get a segfault. Probably a silly mistake, but I could not figure it out:

#! /usr/bin/env perl

use feature qw(say);
use strict;
use warnings;

use Encode ();

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $fn = 'test.txt';
write_test_file( $fn );

# Try to improve the Encode::FB_WARN fallback warning message :
#
#   utf8 "\xE5" does not map to Unicode at <module_name> line xx
#
# Rather we would like the warning to print the filename and the line number:
#
#   utf8 "\xE5" does not map to Unicode at line xx of file <filename>.

my $str = '';
open ( my $fh, "<:encoding(utf-8)", $fn ) or die "Could not open file '$fn': $!";
{
    local $SIG{__WARN__} = sub { my_warn_handler( $fn, $_[0] ) }; 
    $str = do { local $/; <$fh> };
}
close $fh;
say "Read string: '$str'";

sub my_warn_handler {
    my ( $fn, $msg ) = @_;

    if ( $msg =~ /\Qdoes not map to Unicode\E/ ) {
        recover_line_number_and_char_pos( $fn, $msg );
    }
    else {
        warn $msg;
    }
}

sub recover_line_number_and_char_pos {
    my ( $fn, $err_msg ) = @_;

    chomp $err_msg;
    $err_msg =~ s/(line \d+)\.$/$1/;  # Remove period at end of sentence.
    open ( $fh, "<:raw", $fn ) or die "Could not open file '$fn': $!";
    my $raw_data = do { local $/; <$fh> };
    close $fh;
    my $str = Encode::decode( 'utf-8', $raw_data, Encode::FB_QUIET );
    my ($header, $last_line) = $str =~ /^(.*\n)([^\n]*)$/s; 
    my $line_no = $str =~ tr/\n//;
    ++$line_no;
    my $pos = ( length $last_line ) + 1;
    warn "$err_msg, in file '$fn' (line: $line_no, pos: $pos)\n";
}

sub write_test_file {
    my ( $fn ) = @_;

    my $bytes = "Hello\nA\x{E5}\x{61}";  # 2 lines ending in iso 8859-1: åa
    open ( my $fh, '>:raw', $fn ) or die "Could not open file '$fn': $!";
    print $fh $bytes;
    close $fh;
}

Output:

utf8 "\xE5" does not map to Unicode at ./p.pl line 27
, in file 'test.txt' (line: 2, pos: 2)
Segmentation fault (core dumped)

Solution

Here is another way to locate where the warning fires, using unbuffered sysread:

use warnings;
use strict;

binmode STDOUT, ':utf8';
binmode STDERR, ':utf8';

my $file = 'test.txt';
open my $fh, "<:encoding(UTF-8)", $file or die "Can't open $file: $!";

$SIG{__WARN__} = sub { print "\t==> WARN: @_" };

my $char_cnt = 0;    
my $char;

while (sysread($fh, $char, 1)) {
    ++$char_cnt;
    print "$char ($char_cnt)\n";
}

The file test.txt was written by the posted program, except that I had to append data to it to reproduce the behavior: as posted, it runs without warnings on v5.10 and v5.16. I added \x{234234} to the end. The line number can be tracked by testing $char =~ /\n/.

sysread returns undef on error. The call can be moved into the body of a while (1) loop so that reading continues past errors and all warnings are caught, breaking out of the loop when it returns 0 (which happens at EOF).

This prints:

H (1)
e (2)
l (3)
l (4)
o (5)

 (6)
A (7)
å (8)
a (9)
        ==> WARN: Code point 0x234234 is not Unicode, may not be portable at ...
 (10)
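
The while (1) variant mentioned above, together with the newline-based line tracking, might look roughly like the sketch below. It is only an illustration under stated assumptions: it reads the question's test bytes from a temporary file through a :raw handle (so it counts bytes, not characters, and does not trigger decoding warnings), and the variable names and the error cap are hypothetical choices, not part of the posted code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed setup: write the same bytes as the question's write_test_file().
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "Hello\nA\xE5\x61";
close $out or die "Cannot close '$path': $!";

# A :raw handle is used here so the sketch also runs on Perl versions
# that reject sysread on handles with an encoding layer.
open my $in, '<:raw', $path or die "Cannot open '$path': $!";

my ($byte_cnt, $line_no, $errors) = (0, 1, 0);
while (1) {
    my $got = sysread($in, my $byte, 1);
    if (!defined $got) {            # undef signals a read error
        warn "read error after byte $byte_cnt: $!\n";
        last if ++$errors >= 3;     # hypothetical cap to avoid looping forever
        next;                       # otherwise keep reading
    }
    last if $got == 0;              # 0 signals end of file
    ++$byte_cnt;
    ++$line_no if $byte eq "\n";    # track the current line number
}
close $in;

print "read $byte_cnt bytes, ended on line $line_no\n";
```

For the question's nine-byte test file this prints "read 9 bytes, ended on line 2".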

While this does catch the character warned about, re-reading the file using Encode may well be better than reaching for sysread, in particular if sysread uses Encode.

However, Perl is utf8 internally and I am not sure that sysread needs Encode.
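
Re-reading the data with Encode, as suggested above, could be sketched along the following lines. This mirrors the Encode::decode call with Encode::FB_QUIET from the question's own code; the helper name first_malformed_position is hypothetical, and the sketch assumes the file's raw bytes have already been read through a separate :raw handle.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns (line, position) of the first byte that
# does not decode as UTF-8, or nothing if the whole string decodes.
sub first_malformed_position {
    my ($bytes) = @_;    # a copy, because decode with a CHECK value modifies its argument

    # With FB_QUIET, decode returns the part decoded so far and leaves
    # the unprocessed remainder in $bytes.
    my $good = Encode::decode('UTF-8', $bytes, Encode::FB_QUIET);
    return if !length $bytes;

    my $line = 1 + ($good =~ tr/\n//);             # newlines before the bad byte
    my $pos  = length($good) - rindex($good, "\n"); # 1-based position in that line
    return ($line, $pos);
}

# The bytes written by the question's write_test_file().
my ($line, $pos) = first_malformed_position("Hello\nA\xE5\x61");
print "line $line, pos $pos\n";
```

On the question's test bytes this prints "line 2, pos 2", which agrees with the position reported in the question's output.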

Note: the documentation page for sysread supports its use on data with encoding layers:

Note that if the filehandle has been marked as :utf8 Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters). The :encoding(...) layer implicitly introduces the :utf8 layer. See binmode, open, and the open pragma.


Note: Apparently things have moved on, and after a certain version sysread no longer supports encoding layers. The documentation linked above does show the quoted text for an older version (v5.10, for one), but for a newer version it says that there will be an exception.
