perl: Uncaught exception: malformed UTF-8 character in JSON string


Question

Related to this question and this answer (to another question) I am still unable to process UTF-8 with JSON.

I have tried to make sure all the required voodoo is invoked based on recommendations from the very best experts, and as far as I can see the string is as valid, marked and labelled as UTF-8 as possible. But still perl dies with either

Uncaught exception: malformed UTF-8 character in JSON string

or

Uncaught exception: Wide character in subroutine entry

What am I doing wrong?

(hlovdal) localhost:/work/2011/perl_unicode>cat json_malformed_utf8.pl 
#!/usr/bin/perl -w -CSAD

### BEGIN ###
# Apparently the very best perl unicode boiler template code that exist,
# https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129
# Slightly modified.

use v5.12; # minimal for unicode string feature
#use v5.14; # optimal for unicode string feature

use utf8;                                                 # Declare that this source unit is encoded as UTF‑8. Although
                                                          # once upon a time this pragma did other things, it now serves
                                                          # this one singular purpose alone and no other.
use strict;
use autodie;

use warnings;                                             # Enable warnings, since the previous declaration only enables
use warnings    qw< FATAL  utf8     >;                    # strictures and features, not warnings. I also suggest
                                                          # promoting Unicode warnings into exceptions, so use both
                                                          # these lines, not just one of them. 

use open        qw( :encoding(UTF-8) :std );              # Declare that anything that opens a filehandles within this
                                                          # lexical scope but not elsewhere is to assume that that
                                                          # stream is encoded in UTF‑8 unless you tell it otherwise.
                                                          # That way you do not affect other module’s or other program’s code.

use charnames   qw< :full >;                              # Enable named characters via \N{CHARNAME}.
use feature     qw< unicode_strings >;

use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) { 
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$| = 1;

binmode(DATA, ":encoding(UTF-8)");                        # If you have a DATA handle, you must explicitly set its encoding.

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stackdumped exceptions
#   *unless* we're in an try block, in which 
#   case just generate a clucking stackdump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" } 
    else     { confess "Deadly warning: @_"  }
};

### END ###


use JSON;
use Encode;

use Getopt::Long;
use Encode;

my $use_nfd = 0;
my $use_water = 0;
GetOptions("nfd" => \$use_nfd, "water" => \$use_water );

print "JSON->backend->is_pp = ", JSON->backend->is_pp, ", JSON->backend->is_xs = ", JSON->backend->is_xs, "\n";

sub check {
        my $text = shift;
        return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". ";
}

my $json_text = "{ \"my_test\" : \"hei på deg\" }\n";
if ($use_water) {
        $json_text = "{ \"water\" : \"水\" }\n";
}
if ($use_nfd) {
        $json_text = NFD($json_text);
}

print check($json_text), "\$json_text = $json_text";

# test from perluniintro(1)
if (eval { decode_utf8($json_text, Encode::FB_CROAK); 1 }) {
        print "string is valid utf8\n";
} else {
        print "string is not valid utf8\n";
}

my $hash_ref1 = JSON->new->utf8->decode($json_text);
my $hash_ref2 = decode_json( $json_text );

__END__

Running this gives

(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei på deg" }
string is valid utf8
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl | ./uniquote 
Uncaught exception: malformed UTF-8 character in JSON string, at character offset 20 (before "\x{5824}eg" }\n") at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('malformed UTF-8 character in JSON string, at character offset...') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei p\N{U+E5} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "my_test" : "hei pa\N{U+30A} deg" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water 
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "水" }
string is valid utf8
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>./json_malformed_utf8.pl -water --nfd | ./uniquote 
Uncaught exception: Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.
 at ./json_malformed_utf8.pl line 46
        main::__ANON__('Wide character in subroutine entry at ./json_malformed_utf8.pl line 96.\x{a}') called at ./json_malformed_utf8.pl line 96
JSON->backend->is_pp = 0, JSON->backend->is_xs = 1
is_utf8(): 1, is_utf8(1): 1. $json_text = { "water" : "\N{U+6C34}" }
string is valid utf8
(hlovdal) localhost:/work/2011/perl_unicode>rpm -q perl perl-JSON perl-JSON-XS
perl-5.12.4-159.fc15.x86_64
perl-JSON-2.51-1.fc15.noarch
perl-JSON-XS-2.30-2.fc15.x86_64
(hlovdal) localhost:/work/2011/perl_unicode>

uniquote is from http://training.perl.com/scripts/uniquote

Update:

Thanks to brian for highlighting the solution. Updating the source to use json_text for all normal strings and json_bytes for what is passed to JSON, as in the following, now works as expected:

my $json_bytes = encode('UTF-8', $json_text);
my $hash_ref1 = JSON->new->utf8->decode($json_bytes);

I must say that I think the documentation for the JSON module is extremely unclear and partially misleading.

The phrase "text" (at least to me) implies a string of characters. So when reading $perl_scalar = decode_json $json_text I have an expectation of json_text being a UTF-8 encoded string of characters. Thoroughly re-reading the documentation, knowing what to look for, I now see it says: "decode_json ... expects an UTF-8 (binary) string and tries to parse that as an UTF-8 encoded JSON text", however that still is not clear in my opinion.

From my background using a language that has some additional non-ASCII characters, I remember the days when you had to guess the code page being used, when email would cripple text by stripping off the 8th bit, and so on. Back then, "binary" in the context of strings meant a string containing characters outside the 7-bit ASCII domain. But what is "binary" really? Aren't all strings binary at the core level?
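One way to make the character/byte distinction concrete is the minimal sketch below. It is an illustration only, not taken from the JSON documentation. It assumes the core Encode module, and the variable names are made up for the example. The same text is held once as a character string and once as its UTF-8 encoded byte string, and the two have different lengths.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is UTF-8 in the source file
use Encode qw(encode decode);

# A character string: each element is one Unicode code point.
my $chars = "hei på deg";

# A byte string: the same text serialized as UTF-8 octets.
my $bytes = encode('UTF-8', $chars);

# "å" is one character but two bytes in UTF-8, so the lengths differ.
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);

# Decoding the bytes recovers the original character string.
my $round_trip = decode('UTF-8', $bytes);
print "round trip ok\n" if $round_trip eq $chars;
```

This prints "characters: 10, bytes: 11". In this sense a "binary" string is one whose elements are octets produced by an encoding step, rather than code points.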

The documentation also says "simple and fast interfaces (expect/generate UTF-8)" and "correct unicode handling" (the first point under "Features"), both without mentioning anywhere nearby that it wants a byte sequence rather than a character string. I will ask the author to at least make this clearer.

Recommended answer

From reading the JSON docs, I think those functions don't want a character string, but that's what you're trying to give it. Instead, they want a "UTF-8 binary string". That seems odd to me, but I'm guessing that it's mostly to take input directly from an HTTP message instead of something that you type directly in your program. This works because I make a byte string that's the UTF-8 encoded version of your string:

use v5.14;

use utf8;                                                 
use warnings;                                             
use feature     qw< unicode_strings >;

use Data::Dumper;
use Devel::Peek;
use JSON;

my $filename = 'hei.txt';
my $char_string = qq( { "my_test" : "hei på deg" } );
open my $fh, '>:encoding(UTF-8)', $filename;
print $fh $char_string;
close $fh;


{
say '=' x 70;
my $byte_string = qq( { "my_test" : "hei p\303\245 deg" } );
print "Byte string peek:------\n"; Dump( $byte_string );
decode( $byte_string );
}


{
say '=' x 70;
my $raw_string = do { 
    open my $fh, '<:raw', $filename;
    local $/; <$fh>;
    };
print "raw string peek:------\n"; Dump( $raw_string );

decode( $raw_string );
}

{
say '=' x 70;
my $char_string = do { 
    open my $fh, '<:encoding(UTF-8)', $filename;
    local $/; <$fh>;
    };
print "char string peek:------\n"; Dump( $char_string );

decode( $char_string );
}

sub decode {
    my $string = shift;

    my $hash_ref2 = eval { decode_json( $string ) };
    say "Error in sub form: $@" if $@;
    print Dumper( $hash_ref2 );

    my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
    say "Error in method form: $@" if $@;
    print Dumper( $hash_ref1 );
    }

The output shows that the character string doesn't work, but the byte string versions do:

======================================================================
Byte string peek:------
SV = PV(0x100801190) at 0x10089d690
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100209890 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
  CUR = 31
  LEN = 32
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
======================================================================
raw string peek:------
SV = PV(0x100839240) at 0x10089d780
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x100212260 " { \"my_test\" : \"hei p\303\245 deg\" } "\0
  CUR = 31
  LEN = 32
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
$VAR1 = {
          'my_test' => "hei p\x{e5} deg"
        };
======================================================================
char string peek:------
SV = PV(0x10088f3b0) at 0x10089d840
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002017b0 " { \"my_test\" : \"hei p\303\245 deg\" } "\0 [UTF8 " { "my_test" : "hei p\x{e5} deg" } "]
  CUR = 31
  LEN = 32
Error in sub form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 51.

$VAR1 = undef;
Error in method form: malformed UTF-8 character in JSON string, at character offset 21 (before "\x{5824}eg" } ") at utf-8.pl line 55.

$VAR1 = undef;
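As a hedged aside on why the question shows two different errors: as far as I can tell, a decoder with utf8 enabled first tries to turn a flagged character string back into bytes. "å" (U+00E5) fits in one byte, which is then not valid UTF-8 on its own. "水" (U+6C34) does not fit in one byte at all, hence "Wide character". The sketch below assumes JSON::PP, the pure-Perl backend that ships with perl 5.14 and later, rather than the JSON::XS backend used in the question, so the exact wording of the messages may differ.

```perl
use strict;
use warnings;
use utf8;
use JSON::PP;   # assumption: core pure-Perl backend, not the XS one from the question

my $coder = JSON::PP->new->utf8;   # utf8 enabled: decode() expects UTF-8 bytes

# Two character strings (not bytes), one per failure mode.
my $narrow = qq({ "my_test" : "hei på deg" });   # highest code point is U+00E5
my $wide   = qq({ "water" : "水" });              # contains U+6C34

my %error;
for my $case ( [ narrow => $narrow ], [ wide => $wide ] ) {
    my ($name, $text) = @$case;

    # Trap the exception instead of letting it kill the program.
    eval { $coder->decode($text); 1 } or $error{$name} = $@;

    # Replace non-ASCII and control characters so the message prints cleanly.
    (my $msg = $error{$name} // 'no error') =~ s/[^\x20-\x7e]/?/g;
    print "$name: $msg\n";
}
```

Both cases are expected to fail, because neither string has been encoded to UTF-8 bytes before being passed to a decoder that has utf8 enabled.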

So, if you take your character string, which you typed directly into your program, and convert it to a UTF-8 encoded byte string, it works:

use v5.14;

use utf8;                                                 
use warnings;                                             
use feature     qw< unicode_strings >;

use Data::Dumper;
use Encode qw(encode_utf8);
use JSON;

my $char_string = qq( { "my_test" : "hei på deg" } );

my $string = encode_utf8( $char_string );

decode( $string );

sub decode {
    my $string = shift;

    my $hash_ref2 = eval { decode_json( $string ) };
    say "Error in sub form: $@" if $@;
    print Dumper( $hash_ref2 );

    my $hash_ref1 = eval { JSON->new->utf8->decode( $string ) };
    say "Error in method form: $@" if $@;
    print Dumper( $hash_ref1 );
    }

I think JSON should be smart enough to deal with this so you don't have to think at this level, but that's the way it is (so far).
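There is also a route that skips the encoding step. In the object-oriented interface the utf8 option is off by default, and with it off decode() takes a character string rather than bytes. The sketch below is a minimal illustration under that assumption. It uses JSON::PP (core since perl 5.14) so that it needs no non-core modules; the JSON module used above documents the same option.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use JSON::PP;   # assumption: core pure-Perl backend with the same utf8 option

my $char_string = qq( { "my_test" : "hei på deg" } );

# utf8 disabled (the default for ->new): decode() takes a character string.
my $from_chars = JSON::PP->new->decode($char_string);

# utf8 enabled: decode() takes UTF-8 encoded bytes, so encode first.
my $from_bytes = JSON::PP->new->utf8->decode( encode('UTF-8', $char_string) );

# Both routes should yield the same decoded character string.
print "same result\n" if $from_chars->{my_test} eq $from_bytes->{my_test};
```

Which form fits depends on where the data comes from: bytes read raw from a socket or file suit the utf8 form, while strings that are already decoded (source literals under use utf8, or filehandles with an :encoding layer) suit the plain form.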
