Perl + Curses:期望从getchar()获得UTF-8编码的多字节字符,但没有得到任何 [英] Perl + Curses: Expecting a UTF-8 encoded multibyte character from getchar(), but not getting any

查看:91
本文介绍了Perl + Curses:期望从getchar()获得UTF-8编码的多字节字符,但没有得到任何的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试Bryan Henderson的ncurses库的Perl接口: Curses

I am trying out Bryan Henderson's Perl interface to the ncurses library: Curses

为简单练习,我尝试获取在屏幕上键入的单个字符.它直接基于 NCURSES编程指南,并进行了改编

For a simple exercise, I try to obtain single characters typed on-screen. This is directly based off the NCURSES Programming HOWTO, with adaptations.

当我调用Perl库的getchar()时,我希望收到一个字符,可能是多字节(如库联机帮助页的这一部分,因为必须处理功能键的特殊情况,而无需输入,但这只是通常的问题.

When I call the Perl library's getchar(), I expect to receive a character, possibly multibyte (It's a bit more complicated as explained in this part of the library manpage because one has to handle the special cases of function keys and no input, but that's just the usual curlicues).

这是下面代码中的子例程read1ch().

It's the subroutine read1ch() in the code below.

这对于ASCII字符有效,但对于0x7F以上的字符无效.例如,当点击è(Unicode 0x00E8, UTF-8 时: 0xC3、0xA8),实际上是获得代码0xE8而不是UTF-8编码的代码.将其打印到LANG=en_GB.UTF-8不能正常工作的终端上,无论如何我都期待0xC3A8.

This works well for ASCII characters, but doesn't work for characters above 0x7F. For example, when hitting è (Unicode 0x00E8, UTF-8: 0xC3, 0xA8), I actually obtain code 0xE8 instead of something UTF-8 encoded. Printing it out to the terminal for which LANG=en_GB.UTF-8 is not working and anyway I was expecting 0xC3A8.

要使其正常工作,我需要做哪些更改,即获取è作为适当的字符还是Perl字符串?

What do I need to change to make it work, i.e. get the è either as a proper character or a Perl string?

getchar()的C代码是此处 btw .也许只是没有使用C_GET_WCH设置进行编译?怎么找出来?

The C code snipped for getchar() is here btw. Maybe it just didn't get compiled with C_GET_WCH set? How to find out?

尝试通过设置 binmode

binmode STDERR, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

这应该解决所有编码问题,因为终端需要并发送UTF-8,但这无济于事.

which should fix any encoding issues because the terminal expects and sends UTF-8, but that didn't help.

还尝试使用使用open 设置流编码(不太确定两者之间的区别这个和上面的方法),但这也无济于事

Also tried setting the stream encoding with use open (not quite sure about the difference between this and the approach above), but that didn't help either

use open qw(:std :encoding(UTF-8));

附录2

Perl Curses垫片的手册页说:

Addendum 2

The manpage for the Perl Curses shim says:

如果wget_wch()不可用(即Curses库不可用) 了解宽字符),这会调用wgetch() [获取1个字节的字符 从curses窗口],但返回 尽管如此,上述值.这可能是一个问题,因为使用 多字节字符编码(如UTF-8),您将收到两个 两字节字符的一个字符字符串(例如,Ã"和¤"表示 一种").

If wget_wch() is not available (i.e. The Curses library does not understand wide characters), this calls wgetch() [get a 1-byte char from a curses window], but returns the values described above nonetheless. This can be a problem because with a multibyte character encoding like UTF-8, you will receive two one-character strings for a two-byte-character (e.g. "Ã" and "¤" for "ä").

在这里可能是这种情况,但是wget_wch()在此系统上确实存在.

This may be the case here, but wget_wch() does exist on this system.

试图查看C代码的作用,并直接将fprintf添加到curses/Curses-1.36/CursesFunWide.c的多字节处理代码中,重新编译后,没有设法通过LD_LIBRARY_PATH用我自己的系统覆盖Curses.so(为什么为什么呢?为什么一切都只能在一半时间内工作?),因此直接替换了系统库(使用THAT!).

Tried to see what the C code does and added an fprintf directly into the multibyte handling code of curses/Curses-1.36/CursesFunWide.c, recompiled, didn't manage to override the system Curses.so with my own via LD_LIBRARY_PATH (why not? why is everything only working half of the time?), so replaced the system library directly in place (take THAT!).

#ifdef C_GET_WCH
    wint_t wch;
    int ret = wget_wch(win, &wch);
    if (ret == OK) {
        ST(0) = sv_newmortal();
        fprintf(stderr,"Obtained win_t 0x%04lx\n", wch);
        c_wchar2sv(ST(0), wch);
        XSRETURN(1);
    } else if (ret == KEY_CODE_YES) {
        XST_mUNDEF(0);
        ST(1) = sv_newmortal();
        sv_setiv(ST(1), (IV)wch);
        XSRETURN(2);
    } else {
        XSRETURN_UNDEF;
    }
#else

那只是一个胖的NOPE,当按下ü时,会看到:

That's just a fat NOPE, when pressing ü one sees:

Obtained win_t 0x00fc

因此,可以运行正确的代码,但是数据是 ISO-8859-1 ,而不是UTF-8.因此,wget_wch的行为不佳.因此,这是一个curses配置问题.嗯.

So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.

令我惊讶的是,也许ncurses假定使用默认语言环境,即C.为了使ncurses使用宽字符,必须初始化语言环境",这可能意味着将状态从未设置"(从而使ncurses退回到C)移动到设置为系统上的内容指示"(应该是LANG环境变量中的内容). ncurses的手册页说:

It struck me that maybe ncurses was assuming default locale, i.e. C. To make it ncurses work with wide characters, one has to "initialize the locale", which probably means moving state from "unset" (and thus making ncurses fall back to C) to "set to what the system indicates" (which should be what is in the LANG environment variable). The man page for ncurses says:

该库使用调用程序已初始化的语言环境. 通常是通过setlocale完成的:

The library uses the locale which the calling program has initialized. That is normally done with setlocale:

setlocale(LC_ALL,");

setlocale(LC_ALL, "");

如果未初始化语言环境,则库将假定字符 可以按照ISO-8859-1进行打印,以与某些旧版程序配合使用. 您应该初始化语言环境,而不要依赖于 尚未设置语言环境的库.

If the locale is not initialized, the library assumes that characters are printable as in ISO-8859-1, to work with certain legacy programs. You should initialize the locale and not rely on specific details of the library when the locale has not been setup.

这也不起作用,但我认为解决方案就在这条路上.

This didn't work either, but I feel that the solution is down that road.

来自win_t(显然与wchar_t相同)转换代码> CursesWide.c ,将从wget_wch()接收到的wint_t(在此处显示为wchar_t)转换为Perl字符串. SV是标量值"类型.

The win_t (apparently the same as wchar_t) conversion code from CursesWide.c, converts the wint_t (here seen as wchar_t) received from wget_wch() into a Perl string. SV is the "scalar value" type.

另请参见: https://perldoc.perl.org/perlguts.html

在这里插入了两个fprintf,以查看发生了什么情况:

Here with two fprintf inserted to see what is going on:

static void
c_wchar2sv(SV *    const sv,
           wchar_t const wc) {
/*----------------------------------------------------------------------------
  Set SV to a one-character (not -byte!) Perl string holding a given wide
  character
-----------------------------------------------------------------------------*/
    if (wc <= 0xff) {
        char s[] = { wc, 0 };
        fprintf(stderr,"Not UTF-8 string: %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_off(sv);
    } else {
        char s[UTF8_MAXBYTES + 1] = { 0 };
        char *s_end = (char *)UVCHR_TO_UTF8((U8 *)s, wc);
        *s_end = 0;
        fprintf(stderr,"UTF-8 string: %02x %02x %02x\n", ((int)s[0])&0xFF, ((int)s[1])&0xFF, ((int)s[2])&0xFF);
        sv_setpv(sv, s);
        SvPOK_on(sv);
        SvUTF8_on(sv);
    }
}

使用perl-Curses测试代码

  • 尝试过perl-Curses-1.36-9.fc30.x86_64
  • 尝试过perl-Curses-1.36-11.fc31.x86_64
  • Test code using perl-Curses

    • Tried with perl-Curses-1.36-9.fc30.x86_64
    • Tried with perl-Curses-1.36-11.fc31.x86_64
    • 如果尝试这样做,请按BACKSPACE退出循环,因为不再解释CTRL-C.

      下面有很多代码,但关键区域标有----- Testing:

      A lot of code below, but the critical area is marked with ----- Testing:

      #!/usr/bin/perl
      
      # pmap -p PID
      # shows the per process using 
      # /usr/lib64/libncursesw.so.6.1
      # /usr/lib64/perl5/vendor_perl/auto/Curses/Curses.so
      
      # Trying https://metacpan.org/release/Curses
      
      use warnings;
      use strict;
      use utf8;          # Meaning "This lexical scope (i.e. file) contains utf8"
      
      use Curses;        # On Fedora: dnf install perl-Curses
      
      # This didn't fix it 
      # https://perldoc.perl.org/open.html
      
      use open qw(:std :encoding(UTF-8));
      
      # https://perldoc.perl.org/perllocale.html#The-setlocale-function
      
      use POSIX ();
      my $loc = POSIX::setlocale(&POSIX::LC_ALL, "");
      
      # ---
      # Surrounds the actual program
      # ---
      
      sub setup() {
         initscr();
         raw();
         keypad(1);
         noecho();
      }
      
      sub teardown {
         endwin();
      }
      
      # ---
      # Mainly for prettyprinting
      # ---
      
      my $special_keys = setup_special_keys();
      
      # ---
      # Error printing
      # ---
      
      sub mt {
         return sprintf("%i: ",time());
      }
      
      sub ae {
         my ($x,$fname) = @_;
         if ($x == ERR) { 
            printw mt();
            printw "Got error code from '$fname': $x\n"
         }
      }
      
      # ---
      # Where the action is
      # ---
      
      sub announce {
         my $res = printw "Type any character to see it in bold! (or backspace to exit)\n";
         ae($res, "printw");
         return { refresh => 1 }
      }
      
      sub read1ch {
         # Read a next character, waiting until it is there.
         # Use the wide-character aware functions unless you want to deal with
         # collating individual bytes yourself!
         # Readings:
         # https://metacpan.org/pod/Curses#Wide-Character-Aware-Functions
         # https://perldoc.perl.org/perlunicode.html#Unicode-Character-Properties
         # https://www.ahinea.com/en/tech/perl-unicode-struggle.html
         # https://hexdump.wordpress.com/2009/06/19/character-encoding-issues-part-ii-perl/
         my ($ch, $key) = getchar();
         if (defined $key) {
            # it's a function key
            printw "Function key pressed: $key"; 
            printw " with known alias '" . $$special_keys{$key} . "'" if (exists $$special_keys{$key});
            printw "\n";
            # done if backspace was hit
            return { done => ($key == KEY_BACKSPACE()) }
         }
         elsif (defined $ch) {
            # "$ch" should be a String of 1 character
      
            # ----- Testing
      
            printw "Locale: $loc\n";
            printw "Multibyte output test: öüäéèà периоду\n";
            printw sprintf("Received string '%s' of length %i with ordinal 0x%x\n", $ch, length($ch), ord($ch));
      
            {
               # https://perldoc.perl.org/bytes.html
               use bytes;
               printw sprintf("... length is %i\n"     , length($ch));
               printw sprintf("... contents are %vd\n" , $ch);
            }
      
            # ----- Testing
      
            return { ch => $ch }
         }
         else {
            # it's an error
            printw "getchar() failed\n";
            return {}
         }
      }
      
      sub feedback {
         my ($ch) = @_;
         printw "The pressed key is: ";
         attron(A_BOLD);
         printw("%s\n","$ch"); # do not print $txt directly to make sure escape sequences are not interpreted!
         attroff(A_BOLD);
         return { refresh => 1 }  # should refresh
      }
      
      sub do_curses_run {
      
         setup;
      
         my $done = 0;
         while (!$done) {
            my $bubl;
            $bubl = announce(); 
            refresh() if $$bubl{refresh};
            $bubl = read1ch();
            $done = $$bubl{done};
            if (defined $$bubl{ch}) {
               $bubl = feedback($$bubl{ch}); 
               refresh() if $$bubl{refresh};
            }
         }
      
         teardown;
      }
      
      # ---
      # main
      # ---
      
      do_curses_run();
      
      
      sub setup_special_keys {
         # the key codes on the left must be called once to resolve to a numeric constant!
         my $res = {
            KEY_BREAK()       => "Break key",
            KEY_DOWN()        => "Arrow down",
            KEY_UP()          => "Arrow up",
            KEY_LEFT()        => "Arrow left",
            KEY_RIGHT()       => "Arrow right",
            KEY_HOME()        => "Home key",
            KEY_BACKSPACE()   => "Backspace",
            KEY_DL()          => "Delete line",
            KEY_IL()          => "Insert line",
            KEY_DC()          => "Delete character",
            KEY_IC()          => "Insert char or enter insert mode",
            KEY_EIC()         => "Exit insert char mode",
            KEY_CLEAR()       => "Clear screen",
            KEY_EOS()         => "Clear to end of screen",
            KEY_EOL()         => "Clear to end of line",
            KEY_SF()          => "Scroll 1 line forward",
            KEY_SR()          => "Scroll 1 line backward (reverse)",
            KEY_NPAGE()       => "Next page",
            KEY_PPAGE()       => "Previous page",
            KEY_STAB()        => "Set tab",
            KEY_CTAB()        => "Clear tab",
            KEY_CATAB()       => "Clear all tabs",
            KEY_ENTER()       => "Enter or send",
            KEY_SRESET()      => "Soft (partial) reset",
            KEY_RESET()       => "Reset or hard reset",
            KEY_PRINT()       => "Print or copy",
            KEY_LL()          => "Home down or bottom (lower left)",
            KEY_A1()          => "Upper left of keypad",
            KEY_A3()          => "Upper right of keypad",
            KEY_B2()          => "Center of keypad",
            KEY_C1()          => "Lower left of keypad",
            KEY_C3 ()         => "Lower right of keypad",
            KEY_BTAB()        => "Back tab key",
            KEY_BEG()         => "Beg(inning) key",
            KEY_CANCEL()      => "Cancel key",
            KEY_CLOSE()       => "Close key",
            KEY_COMMAND()     => "Cmd (command) key",
            KEY_COPY()        => "Copy key",
            KEY_CREATE()      => "Create key",
            KEY_END()         => "End key",
            KEY_EXIT()        => "Exit key",
            KEY_FIND()        => "Find key",
            KEY_HELP()        => "Help key",
            KEY_MARK()        => "Mark key",
            KEY_MESSAGE()     => "Message key",
            KEY_MOUSE()       => "Mouse event read",
            KEY_MOVE()        => "Move key",
            KEY_NEXT()        => "Next object key",
            KEY_OPEN()        => "Open key",
            KEY_OPTIONS()     => "Options key",
            KEY_PREVIOUS()    => "Previous object key",
            KEY_REDO()        => "Redo key",
            KEY_REFERENCE()   => "Ref(erence) key",
            KEY_REFRESH()     => "Refresh key",
            KEY_REPLACE()     => "Replace key",
            KEY_RESIZE()      => "Screen resized",
            KEY_RESTART()     => "Restart key",
            KEY_RESUME()      => "Resume key",
            KEY_SAVE()        => "Save key",
            KEY_SBEG()        => "Shifted beginning key",
            KEY_SCANCEL()     => "Shifted cancel key",
            KEY_SCOMMAND()    => "Shifted command key",
            KEY_SCOPY()       => "Shifted copy key",
            KEY_SCREATE()     => "Shifted create key",
            KEY_SDC()         => "Shifted delete char key",
            KEY_SDL()         => "Shifted delete line key",
            KEY_SELECT()      => "Select key",
            KEY_SEND()        => "Shifted end key",
            KEY_SEOL()        => "Shifted clear line key",
            KEY_SEXIT()       => "Shifted exit key",
            KEY_SFIND()       => "Shifted find key",
            KEY_SHELP()       => "Shifted help key",
            KEY_SHOME()       => "Shifted home key",
            KEY_SIC()         => "Shifted input key",
            KEY_SLEFT()       => "Shifted left arrow key",
            KEY_SMESSAGE()    => "Shifted message key",
            KEY_SMOVE()       => "Shifted move key",
            KEY_SNEXT()       => "Shifted next key",
            KEY_SOPTIONS()    => "Shifted options key",
            KEY_SPREVIOUS()   => "Shifted prev key",
            KEY_SPRINT()      => "Shifted print key",
            KEY_SREDO()       => "Shifted redo key",
            KEY_SREPLACE()    => "Shifted replace key",
            KEY_SRIGHT()      => "Shifted right arrow",
            KEY_SRSUME()      => "Shifted resume key",
            KEY_SSAVE()       => "Shifted save key",
            KEY_SSUSPEND()    => "Shifted suspend key",
            KEY_SUNDO()       => "Shifted undo key",
            KEY_SUSPEND()     => "Suspend key",
            KEY_UNDO()        => "Undo key"
         };
      
         for (my $f = 1; $f <= 64; $f++) {
            $$res{KEY_F($f)} = "KEY_F($f)"   
         }
      
         return $res
      
      }
      

      推荐答案

      实际上看起来是正确的.

      Actually it looks correct.

      使用 strace 运行脚本可以帮助...我这样做是为了查看系统调用:

      Running your script with strace can help... I did this to see the system calls:

      strace -fo strace.out -s 1024 ./foo
      

      ,并且可以看到读取内容,消息等.使用打包库可以对ncurses进行类似的跟踪,尽管打包程序在提供跟踪功能方面并不一致.

      and could see the reads, messages, etc. Getting a similar trace for ncurses could be done using a debug-library, though packagers haven't been consistent about providing one with tracing enabled.

      ü \303\274 (八进制),其Unicode值是 252 (十进制)或 0xfc (十六进制).问题的这一部分似乎错过了这一点:

      ü in UTF-8 is \303\274 (octal), and its Unicode value is 252 (decimal), or 0xfc (hexadecimal). This part of the question seems to have missed that point:

      那只是一个肥腻的NOPE,按ü会看到:

      That's just a fat NOPE, when pressing ü one sees:

      Obtained win_t 0x00fc

      因此将运行正确的代码,但数据为ISO-8859-1,而不是UTF-8.所以是wget_wch表现不佳.因此,这是一个curses配置问题.嗯.

      So the correct code is run, but the data is ISO-8859-1, not UTF-8. So it's wget_wch which behaves badly. So it's a curses config problem. Huh.

      wget_wch 返回(出于实际目的)Unicode值(不是UTF-8字节序列). ISO-8859-1代码160-255碰巧(并非偶然)匹配Unicode代码点,尽管后者在UTF-8中肯定会以不同的方式编码.

      wgetch 将返回UTF-8字节,但Perl脚本仅将其用作后备(因为这将导致Perl脚本将UTF-8字符串转换为Unicode值).

      wgetch would return the UTF-8 bytes, but the Perl script would only use that as a fallback (since that would lead to having the Perl script convert UTF-8 strings to Unicode values).

      这篇关于Perl + Curses:期望从getchar()获得UTF-8编码的多字节字符,但没有得到任何的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆