对SHIFT_JIS编码的日语文件进行Perl文件处理 [英] Perl file processing on SHIFT_JIS encoded Japanese files

查看:158
本文介绍了对SHIFT_JIS编码的日语文件进行Perl文件处理的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组来自Windows的SHIFT_JIS(日语)编码的csv文件,我正试图在运行Perl v5.10.1的Linux服务器上使用正则表达式进行字符串替换.

I have a set of SHIFT_JIS (Japanese) encoded csv file from Windows, which I am trying to process on a Linux server running Perl v5.10.1 using regular expressions to make string replacements.

这是我的要求: 我希望Perl脚本的正则表达式是人类可读的(至少对于日本人而言) IE.像这样: s/北/0/g; 而不是乱七八糟的代码 s/\ x {4eba}/0/g;

Here is my requirement: I want the Perl script’s regular expressions being human readable (at least to a Japanese person) Ie. like this: s/北/0/g; Instead of it littered with some hex codes s/\x{4eba}/0/g;

现在,我正在Windows上的Notepad ++中编辑Perl脚本,并将需要从csv数据文件中搜索的字符串粘贴到Perl脚本中.

Right now, I am editing the Perl script in Notepad++ on Windows, and pasting in the string I need to search for from the csv data file onto the Perl script.

我下面有以下工作测试脚本:

I have the following working test script below:

use strict;
use warnings;
use utf8;

open (IN1,  "<:encoding(shift_jis)", "${work_dir}/tmp00.csv") or die "Error: tmp00.csv\n";
open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/tmp01.csv") or die "Error: tmp01.csv\n";

while (<IN1>)
{
    print $_ . "\n";
    chomp;
    s/北/0/g;
    s/10:00/9:00/g;     
    print OUT1 "$_\n";
}    

close IN1;
close OUT1;

这将成功地将csv文件中的10:00替换为9:00,但是问题是我无法用0替换北(即North),除非顶部也包含utf8.

This would successfully replace the 10:00 with 9:00 in the csv file, but the issue is I was unable to replace北 (ie. North) with 0 unless use utf8 is also included at the top.

问题:

1)在打开的文档中, http://perldoc.perl.org/functions/open.html ,除非隐式使用utf8,否则我不认为它是必需的?

1) In the open documentation, http://perldoc.perl.org/functions/open.html, I didn’t see use utf8 as a requirement, unless it is implicit?

a)如果仅使用utf8,则循环中的第一个print语句会将垃圾字符打印到我的xterm屏幕上.

a) If I had use utf8 only, then the first print statement in the loop would print garbage character to my xterm screen.

b)如果仅使用:encoding(shift_jis)调用open,则循环中的第一个print语句会将日语字符打印到我的xterm屏幕上,但不会发生替换.没有警告未指定使用utf8.

b) If I had called open with :encoding(shift_jis) only, then the first print statement in the loop would print Japanese character to my xterm screen, but the replacement would not happen. There is no warning that use utf8 was not specified.

c)如果同时使用a)和b),则此示例有效.

c) If I used both a) and b), then this example works.

在此Perl脚本中,使用utf8"如何修改用:enoding(shift_jis)调用open的行为?

How does "use utf8" modify the behavior of calling open with :enoding(shift_jis) in this Perl script?

2)我还尝试了在未指定任何编码的情况下打开文件,Perl不会将文件字符串视为原始字节,并且如果我粘贴到脚本中的字符串在其中,则能够以这种方式执行正则表达式匹配与原始数据文件中的文本相同的编码?我能够以这种方式更早地进行文件名替换,而无需指定任何编码(请在此处参考我的相关文章:

2) I also tried to open the file without any encoding specified, wouldn’t Perl treat the file strings as raw bytes, and be able to perform regular expression match that way if the strings I pasted in the script, is in the same encoding as the text in the original data file? I was able to do file name replacement earlier this way without specifying any encoding whatsoever (please refer to my related post here: Perl Japanese to English filename replacement).

谢谢.

更新1

UPDATES 1

在Perl中测试一个简单的本地化示例以日语替换文件名和文件文本

在Windows XP中,从.csv数据文件中复制南字符并将其复制到剪贴板,然后将其用作文件名(即南.txt)和文件内容(南).在Notepad ++中,以UTF-8编码读取文件将显示x93xEC,而在SHIFT_JIS显示南读取文件.

In Windows XP, copy the 南 character from within a .csv data file and copy to the clipboard, then use it as both the file name (ie. 南.txt) and file content (南). In Notepad++ , reading the file under encoding UTF-8 shows x93xEC, reading it under SHIFT_JIS displays南.

脚本:

使用以下Perl脚本south.pl,该脚本将在具有Perl 5.10的Linux服务器上运行

Use the following Perl script south.pl, which will be run on a Linux server with Perl 5.10

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename    could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# pass an array reference of files that will be modified
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

结果:

(BAD)取消注释选项1和3,(注释选项2和4) 设置:Readdir编码,SHIFT_JIS;文件打开编码SHIFT_JIS 结果:文件名替换失败. 错误:utf8"\ x93"在.//south.pl第68行处未映射到Unicode. \ x93

(BAD) Uncomment Option 1 and 3, (Comment Option 2 and 4) Setup: Readdir encoding, SHIFT_JIS; file open encoding SHIFT_JIS Result: file name replacement failed.. Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD)取消注释选项2和4(注释选项1和3) 设置:Readdir编码,utf8;文件打开编码utf8 结果:文件名替换成功,生成了south.txt 但是south1.txt文件内容替换失败,它具有内容\ x93(). 错误:"\ x {fffd}"未映射到.//south.pl第25行的shiftjis. ... -Ao?=(Bx {fffd} .txt

(BAD) Uncomment Option 2 and 4 (Comment Option 1 and 3) Setup: Readdir encoding, utf8; file open encoding utf8 Result: file name replacement worked, south.txt generated But south1.txt file content replacement failed , it has the content \x93 (). Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(良好)取消注释选项2和3,(注释选项1和4) 设置:Readdir编码,utf8;文件打开编码SHIFT_JIS 结果:文件名替换成功,生成了south.txt South1.txt文件的内容替换工作正常,它具有南方的内容.

(GOOD) Uncomment Option 2 and 3, (Comment Option 1 and 4) Setup: Readdir encoding, utf8; file open encoding SHIFT_JIS Result: file name replacement worked, south.txt generated South1.txt file content replacement worked, it has the content south.

结论:

在此示例中,我必须使用其他编码方案才能正常工作. Readdir utf8,文件处理SHIFT_JIS,因为csv文件的内容是SHIFT_JIS编码的.

I had to use different encoding scheme for this example to work properly. Readdir utf8, and file processing SHIFT_JIS, as the content of the csv file was SHIFT_JIS encoded.

推荐答案

一个好的开始是阅读 utf8模块的文档.上面写着:

A good place to start would be to read the documentation for the utf8 module. Which says:

use utf8编译指示告诉Perl解析器允许UTF-8在 当前词法范围内的程序文本(允许EBCDIC上的UTF-EBCDIC 基础平台). no utf8编译指示告诉Perl切换回 将源文本视为当前词汇中的文字字节 范围.

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC based platforms). The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope.

如果您的代码中没有use utf8,则Perl编译器将假定您的源代码采用系统的本机单字节编码.字符北"将毫无意义.添加该编译指示将告诉Perl您的代码包含Unicode字符,并且一切开始正常工作.

If you don't have use utf8 in your code, then the Perl compiler assumes that your source code is in your system's native single-byte encoding. And the character '北' will make little sense. Adding the pragma tells Perl that your code includes Unicode characters and everything starts to work.

这篇关于对SHIFT_JIS编码的日语文件进行Perl文件处理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆