Stream-process UTF-16 file with BOM and Unix line endings in Windows perl
Question
I need to stream-process a 1 GB text file with Perl. The file is encoded in UTF-16 little-endian, has Unix-style line endings (i.e. only 0x000A, with no 0x000D anywhere in the stream), and starts with an LE BOM. The file is processed on Windows (a solution that also works on Unix is needed). By stream-processing I mean using while (<>), reading and writing line by line.
Would be nice to have a command line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt
Hex dump of input for testing (two lines: "a" and "b" letters on each): FF FE 61 00 0A 00 62 00 0A 00
Processing like s/b/c/g should give this output ("b" replaced with "c"): FF FE 61 00 0A 00 63 00 0A 00
PS. So far, every attempt has failed in one of two ways. Either the output gets CRLF line endings (the bytes 0D 0A are written, producing an incorrect Unicode character, whereas I need only 0A 00 with no 0D 00 to keep the Unix style), or the byte order flips on every new line, i.e. the same "a" comes out as 61 00 on odd lines and 00 61 on even lines.
Answer
The best I've come up with is this:
perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt
But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.
The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input, and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.
When you use infile.txt, the filename is passed as a command line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:
use open qw(:std IO :raw:encoding(UTF-16LE));
and have the magic <ARGV> processing apply the right encoding. But I haven't been able to get that to work right in this case.