Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

Question

I need to stream-process, using perl, a 1 GB text file encoded in UTF-16 little-endian, with Unix-style line endings (0x000A only, no 0x000D anywhere in the stream) and an LE BOM at the beginning. The file is processed on Windows (a Unix solution is needed as well). By stream-processing I mean reading and writing line by line with while (<>). It would be nice to have a command-line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt

Hex dump of the test input (two lines, containing the letters "a" and "b" respectively): FF FE 61 00 0A 00 62 00 0A 00
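To reproduce the test input, a small script along these lines could write exactly those bytes and print them back as hex. This is only a sketch: the file name `infile.txt` is taken from the one-liner in the question, and everything else is illustrative.

```perl
use strict;
use warnings;

# File name borrowed from the question's one-liner; any path works.
my $path = 'infile.txt';

# Bytes from the hex dump: BOM, "a", LF, "b", LF, all in UTF-16LE.
my $bytes = pack 'H*', 'fffe61000a0062000a00';

# :raw keeps Windows perl from inserting 0D bytes on output.
open my $out, '>:raw', $path or die "Cannot write $path: $!";
print {$out} $bytes;
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes and print them as hex for comparison.
open my $in, '<:raw', $path or die "Cannot read $path: $!";
my $data = do { local $/; <$in> };
close $in;

print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $data), "\n";
```

The same read-back loop can be pointed at the output file to check that no 0D bytes were added and that the byte order stayed little-endian.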

Processing such as s/b/c/g should produce this output ("b" replaced with "c"): FF FE 61 00 0A 00 63 00 0A 00

P.S. So far, every attempt has failed in one of two ways. Either the output gets CRLF line endings (the bytes 0D 0A are written, which produces an incorrect Unicode character, whereas I need only 0A 00 with no 0D 00 to preserve the Unix style), or the byte order switches between LE and BE on every new line, so that the same "a" is written as 61 00 on odd lines and 00 61 on even lines.

Recommended answer

The best I've come up with is this:

perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt

But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.

The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input, and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.
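To see that an already-open handle can have its layers changed, one could list the layers before and after the binmode call. The sketch below uses the built-in `PerlIO::get_layers`; the helper name `layers_before_and_after` is made up for illustration, and the exact layer names printed depend on the platform and perl build.

```perl
use strict;
use warnings;

# Hypothetical helper: reports the PerlIO layers on an open handle
# before and after applying the layers used in the answer.
sub layers_before_and_after {
    my ($fh) = @_;
    my @before = PerlIO::get_layers($fh);
    my $ok     = binmode $fh, ':raw:encoding(UTF-16LE)';
    my @after  = PerlIO::get_layers($fh);
    return ($ok, \@before, \@after);
}

# STDIN is already open when the script starts, as with <infile.txt.
my ($ok, $before, $after) = layers_before_and_after(\*STDIN);
warn "binmode on STDIN failed\n" unless $ok;

# Report on STDERR so that STDOUT stays free for data.
print STDERR "before: @$before\n";
print STDERR "after:  @$after\n";
```

On Windows the "before" list would normally include a crlf layer, which is the one responsible for turning 0A into 0D 0A.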

When you use infile.txt, the filename is passed as a command line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:

use open qw(:std IO :raw:encoding(UTF-16LE));

and have the magic <ARGV> processing apply the right encoding. But I haven't been able to get that to work right in this case.
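If a one-liner is not essential, one way to sidestep the magic ARGV handle altogether is to open both files explicitly with the same layers. This is a minimal sketch rather than part of the original answer: the sub name `convert_utf16le` and the script name `utf16_subst.pl` are hypothetical, and the substitution is the s/b/c/g example from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: copies $src to $dst line by line, applying s/b/c/g.
sub convert_utf16le {
    my ($src, $dst) = @_;

    # Same layers as in the binmode call of the answer: :raw drops the
    # Windows :crlf layer, :encoding(UTF-16LE) handles the characters.
    my $layers = ':raw:encoding(UTF-16LE)';

    open my $in,  "<$layers", $src or die "Cannot read $src: $!";
    open my $out, ">$layers", $dst or die "Cannot write $dst: $!";

    # One line in memory at a time, so a 1 GB file is not slurped.
    while (my $line = <$in>) {
        $line =~ s/b/c/g;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $dst: $!";
    return;
}

# Run only when called as: perl utf16_subst.pl infile.txt outfile.txt
convert_utf16le(@ARGV) if @ARGV == 2;
```

With UTF-16LE the BOM is not stripped: it arrives as the character U+FEFF at the start of the first line and is written back unchanged, which keeps FF FE at the start of the output. A pattern anchored with ^ would therefore see U+FEFF before the first real character on line one.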
