Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

Published: 2021/9/15 19:39:31. Tags: perl, unicode, utf-16

Problem description

I need to stream-process, using perl, a 1 GB text file encoded in UTF-16 little-endian with Unix-style line endings (i.e. only 0x000A, with no 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (Unix solutions are needed as well). By stream-process I mean using while (<>), reading and writing line by line. It would be nice to have a command-line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt

Hex dump of input for testing (two lines: "a" and "b" letters on each): FF FE 61 00 0A 00 62 00 0A 00

Processing like s/b/c/g should give this output ("b" replaced with "c"): FF FE 61 00 0A 00 63 00 0A 00
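
To reproduce the test case, the input file can be generated from the hex dump above. This is a minimal sketch; the file name infile.txt is taken from the one-liner in the question, and the hex dump at the end is only a sanity check.

```perl
use strict;
use warnings;

# File name as used in the question's one-liner; any path would do.
my $file = 'infile.txt';

# FF FE (little-endian BOM), then "a\n" and "b\n" encoded as UTF-16LE.
my $bytes = pack 'C*', 0xFF, 0xFE, 0x61, 0x00, 0x0A, 0x00, 0x62, 0x00, 0x0A, 0x00;

# :raw stops Windows perl from turning each 0x0A byte into 0x0D 0x0A.
open my $out, '>:raw', $file or die "Cannot write $file: $!";
print {$out} $bytes;
close $out or die "Cannot close $file: $!";

# Read the file back as raw bytes and print them in hex.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $read_back = do { local $/; <$in> };
close $in;
print join(' ', map { sprintf '%02X', ord($_) } split //, $read_back), "\n";
```

The last line should print FF FE 61 00 0A 00 62 00 0A 00, matching the dump in the question.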

PS. In all my attempts so far, either the output has a CRLF problem (0D 0A bytes are written, which produces an incorrect Unicode character; I need only 0A 00, without 0D 00, to preserve the Unix style), or the byte order flips between LE and BE on every new line, i.e. the same "a" comes out as 61 00 on odd lines and 00 61 on even lines.
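
One plausible explanation for both symptoms (an assumption, not something stated in the question) is that Windows perl's default CRLF translation works on bytes and inserts a single 0D byte before every 0A byte. Each inserted byte shifts the rest of the stream by one position, so the 16-bit units are misaligned until the next insertion. The sketch below simulates that with a plain substitution rather than a real I/O layer:

```perl
use strict;
use warnings;

# The test input from the question: BOM, "a\n", "b\n" in UTF-16LE.
my $utf16le = "\xFF\xFE\x61\x00\x0A\x00\x62\x00\x0A\x00";

# Simulate byte-level CRLF translation on output:
# every 0x0A byte gets a 0x0D byte inserted in front of it.
(my $mangled = $utf16le) =~ s/\x0A/\x0D\x0A/g;

# Interpret the result as little-endian 16-bit units and show them in hex.
my @units = unpack 'v*', $mangled;
print join(' ', map { sprintf '%04X', $_ } @units), "\n";
```

This prints FEFF 0061 0A0D 6200 0D00 000A: the first "a" survives as 0061, the line ending becomes the bogus unit 0A0D, and "b" on the next line shows up byte-swapped as 6200.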

Recommended answer

The best I've come up with is this:

perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt

But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.

The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input and opened before Perl begins running. By the time you binmode STDIN in a BEGIN block, the file is already open, so you can change its encoding.

When you use infile.txt, the filename is passed as a command-line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:
use open qw(:std IO :raw:encoding(UTF-16LE));

and have the magic <ARGV> processing apply the right encoding.  But I haven't been able to get that to work right in this case.
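
If a short script is acceptable in place of a one-liner, one way to avoid the magic ARGV handle altogether is to open both files explicitly with the same layers the binmode call uses. This is a sketch under that assumption, not the approach from the answer above; the script name convert.pl and the convert helper are hypothetical, and s/b/c/g stands in for the real substitution.

```perl
use strict;
use warnings;

# Hypothetical usage: perl convert.pl infile.txt outfile.txt
sub convert {
    my ($src, $dst) = @_;

    # :raw disables CRLF translation; :encoding(UTF-16LE) then decodes on
    # input and encodes on output. The BOM is read as the character U+FEFF
    # at the start of the first line and written back unchanged as FF FE.
    open my $in,  '<:raw:encoding(UTF-16LE)', $src or die "Cannot read $src: $!";
    open my $out, '>:raw:encoding(UTF-16LE)', $dst or die "Cannot write $dst: $!";

    # Process line by line so the whole file never has to fit in memory.
    while (my $line = <$in>) {
        $line =~ s/b/c/g;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $dst: $!";
}

# Only run when both file names are given on the command line.
convert(@ARGV) if @ARGV == 2;
```

Because the layers are attached in the open call itself, the same script should behave identically on Windows and Unix, and the output keeps the 0A 00 line endings.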
