Remove BOM from string with Perl

Question

I have the following problem: I am reading from a UTF-8 text file (and I am telling Perl that I am doing so by ":encoding(utf-8)").

The file looks like this in a hex viewer: EF BB BF 43 6F 6E 66 65 72 65 6E 63 65

This translates to "∩╗┐Conference" when printed. I understand that the "wide character" I am being warned about is the BOM. I want to get rid of it (not because of the warning, but because it breaks a string comparison that I perform later).

So I tried to remove it using the following code, but it fails:

$line =~ s/^\xEF\xBB\xBF//;

Can anyone enlighten me as to how to remove the UTF-8 BOM from a string which I obtained by reading the first line of the UTF-8 file?

Thanks!

Solution

EF BB BF is the UTF-8 encoding of the BOM, but you decoded it, so you must look for its decoded form. The BOM is a ZERO WIDTH NO-BREAK SPACE (U+FEFF) used at the start of a file, so any of the following will do:

s/^\x{FEFF}//;
s/^\N{U+FEFF}//;
s/^\N{ZERO WIDTH NO-BREAK SPACE}//;
s/^\N{BOM}//;   # Convenient alias
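
To make the difference between the encoded and decoded forms concrete, here is a minimal sketch. It assumes the bytes from the hex dump in the question and uses Encode's decode() to stand in for what an :encoding(UTF-8) input layer would produce; the helper name strip_bom is hypothetical and not part of the original answer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strip a leading BOM from an already-decoded string.
sub strip_bom {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;
    return $line;
}

# The raw bytes shown in the question's hex dump.
my $bytes = "\xEF\xBB\xBF\x43\x6F\x6E\x66\x65\x72\x65\x6E\x63\x65";

# Decoding turns the three BOM bytes into the single character U+FEFF.
my $decoded = decode('UTF-8', $bytes);

# The byte-oriented pattern from the question no longer matches...
(my $byte_attempt = $decoded) =~ s/^\xEF\xBB\xBF//;

# ...but the character-oriented pattern does.
my $clean = strip_bom($decoded);

# Lengths in characters: decoded, after the byte pattern, after strip_bom.
printf "%d %d %d\n", length($decoded), length($byte_attempt), length($clean);
```

Only the last length should drop by one, because the decoded string starts with one character (U+FEFF) rather than three bytes.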


> I understand the "wide character" which I am being warned about is the BOM. I want to get rid of it

You're getting the "Wide character" warning because you forgot to add an :encoding layer on your output file handle. The following adds :encoding(UTF-8) to STDIN, STDOUT, and STDERR, and makes it the default for open().

use open ':std', ':encoding(UTF-8)';
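
As a rough illustration of what the output layer does, the sketch below prints a string containing non-ASCII characters with the pragma in effect, and also adds the layer explicitly to a single handle. The sample string and the in-memory file are assumptions made only to keep the example self-contained; they are not from the original answer.

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # the pragma from the answer

# Illustrative string containing a character above U+00FF.
my $text = "Conf\x{E9}rence \x{263A}";

# STDOUT now has an encoding layer, so this should print without the warning.
print "$text\n";

# The same idea for one handle: name the layer explicitly in open().
# An in-memory file is used here only to avoid touching the disk.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

# $buffer holds the encoded bytes, so it is longer than the character count.
printf "%d characters, %d bytes\n", length($text), length($buffer);
```

Without an encoding layer on the output handle, Perl has no declared encoding for characters above U+00FF, which is what triggers the warning.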
