Using awk to remove the Byte-order mark


Question

What would an awk script (presumably a one-liner) for removing a BOM look like?

Specification:

  • print every line after the first (NR > 1)
  • for the first line: if it starts with #FE #FF or #FF #FE, remove those bytes and print the rest

Recommended answer

Try this:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}{print}' INFILE > OUTFILE

On the first record (line), remove the BOM bytes. Print every record.

Or slightly shorter, relying on the fact that the default action in awk is to print the record:

awk 'NR==1{sub(/^\xef\xbb\xbf/,"")}1' INFILE > OUTFILE

1 is the shortest condition that always evaluates to true, so every record is printed.
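
To check the one-liner without touching a real file, it can be wrapped in a small shell function and fed a generated test input. This is only a sketch: the function name strip_utf8_bom is made up for illustration, the BOM is written with octal escapes (\357\273\277, the same bytes as EF BB BF) on the assumption that they are more widely accepted by awk implementations than \x escapes, and LC_ALL=C is set on the assumption that byte-wise matching is wanted.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration): remove a leading
# UTF-8 BOM from the first record and print every record unchanged otherwise.
strip_utf8_bom() {
    # LC_ALL=C makes awk treat the input as plain bytes (an assumption
    # made here so the three BOM bytes are matched one by one).
    LC_ALL=C awk 'NR==1{sub(/^\357\273\277/,"")}1' "$@"
}

# Demo: build two lines of input that start with a BOM, strip it,
# and dump the resulting bytes in hex for inspection.
printf '\357\273\277first line\nsecond line\n' | strip_utf8_bom | od -An -tx1
```

With no arguments the function reads standard input; with file names it behaves like the original command, e.g. strip_utf8_bom INFILE > OUTFILE.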

Enjoy!

-- ADDENDUM --

The Unicode Byte Order Mark (BOM) FAQ includes the following table listing the exact BOM bytes for each encoding form:

Bytes         |  Encoding Form
--------------------------------------
00 00 FE FF   |  UTF-32, big-endian
FF FE 00 00   |  UTF-32, little-endian
FE FF         |  UTF-16, big-endian
FF FE         |  UTF-16, little-endian
EF BB BF      |  UTF-8

Thus, you can see how \xef\xbb\xbf corresponds to the EF BB BF UTF-8 BOM bytes in the table above.
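
The table can also be turned into a quick check of which BOM, if any, a file starts with. The following is a rough sketch rather than part of the original answer: detect_bom is a hypothetical helper name, and it assumes an od that supports the -An, -tx1 and -N options. The UTF-32 patterns are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration): print the encoding
# form implied by the BOM at the start of the file given as $1.
detect_bom() {
    # First four bytes as lowercase hex with no separators, e.g. "efbbbf68"
    hex=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case "$hex" in
        0000feff) echo "UTF-32, big-endian" ;;
        fffe0000) echo "UTF-32, little-endian" ;;
        efbbbf*)  echo "UTF-8" ;;
        feff*)    echo "UTF-16, big-endian" ;;
        fffe*)    echo "UTF-16, little-endian" ;;
        *)        echo "no BOM" ;;
    esac
}
```

One caveat of this ordering: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be reported as UTF-32, little-endian, since the first four bytes are identical in both cases.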
