RE error: illegal byte sequence on Mac OS X


Question

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

The error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons, with no luck. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

I'm having a hard time debugging the issue. Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

Recommended answer

A sample command that exhibits the symptom: sed 's/./@/' <<<$'\xfc' fails, because byte 0xfc is not a valid UTF-8 character.
Note that, by contrast, GNU sed (Linux, but also installable on macOS) simply passes the invalid byte through, without reporting an error.

Using the formerly accepted answer is an option if you don't mind losing support for your true locale (if you're on a US system and never need to deal with foreign characters, that may be fine).

However, the same effect can be had ad hoc, for a single command only:

LC_ALL=C sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

Note: What matters is an effective LC_CTYPE setting of C, so LC_CTYPE=C sed ... would normally also work, but if LC_ALL happens to be set (to something other than C), it will override individual LC_* category variables such as LC_CTYPE. Thus, the most robust approach is to set LC_ALL.
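The precedence described in the note can be sketched as a small shell function. This is only an illustration: effective_ctype is a hypothetical helper name, not a standard utility, and it simply mirrors the POSIX lookup order (LC_ALL, then LC_CTYPE, then LANG, with an empty value treated as unset).

```shell
# Hypothetical helper: print the locale name that takes effect for the
# LC_CTYPE category. Lookup order: LC_ALL, then LC_CTYPE, then LANG,
# and finally the default "C" locale. Empty values count as unset.
effective_ctype() {
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
  elif [ -n "${LC_CTYPE:-}" ]; then
    printf '%s\n' "$LC_CTYPE"
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  else
    printf 'C\n'
  fi
}

# LC_ALL is set to something other than C, so LC_CTYPE=C has no effect.
# The subshell keeps the assignments from leaking into the current shell.
( LC_ALL=POSIX; LC_CTYPE=C; effective_ctype )
```

If the helper prints anything other than C for your environment, LC_CTYPE=C alone will not be enough and LC_ALL=C is the safer prefix.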

However, (effectively) setting LC_CTYPE to C treats strings as if each byte were its own character (no interpretation based on encoding rules is performed), with no regard for the multibyte-on-demand UTF-8 encoding that OS X employs by default, where foreign characters have multibyte encodings.

In a nutshell: setting LC_CTYPE to C causes the shell and utilities to recognize only basic English letters as letters (the ones in the 7-bit ASCII range), so that foreign characters will not be treated as letters, causing, for instance, upper-/lowercase conversions to fail.

Again, this may be fine if you needn't match multibyte-encoded characters such as é, and simply want to pass such characters through.
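The letter-classification point can be seen with a case conversion. The following is a minimal sketch, assuming the input is UTF-8 (so à is the two bytes 0xC3 0xA0); the variable names are made up for the illustration.

```shell
# 'voilà' in UTF-8: à is written as the octal escapes \303\240 (0xC3 0xA0).
word=$(printf 'voil\303\240')

# In the C locale only ASCII a-z are lowercase letters, so tr uppercases
# 'voil' and passes the two bytes of à through unchanged.
upper_c=$(printf '%s\n' "$word" | LC_ALL=C tr '[:lower:]' '[:upper:]')

printf '%s\n' "$upper_c"   # -> VOILà (the accented letter is not uppercased)
```

The bytes survive intact, which is the "pass through" behavior described above; they are just never treated as a letter.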

If this is insufficient and/or you want to understand the cause of the original error (including determining which input bytes caused the problem) and perform encoding conversions on demand, read on below.

The problem is that the input file's encoding does not match the shell's.
More specifically, the input file contains characters encoded in a way that is not valid in UTF-8 (as @Klas Lindbäck stated in a comment) - that's what the sed error message is trying to say with "illegal byte sequence".

Most likely, your input file uses a single-byte 8-bit encoding such as ISO-8859-1, frequently used to encode "Western European" languages.

Example:

The accented letter à has Unicode code point 0xE0 (224) - the same value as in ISO-8859-1. However, due to the nature of UTF-8 encoding, this single code point is represented as 2 bytes - 0xC3 0xA0 - whereas the single byte 0xE0 on its own is invalid under UTF-8.
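The two bytes follow from the UTF-8 layout for code points U+0080 through U+07FF, which is 110xxxxx 10xxxxxx. As a worked example (variable names are illustrative only), shell arithmetic reproduces them:

```shell
cp=$((0xE0))                   # code point of à: 224, binary 00011100000

# Lead byte: marker bits 110 followed by the top 5 of the 11 payload bits.
b1=$(( 0xC0 | (cp >> 6) ))     # 0xC0 | 0x03 = 0xC3

# Continuation byte: marker bits 10 followed by the low 6 payload bits.
b2=$(( 0x80 | (cp & 0x3F) ))   # 0x80 | 0x20 = 0xA0

utf8_hex=$(printf '%02X %02X' "$b1" "$b2")
printf '%s\n' "$utf8_hex"      # -> C3 A0
```

A lone 0xE0 byte, by contrast, has the bit pattern 1110xxxx, which in UTF-8 announces a three-byte sequence; with no continuation bytes after it, the sequence is incomplete and therefore illegal.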

Here's a demonstration of the problem using the string voilà encoded as ISO-8859-1, with the à represented as one byte (via an ANSI C-quoted bash string ($'...') that uses \x{e0} to create the byte):

Note that the sed command is effectively a no-op that simply passes the input through, but we need it to provoke the error:

# -> 'illegal byte sequence': byte 0xE0 is not a valid char.
sed 's/.*/&/' <<<$'voil\x{e0}'

To simply ignore the problem, the LC_CTYPE=C approach from above can be used:

# No error, bytes are passed through ('à' will render as '?', though).
LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'

If you want to determine which parts of the input cause the problem, try the following:

# Convert bytes in the 8-bit range (high bit set) to hex. representation.
# -> 'voil\x{e0}'
iconv -f ASCII --byte-subst='\x{%02x}' <<<$'voil\x{e0}'

The output will show you all bytes that have the high bit set (bytes that exceed the 7-bit ASCII range) in hexadecimal form. (Note, however, that this also includes correctly encoded UTF-8 multibyte sequences - a more sophisticated approach would be needed to specifically identify invalid-in-UTF-8 bytes.)
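One possible way to narrow the search to bytes that are actually invalid in UTF-8 is to let iconv validate each line. The sketch below is not from the original answer: invalid_utf8_lines is a hypothetical helper, and it assumes your iconv exits with a non-zero status when the input is not valid UTF-8 (GNU libiconv and glibc iconv both behave this way).

```shell
# Hypothetical helper: print the line numbers of FILE that are not valid UTF-8.
# Each line is fed to iconv as a UTF-8 to UTF-8 conversion, which acts as a
# validator; a non-zero exit status marks the line as invalid.
invalid_utf8_lines() {
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if ! printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$n"
    fi
  done < "$1"
}

# Demo: line 2 holds the ISO-8859-1 byte 0xE0 (octal 340), line 3 holds
# the correctly encoded UTF-8 form 0xC3 0xA0 (octal 303 240).
sample=$(mktemp)
printf 'plain ascii\nvoil\340\nvoil\303\240\n' > "$sample"
bad_lines=$(invalid_utf8_lines "$sample")
printf '%s\n' "$bad_lines"   # expected: 2 (only the ISO-8859-1 line)
rm -f "$sample"
```

Running it against Configure would list the lines worth inspecting with the --byte-subst command above. Spawning iconv once per line is slow on large files, so treat this as a diagnostic aid rather than a general tool.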

Performing encoding conversions on demand:

The standard utility iconv can be used to convert to (-t) and/or from (-f) encodings; iconv -l lists all supported ones.

Examples:

Convert FROM ISO-8859-1 to the encoding in effect in the shell (based on LC_CTYPE, which is UTF-8-based by default), building on the above example:

# Converts to UTF-8; output renders correctly as 'voilà'
sed 's/.*/&/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

Note that this conversion allows you to properly match foreign characters:

# Correctly matches 'à' and replaces it with 'ü': -> 'voilü'
sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

To convert the input BACK to ISO-8859-1 after processing, simply pipe the result to another iconv command:

sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')" | iconv -t ISO-8859-1
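For a file on disk, such as the Configure file from the question, the same round trip can be wrapped in a function. This is a sketch under stated assumptions: sed_latin1 is a hypothetical helper name, the file is assumed to really be ISO-8859-1 (the answer only says this is most likely), and temporary files are used in place of sed -i so that a failing step leaves the original untouched.

```shell
# Hypothetical helper: sed_latin1 SCRIPT FILE
# Converts FILE from ISO-8859-1 to UTF-8, applies the sed SCRIPT, converts the
# result back to ISO-8859-1, and overwrites FILE only if every step succeeded.
sed_latin1() {
  script=$1
  file=$2
  tmp_a=$(mktemp) || return 1
  tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
  status=1
  if iconv -f ISO-8859-1 -t UTF-8 "$file" > "$tmp_a" &&
     sed "$script" "$tmp_a" > "$tmp_b" &&
     iconv -f UTF-8 -t ISO-8859-1 "$tmp_b" > "$tmp_a"
  then
    cat "$tmp_a" > "$file"   # overwrite in place, keeping the file's permissions
    status=0
  fi
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}

# Demo on a scratch file containing 'voil' followed by byte 0xE0 (octal 340).
demo=$(mktemp)
printf 'voil\340\n' > "$demo"
sed_latin1 's/voil/VOIL/' "$demo"
result_hex=$(od -An -tx1 < "$demo" | tr -d ' \n')
printf '%s\n' "$result_hex"   # hex dump of the edited file
rm -f "$demo"
```

For the original problem the call would look like sed_latin1 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure. The final conversion fails, and the file is left as it was, if the sed script introduces characters that ISO-8859-1 cannot represent.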
