RE error: illegal byte sequence on Mac OS X

Question

I'm trying to replace a string in a Makefile on Mac OS X for cross-compiling to iOS. The string has embedded double quotes. The command is:

sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

The error is:

sed: RE error: illegal byte sequence

I've tried escaping the double quotes, commas, dashes, and colons with no joy. For example:

sed -i "" 's|\"iphoneos-cross\"\,\"llvm-gcc\:\-O3|\"iphoneos-cross\"\,\"clang\:\-Os|g' Configure

Does anyone know how to get sed to print the position of the illegal byte sequence? Or does anyone know what the illegal byte sequence is?

Recommended answer

Using the formerly accepted answer is an option if you don't mind losing support for your true locale (if you're on a US system and you never need to deal with foreign characters, that may be fine.)

However, the same effect can be had ad-hoc for a single command only:

LC_ALL=C sed -i "" 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure

Note: What matters is an effective LC_CTYPE setting of C, so LC_CTYPE=C sed ... would normally also work, but if LC_ALL happens to be set (to something other than C), it will override individual LC_*-category variables such as LC_CTYPE. Thus, the most robust approach is to set LC_ALL.
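
The precedence described above can be sketched in plain shell. The helper name effective_ctype is hypothetical; it only mirrors the POSIX lookup order (LC_ALL, then LC_CTYPE, then LANG) and assumes the fallback is the C locale when none of them is set, which is typical but implementation-defined:

```shell
# Hypothetical helper: print the locale name that governs character
# classification, following the POSIX lookup order:
#   LC_ALL (if set and non-empty), else LC_CTYPE, else LANG.
# Assumption: falls back to "C" when none of the three is set.
effective_ctype() {
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# LC_ALL wins over the category variable ...
( LC_ALL=en_US.UTF-8; LC_CTYPE=C; effective_ctype )   # prints en_US.UTF-8
# ... so only setting LC_ALL itself guarantees the C locale.
( LC_ALL=C; LC_CTYPE=en_US.UTF-8; effective_ctype )   # prints C
```

This is why LC_ALL=C is the safer prefix for a one-off command: it cannot be overridden by anything already present in the environment.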

However, (effectively) setting LC_CTYPE to C treats strings as if each byte were its own character (no interpretation based on encoding rules is performed), with no regard for the - multibyte-on-demand - UTF-8 encoding that OS X employs by default, where foreign characters have multibyte encodings.

In a nutshell: setting LC_CTYPE to C causes the shell and utilities to only recognize basic English letters as letters (the ones in the 7-bit ASCII range), so that foreign chars. will not be treated as letters, causing, for instance, upper-/lowercase conversions to fail.

Again, this may be fine if you needn't match multibyte-encoded characters such as é, and simply want to pass such characters through.

If this is insufficient and/or you want to understand the cause of the original error (including determining what input bytes caused the problem) and perform encoding conversions on demand, read on below.

The problem is that the input file's encoding does not match the shell's.
More specifically, the input file contains characters encoded in a way that is not valid in UTF-8 (as @KlasLindbäck stated in a comment) - that's what the sed error message is trying to say by illegal byte sequence.

Most likely, your input file uses a single-byte 8-bit encoding such as ISO-8859-1, frequently used to encode "Western European" languages.

Example:

The accented letter à has Unicode codepoint 0xE0 (224) - the same as in ISO-8859-1. However, due to the nature of UTF-8 encoding, this single codepoint is represented as 2 bytes - 0xC3 0xA0, whereas trying to pass the single byte 0xE0 is invalid under UTF-8.
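
To see the byte-level difference for yourself, the following sketch converts the single ISO-8859-1 byte 0xE0 to UTF-8 and dumps the result in hex. It assumes iconv and od are available, and it uses the octal escape \340 (= 0xE0) because that is the form POSIX printf guarantees. The helper name hex_bytes is hypothetical:

```shell
# Hypothetical helper: print stdin as one continuous string of hex digits.
hex_bytes() {
  od -An -tx1 | tr -d '[:space:]'
  echo
}

# à as a single ISO-8859-1 byte:
printf '\340' | hex_bytes                                   # prints e0
# The same character after conversion to UTF-8 occupies two bytes:
printf '\340' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes    # prints c3a0
```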

Here's a demonstration of the problem using the string voilà encoded as ISO-8859-1, with the à represented as one byte (via an ANSI-C-quoted bash string ($'...') that uses \x{e0} to create the byte):

Note that the sed command is effectively a no-op that simply passes the input through, but we need it to provoke the error:

  # -> 'illegal byte sequence': byte 0xE0 is not a valid char.
sed 's/.*/&/' <<<$'voil\x{e0}'

To simply ignore the problem, the above LC_CTYPE=C approach can be used:

  # No error, bytes are passed through ('à' will render as '?', though).
LC_CTYPE=C sed 's/.*/&/' <<<$'voil\x{e0}'

If you want to determine which parts of the input cause the problem, try the following:

  # Convert bytes in the 8-bit range (high bit set) to hex. representation.
  # -> 'voil\x{e0}'
iconv -f ASCII --byte-subst='\x{%02x}' <<<$'voil\x{e0}'

The output will show you all bytes that have the high bit set (bytes that exceed the 7-bit ASCII range) in hexadecimal form. (Note, however, that that also includes correctly encoded UTF-8 multibyte sequences - a more sophisticated approach would be needed to specifically identify invalid-in-UTF-8 bytes.)
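
One way to narrow the search to bytes that are actually invalid in UTF-8 is to let iconv decode the input as UTF-8 and watch for failure. This is a sketch rather than a complete solution: the helper name is_valid_utf8 is hypothetical, and the wording of iconv's diagnostic (which typically mentions where the first offending byte sits) differs between implementations:

```shell
# Hypothetical helper: succeed only if FILE decodes cleanly as UTF-8.
# iconv exits with a non-zero status at the first invalid sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

sample=$(mktemp)
printf 'voil\340\n' > "$sample"    # ISO-8859-1 bytes, not valid UTF-8

if is_valid_utf8 "$sample"; then
  echo "valid UTF-8"
else
  echo "contains bytes that are invalid in UTF-8"
  # Run iconv again without discarding stderr to see its diagnostic.
fi
rm -f "$sample"
```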

Performing encoding conversions on demand

Standard utility iconv can be used to convert to (-t) and/or from (-f) encodings; iconv -l lists all supported ones.

Examples:

Convert FROM ISO-8859-1 to the encoding in effect in the shell (based on LC_CTYPE, which is UTF-8-based by default), building on the above example:

  # Converts to UTF-8; output renders correctly as 'voilà'
sed 's/.*/&/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

Note that this conversion allows you to properly match foreign characters:

  # Correctly matches 'à' and replaces it with 'ü': -> 'voilü'
sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')"

To convert the input BACK to ISO-8859-1 after processing, simply pipe the result to another iconv command:

sed 's/à/ü/' <<<"$(iconv -f ISO-8859-1 <<<$'voil\x{e0}')" | iconv -t ISO-8859-1
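
Applied to a file rather than a here-string, the same round trip might look like the sketch below. The helper name sed_via_utf8 and the output file name are hypothetical, and the source encoding is assumed to be ISO-8859-1; substitute whatever encoding the file really uses:

```shell
# Hypothetical helper: run a sed script on an ISO-8859-1 file by way of UTF-8.
# Usage: sed_via_utf8 SCRIPT INFILE OUTFILE
sed_via_utf8() {
  iconv -f ISO-8859-1 -t UTF-8 "$2" |
    sed "$1" |
    iconv -f UTF-8 -t ISO-8859-1 > "$3"
}

# Example with the substitution from the question (writes Configure.new):
# sed_via_utf8 's|"iphoneos-cross","llvm-gcc:-O3|"iphoneos-cross","clang:-Os|g' Configure Configure.new
```

The result goes to a separate file on purpose: redirecting the pipeline back onto its own input would truncate the file before iconv has read it.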
