Why does sed fail with international characters, and how can it be fixed?
Question
GNU sed version 4.1.5 seems to fail with international characters. Here is my input file:
Gras Och Stenar Trad - From Moja to Minneapolis DVD [G2007DVD] 7812 | X
Gras Och Stenar Trad - From Möja to Minneapolis DVD [G2007DVD] 7812 | Y
(Note the umlaut in the second line.)
When I run
sed 's/.*| //' < in
I would expect to see only X and Y, since I've asked sed to remove all characters up to the '|' and the space after it. Instead, I get:
X
Gras Och Stenar Trad - From M? Y
I know I can use tr to remove the international characters first, but is there a way to do this with sed alone?
Recommended answer
I think the error occurs when the encoding of the input file differs from the encoding that your environment's locale expects.
Example: the input file 'in' is UTF-8
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Y
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
UTF-8 can safely be interpreted as ISO-8859-1, because every byte is a valid character in that single-byte encoding. You'll get strange characters in place of the non-ASCII ones, but apart from that everything works.
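To make the "strange characters" concrete, here is a small sketch that assumes od and iconv are available. It takes the two UTF-8 bytes of "ö", shows them in hex, and then deliberately misreads them as ISO-8859-1. The variable names are only for illustration.

```shell
# "ö" in UTF-8 is the two bytes 0xC3 0xB6 (octal 303 266).
utf8_bytes=$(printf '\303\266' | od -An -tx1 | tr -d ' \n')

# Reading those two bytes as ISO-8859-1 yields two separate characters.
# iconv re-encodes them as UTF-8 so that a UTF-8 terminal can display them.
misread=$(printf '\303\266' | iconv -f ISO-8859-1 -t UTF-8)

printf '%s\n' "$utf8_bytes" "$misread"
```

On a UTF-8 terminal the second line should show two characters instead of one "ö". Nothing is rejected along the way, which is why sed still matches the whole line in this case.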
Example: the input file 'in' is ISO-8859-1
$ LANG=de_DE.UTF-8 sed 's/.*| //' < in
X
Gras Och Stenar Trad - From MöY
$ LANG=de_DE.iso88591 sed 's/.*| //' < in
X
Y
ISO-8859-1 cannot be interpreted as UTF-8: the single byte that encodes the umlaut is not a valid UTF-8 sequence, so decoding the input fails at that point. The strange match is probably because sed tries to recover rather than fail completely. The '.' does not match the invalid byte, so the pattern matches only the text after it.
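Following this diagnosis, two possible fixes can be sketched, assuming the file really is ISO-8859-1 and that iconv is installed: either run sed in the C locale so the input is treated as raw bytes, or convert the file to UTF-8 before it reaches sed. The file contents below are a shortened stand-in for the original input, and the variable names are only for illustration.

```shell
# Recreate a two-line input as ISO-8859-1 bytes (octal 366 is "ö" there).
printf 'From Moja to Minneapolis DVD [G2007DVD] 7812 | X\n' > in
printf 'From M\366ja to Minneapolis DVD [G2007DVD] 7812 | Y\n' >> in

# Option 1: force the C locale, so "." matches every single byte.
bytes_result=$(LC_ALL=C sed 's/.*| //' < in)

# Option 2: convert to UTF-8 first, then run sed on valid UTF-8 input.
converted_result=$(iconv -f ISO-8859-1 -t UTF-8 < in | sed 's/.*| //')

printf '%s\n' "$bytes_result" "$converted_result"
```

Option 1 leaves the bytes of the file untouched, which is convenient when the output must stay in the original encoding. Option 2 is the better fit when the rest of the pipeline expects UTF-8.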
This answer is based on Debian Lenny/Sid and sed 4.1.5.