Will sed (and others) corrupt non-ASCII files?

Question

If I write scripts that manipulate files, for example doing search/replace with sed, and the files can be in various charsets, can the files be corrupted?

The text I wish to replace is ASCII, and it only occurs on lines that contain nothing but ASCII; the remaining lines contain characters in other charsets.

Recommended answer

If your charsets are single-byte encodings (like the ISO-8859-n family) or UTF-8, where the newline character is encoded the same way as in ASCII and the NUL character (\0) does not occur, your operation is likely to work. If the files use UTF-16, it will not, because of the NULs: every ASCII character is stored as two bytes, one of which is NUL, so an ASCII pattern no longer matches the bytes in the file. The reason it works for simple search and replacement of ASCII strings is that, assuming the encoding is a superset of ASCII, sed mostly operates at the byte level for a simple match like this and just replaces one byte sequence with another.
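The byte-level behaviour can be checked with a small experiment. The following is a minimal sketch, assuming a POSIX shell with sed, od, tr, and mktemp available; the file names and the sample text ("café foo") are made up for illustration, and the bytes are written with printf octal escapes so the script itself stays pure ASCII.

```shell
#!/bin/sh
# Sketch: byte-level replacement on ISO-8859-1 input vs. UTF-16LE input.
# File names and sample text are hypothetical.
tmp=$(mktemp -d)

# "café foo" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf 'caf\351 foo\n' > "$tmp/latin1.txt"
LC_ALL=C sed 's/foo/bar/' "$tmp/latin1.txt" > "$tmp/latin1.out"

# Hex dump of the result, so the surviving 0xE9 byte is visible.
latin1_hex=$(od -An -tx1 "$tmp/latin1.out" | tr -d ' \t\n')
echo "latin1 result: $latin1_hex"

# "foo" plus newline in UTF-16LE: each ASCII byte is followed by a NUL byte,
# so the three bytes "foo" never appear next to each other.
printf 'f\000o\000o\000\n\000' > "$tmp/utf16.txt"

# -n together with the p flag prints only lines where a substitution happened.
utf16_hits=$(LC_ALL=C sed -n 's/foo/bar/p' "$tmp/utf16.txt" | wc -l | tr -d ' ')
echo "utf16 lines changed: $utf16_hits"
```

In the ISO-8859-1 case the dump should still contain `e9` right after `636166`, while in the UTF-16LE case no line is reported as changed, because the pattern never matches.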

However, with more complex operations, such as when the pattern or the replacement string contains special (non-ASCII) characters, your results may vary. For example, accented characters typed on the command line may not match the encoding used in the file if the console encoding/locale differs from the file encoding. This can be worked around, but it requires care.
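To illustrate the mismatch, here is a hedged sketch of one possible workaround: convert the file to the encoding of the pattern, edit it, and convert it back. It assumes an `iconv` utility is installed and that the file is known to be ISO-8859-1; the file names and sample text are hypothetical, and this is only one way to handle it, not necessarily what the answer had in mind.

```shell
#!/bin/sh
# Sketch: a pattern typed in a UTF-8 terminal vs. a file stored in ISO-8859-1.
# Assumes iconv is available; file names are hypothetical.
tmp=$(mktemp -d)

# "café" in ISO-8859-1: é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$tmp/latin1.txt"

# é as a UTF-8 terminal would send it: the two bytes 0xC3 0xA9.
e_utf8=$(printf '\303\251')

# Direct attempt: the two-byte pattern does not occur in the one-byte file.
direct_hits=$(LC_ALL=C sed -n "s/$e_utf8/e/p" "$tmp/latin1.txt" | wc -l | tr -d ' ')
echo "direct matches: $direct_hits"

# Workaround: convert to UTF-8, substitute, convert back to ISO-8859-1.
iconv -f ISO-8859-1 -t UTF-8 "$tmp/latin1.txt" \
  | LC_ALL=C sed "s/$e_utf8/e/" \
  | iconv -f UTF-8 -t ISO-8859-1 > "$tmp/fixed.txt"

fixed_hex=$(od -An -tx1 "$tmp/fixed.txt" | tr -d ' \t\n')
echo "fixed result: $fixed_hex"
```

The round trip only makes sense when the source encoding is actually known; guessing it wrong is exactly the kind of corruption the question is worried about.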

Some operations in sed depend on your locale, for example which characters are considered alphanumeric. Compare the following replacement performed in a Polish UTF-8 locale and in the C locale, which uses ASCII:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/[[:alnum:]]/X/g'
XXX XXXXXX
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/[[:alnum:]]/X/g'
Xęś XęXXłX

But if you only want to replace literal ASCII strings, it works as expected in both locales:

$ echo "gęś gęgała" | LC_ALL=pl_PL.UTF-8 sed -e 's/g/G/g'
Gęś GęGała
$ echo "gęś gęgała" | LC_ALL=C sed -e 's/g/G/g'
Gęś GęGała

As you can see, the results of the [[:alnum:]] replacement differ because accented characters are treated differently depending on the locale, while the literal replacement gives the same output in both. In short: replacing literal ASCII strings will most probably work fine; more complex operations need looking into and may or may not work.
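If you want some reassurance before running such a script on real data, one possible sanity check is to compare only the non-ASCII bytes of the input and the output. The sketch below assumes a POSIX shell with sed, tr, cmp, and mktemp; the file names and the `foo=1` line are invented for the example, and the check only confirms that bytes at or above 0x80 are unchanged, not that the edit did what you intended.

```shell
#!/bin/sh
# Sketch: verify that an ASCII-only substitution left all non-ASCII bytes alone.
# File names and sample content are hypothetical.
tmp=$(mktemp -d)

# Line 1: "gęś gęgała" in UTF-8 (ę = C4 99, ś = C5 9B, ł = C5 82).
# Line 2: a pure-ASCII line that contains the text to replace.
printf 'g\304\231\305\233 g\304\231ga\305\202a\nfoo=1\n' > "$tmp/in.txt"

# Force byte-level processing and replace the literal ASCII string.
LC_ALL=C sed 's/foo/bar/' "$tmp/in.txt" > "$tmp/out.txt"

# Delete every ASCII byte (0x00-0x7F), keeping only the bytes >= 0x80.
LC_ALL=C tr -d '\000-\177' < "$tmp/in.txt"  > "$tmp/in.high"
LC_ALL=C tr -d '\000-\177' < "$tmp/out.txt" > "$tmp/out.high"

# If the remaining byte streams are identical, no non-ASCII byte was touched.
if cmp -s "$tmp/in.high" "$tmp/out.high"; then
  high_bytes=preserved
else
  high_bytes=changed
fi
echo "non-ASCII bytes: $high_bytes"
```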
