Convert UTF8 to UTF16 using iconv


Problem description


When I use iconv to convert from UTF-16 to UTF-8, everything works, but the reverse conversion does not. I have these files:

a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The text looks OK in an editor. When I run this:

iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16.strings

Then I get this result:

b-16.strings:    data
a-16.strings:    Little-endian UTF-16 Unicode c program text
a-8.strings:     UTF-8 Unicode c program text, with very long lines

The file utility does not show the expected file format, and the text does not look right in an editor either. Could it be that iconv does not write a proper BOM? I am running it on the Mac command line.

Why isn't b-16.strings in proper UTF-16LE format? Is there another way to convert UTF-8 to UTF-16?

More details are below.

$ iconv -f UTF-8 -t UTF-16LE a-8.strings > b-16le-BAD-fromUTF8.strings
$ iconv -f UTF-8 -t UTF-16 a-8.strings > b-16be.strings 
$ iconv -f UTF-16 -t UTF-16LE b-16be.strings > b-16le-BAD-fromUTF16BE.strings

$ file *s
a-16.strings:                   Little-endian UTF-16 Unicode c program text, with very long lines
a-8.strings:                    UTF-8 Unicode c program text, with very long lines
b-16be.strings:                 Big-endian UTF-16 Unicode c program text, with very long lines
b-16le-BAD-fromUTF16BE.strings: data
b-16le-BAD-fromUTF8.strings:    data


$ od -c a-16.strings | head
0000000  377 376   /  \0   *  \0      \0  \f 001   E  \0   S  \0   K  \0

$ od -c a-8.strings | head 
0000000    /   *   *   *       Č  **   E   S   K   Y       (   J   V   O

$ od -c b-16be.strings | head
0000000  376 377  \0   /  \0   *  \0   *  \0   *  \0     001  \f  \0   E

$ od -c b-16le-BAD-fromUTF16BE.strings | head                                
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

$ od -c b-16le-BAD-fromUTF8.strings | head
0000000    /  \0   *  \0   *  \0   *  \0      \0  \f 001   E  \0   S  \0

It is clear that the BOM is missing whenever I convert to UTF-16LE. Any help on this?

Solution

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16.strings, you should get a valid UTF-8 version of the original file.

Try running od -c on the files to see their actual contents.
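If all you want to know is whether a file starts with a BOM, looking at its first two bytes is enough. Here is a minimal sketch, assuming a POSIX shell and od; bom_of is a made-up helper name, not part of iconv or file. It cannot tell a UTF-16LE BOM from a UTF-32LE one, since both begin with ff fe.

```shell
#!/bin/sh
# bom_of FILE: report which UTF-16 byte order mark, if any, starts FILE.
# (bom_of is a hypothetical helper name used only for this illustration.)
bom_of() {
    # Read only the first two bytes and print them as hex with whitespace removed.
    first=$(od -An -tx1 -N2 "$1" | tr -d '[:space:]')
    case "$first" in
        fffe) echo "UTF-16LE BOM" ;;
        feff) echo "UTF-16BE BOM" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}
```

For example, bom_of b-16le-BAD-fromUTF8.strings should report no BOM for the file whose od dump starts with / \0 * \0.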

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
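One way to sidestep the printf question is to write the BOM bytes as octal escapes, which POSIX specifies for printf (the \x form is an extension that not every shell supports). A sketch along those lines, with utf8_to_utf16le_bom as a made-up helper name:

```shell
#!/bin/sh
# utf8_to_utf16le_bom FILE: write FILE (assumed to be UTF-8) to stdout as
# little-endian UTF-16 preceded by a BOM.
# (utf8_to_utf16le_bom is a hypothetical helper name.)
utf8_to_utf16le_bom() {
    # \377\376 are the octal forms of 0xFF 0xFE, the UTF-16LE BOM.
    printf '\377\376'
    # -t UTF-16LE emits the text itself without a BOM, so we supplied one above.
    iconv -f UTF-8 -t UTF-16LE "$1"
}
```

Usage would look like utf8_to_utf16le_bom a-8.strings > b-16.strings. If the input already begins with a UTF-8 BOM, the output may end up with two, so check with od -c first.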

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
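To see why the swab trick works, note that swapping every pair of bytes flips the BOM together with the text, so a big-endian stream with a BOM comes out as a little-endian stream with a BOM. A small sketch on hand-written bytes rather than real iconv output (swap_pairs is a made-up helper name):

```shell
#!/bin/sh
# swap_pairs: copy stdin to stdout, swapping every pair of adjacent bytes.
# (swap_pairs is a hypothetical helper name wrapping dd conv=swab.)
swap_pairs() {
    dd conv=swab 2>/dev/null
}

# Hand-written big-endian UTF-16 for "A": BOM fe ff, then 00 41.
printf '\376\377\000A' | swap_pairs | od -An -tx1
# The bytes come out as ff fe 41 00 (exact spacing depends on the od implementation).
```

This only helps when -t utf-16 produces the opposite byte order from the one you want, which is why you need to know what your iconv emits.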
