How do I read Unicode-16 strings from a file using POSIX methods in Linux?
Question
I have a file containing UTF-16 strings that I would like to read into a Linux program. The strings were written raw from Windows' internal WCHAR format. (Does Windows always use UTF-16, e.g. in Japanese versions?)
I believe that I can read them using raw reads and then converting with wcstombs_l. However, I cannot figure out what locale to use. Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
Is there a better way?
Update: the correct answer and others below helped point me to using libiconv. Here's a function I'm using to do the conversion. I currently have it inside a class that turns each conversion into a single line of code.
#include <errno.h>
#include <iconv.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Function for converting a NUL-terminated UTF-16LE string to UTF-8.
// It will allocate the space needed for dest. The caller is
// responsible for freeing the memory.
// src is uint16_t rather than wchar_t: wchar_t is 32 bits wide on Linux,
// so wcslen() and sizeof(wchar_t) would misread raw UTF-16 data there.
static int iwcstombs_alloc(char **dest, const uint16_t *src)
{
    const char from[] = "UTF-16LE";
    const char to[] = "UTF-8";

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
    {
        fprintf(stderr, "iconv_open(\"%s\", \"%s\") failed: %s\n",
                to, from, strerror(errno));
        return -1;
    }

    // Count the 16-bit code units in src, including the terminating NUL,
    // so that the terminator is converted along with the text.
    size_t units = 0;
    while (src[units] != 0)
        units++;
    units++;

    // How much space do we need?
    // A UTF-16 code unit never needs more than 3 bytes in UTF-8 (a
    // surrogate pair is two units and becomes 4 bytes), so this bound
    // is always sufficient and no reallocation loop is required.
    size_t destLen = units * 3;

    // Allocate space
    *dest = (char *)malloc(destLen);
    if (*dest == NULL)
    {
        iconv_close(cd);
        return -1;
    }

    // Convert
    char *inBuf = (char *)src;
    size_t inBufBytesLeft = units * sizeof(uint16_t);
    char *outBuf = *dest;
    size_t outBufBytesLeft = destLen;
    size_t rc = iconv(cd,
                      &inBuf,
                      &inBufBytesLeft,
                      &outBuf,
                      &outBufBytesLeft);
    if (rc == (size_t)-1)
    {
        fprintf(stderr, "iconv() failed: %s\n", strerror(errno));
        iconv_close(cd);
        free(*dest);
        *dest = NULL;
        return -1;
    }

    iconv_close(cd);
    return 0;
} // iwcstombs_alloc()
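When the output size cannot be bounded in advance, iconv() signals a full output buffer by failing with errno set to E2BIG, and it leaves the input and output pointers positioned so the call can be resumed. The following is only a sketch of that grow-and-retry pattern, not the function from the update: the name utf16le_to_utf8_grow and the starting capacity are assumptions made for illustration.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert in_bytes bytes of UTF-16LE to a newly
 * allocated, NUL-terminated UTF-8 string, growing the output buffer
 * whenever iconv() reports E2BIG. Returns NULL on failure; the caller
 * frees the result. */
static char *utf16le_to_utf8_grow(const char *in, size_t in_bytes)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t cap = in_bytes / 2 + 16;   /* deliberately small first guess */
    size_t used = 0;
    char *out = malloc(cap);
    char *in_ptr = (char *)in;        /* iconv() takes char **, input is not modified */
    size_t in_left = in_bytes;

    while (out != NULL) {
        char *out_ptr = out + used;
        size_t out_left = cap - used - 1;      /* keep room for '\0' */
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        used = (size_t)(out_ptr - out);

        if (rc != (size_t)-1) {                /* all input consumed */
            out[used] = '\0';
            break;
        }
        if (errno != E2BIG) {                  /* malformed or truncated input */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                              /* output full: grow and resume */
        char *bigger = realloc(out, cap);
        if (bigger == NULL)
            free(out);
        out = bigger;
    }

    iconv_close(cd);
    return out;
}
```

The caller passes the raw bytes read from the file together with their count, so no terminator is needed in the input and embedded zero bytes are not a problem.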
Recommended answer
(Does Windows always use UTF-16? e.g. in Japanese versions)
Yes, NT's WCHAR is always UTF-16LE.
(The ‘system codepage’, which for Japanese installs is indeed cp932/Shift-JIS, still exists in NT for the benefit of the many, many applications that aren't Unicode-native, for FAT32 paths, and so on.)
However, wchar_t is not guaranteed to be 16 bits, and on Linux it won't be: there it is 32 bits wide and holds UTF-32 (UCS-4). So wcstombs_l is unlikely to be happy.
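The width difference matters as soon as any wcs* function touches the data. As a small sketch (the helper names wchar_width and utf16_unit_count are made up for illustration), the platform's wchar_t size can be checked directly, and UTF-16 data can be measured with an explicitly 16-bit type instead of wcslen():

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Bytes per wchar_t on this platform: 4 with glibc on Linux, 2 on Windows. */
static size_t wchar_width(void)
{
    return sizeof(wchar_t);
}

/* Count UTF-16 code units up to (not including) the 16-bit NUL terminator.
 * This stands in for wcslen(), which steps through the buffer in
 * sizeof(wchar_t)-byte units and so misreads UTF-16 data wherever
 * wchar_t is not 16 bits wide. */
static size_t utf16_unit_count(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Note that the count is in code units, not characters: a character outside the Basic Multilingual Plane occupies two units.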
The Right Thing would be to use a library like iconv to read it into whichever format you are using internally, presumably wchar_t. You could try to hack it yourself by poking bytes in, but you'd probably get things like surrogate pairs wrong.
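A minimal sketch of that approach follows. It assumes the iconv implementation accepts the encoding name "WCHAR_T" for the platform's native wchar_t representation (glibc and GNU libiconv do; other implementations may not), and the function name utf16le_to_wcs is hypothetical. Surrogate pairs are handled by iconv itself.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert in_bytes bytes of UTF-16LE data to a newly
 * allocated, NUL-terminated wchar_t string. Returns NULL on failure; the
 * caller frees the result. */
static wchar_t *utf16le_to_wcs(const char *in, size_t in_bytes)
{
    /* Assumption: "WCHAR_T" is a recognised encoding name here. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return NULL;

    /* One wchar_t per 16-bit unit is an upper bound (a surrogate pair
     * collapses two units into one character when wchar_t is 32 bits),
     * plus one slot for the terminator. */
    size_t max_chars = in_bytes / 2 + 1;
    wchar_t *out = malloc(max_chars * sizeof(wchar_t));
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)in;    /* iconv() takes char **, input is not modified */
    size_t in_left = in_bytes;
    char *out_ptr = (char *)out;
    size_t out_left = (max_chars - 1) * sizeof(wchar_t);

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {       /* malformed or truncated input */
        free(out);
        return NULL;
    }

    /* Terminate just past the last character written. */
    out[(size_t)(out_ptr - (char *)out) / sizeof(wchar_t)] = L'\0';
    return out;
}
```

The input would typically be the bytes obtained from a plain read() or fread() on the file. If "WCHAR_T" is unavailable, converting to "UTF-32LE" into a uint32_t buffer is a possible alternative on little-endian systems.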
Running "locale -a" on my up-to-date Ubuntu and Mac OS X machines yields zero locales with utf-16 in their names.
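Indeed, Linux can't use UTF-16 as a locale default encoding thanks to all the \0s: UTF-16 text is full of zero bytes, and the NUL-terminated char strings used throughout the C library and system calls treat the first one as the end of the string.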
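The effect is easy to see with any byte-oriented string function. A tiny illustration (the array and function names are made up for the example):

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 16-bit NUL terminator.
 * Every ASCII character carries a zero high byte. */
static const char hi_utf16le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* Length as seen by byte-oriented C string functions: they stop at the
 * first zero byte, which here is the high byte of 'H'. */
static size_t apparent_length(void)
{
    return strlen(hi_utf16le);
}
```

Here apparent_length() returns 1 even though the buffer holds two characters, which is why file names, environment variables and other char-based interfaces could not carry UTF-16 text.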