Latin characters, MultiByteToWideChar and UTF-8


Question

Hello,

I need to convert strings that I receive from an HTML page into wide characters (in an ISAPI extension).

Strangely, the code below works with Asian characters but not with accented Latin characters.
I suspect that the accented characters should have arrived encoded, but they have not.

const char * pString = "éàa";
int nSize = MultiByteToWideChar(CP_UTF8,0,pString,-1,NULL,0);
if ( nSize != 0 )
{
    WCHAR * pBuffer= new WCHAR[nSize];
    MultiByteToWideChar(CP_UTF8,0,pString,-1,pBuffer,nSize);
    // pBuffer holds 65533, 65533, 97: the 2 accents are not recognised
    delete [] pBuffer;
}





So what do I need to do to transform the string I receive into a proper Unicode wide-character string?
Thanks in advance,
Jerry

Recommended answer

Your input string pString is not in UTF-8 format but in an extended ASCII (8-bit code page) encoding. If you pass in legal UTF-8, I would expect your code to work.

Accented characters such as é and à are represented by 2-byte sequences in UTF-8. You might want to try inserting the proper UTF-8 sequences by using escapes.
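
To make the 2-byte sequences concrete, here is a minimal portable sketch. It assumes the incoming bytes are ISO-8859-1 (Latin-1), where é is the single byte 0xE9 and à is 0xE0; the Windows ANSI code page 1252 agrees with Latin-1 for these two characters but differs in the 0x80-0x9F range. The names Latin1ToUtf8, kAccentsUtf8 and CheckExample are made up for this illustration and are not part of any Windows API.

```cpp
#include <cassert>
#include <string>

// Re-encode a string of ISO-8859-1 (Latin-1) bytes as UTF-8.
// Assumption: the input really is Latin-1, so each byte value equals its
// Unicode code point. This is NOT a general code page converter.
std::string Latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char c : latin1)
    {
        if (c < 0x80)
        {
            utf8 += static_cast<char>(c);                 // ASCII: one byte, unchanged
        }
        else
        {
            utf8 += static_cast<char>(0xC0 | (c >> 6));   // lead byte: 110xxxxx
            utf8 += static_cast<char>(0x80 | (c & 0x3F)); // continuation: 10xxxxxx
        }
    }
    return utf8;
}

// "éàa" written with explicit UTF-8 escapes. The literal is split before "a"
// because a hex escape consumes every hex digit that follows it, so "\xA0a"
// would be parsed as one out-of-range escape instead of 0xA0 followed by 'a'.
const char kAccentsUtf8[] = "\xC3\xA9\xC3\xA0" "a";

// Usage example: the Latin-1 bytes E9 E0 61 become the UTF-8 bytes
// C3 A9 C3 A0 61.
void CheckExample()
{
    assert(Latin1ToUtf8("\xE9\xE0" "a") == kAccentsUtf8);
}
```

Passing a literal such as kAccentsUtf8 to MultiByteToWideChar with CP_UTF8 avoids any dependence on the encoding in which the source file happens to be saved.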


If you receive a string through HTML, the encoding is either specified in the page or determined by some rules (the same way a browser knows which encoding was used).

As far as I know, an HTML page usually specifies UTF-8. If not, the encoding is supposed to be ASCII, and accented characters should have been coded using HTML entities (for example &eacute;).

Nevertheless, I think that some browsers will assume the Windows ANSI (Western) code page if nothing is specified and there are some characters >= 128. You should not rely on this.

As mentioned in another solution, it is sometimes possible to guess using some rules, but whenever possible you should not rely on that. Instead, you should know which encoding was used in the source and which one you want for the target, and do the appropriate conversion.

Usually you should use UTF-8 encoding; if you are working on Windows, you might use UTF-16 too.
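
One of the "rules" for guessing is simply to test whether the bytes are well-formed UTF-8. The following is a portable sketch of such a check, written for illustration only; IsValidUtf8 is a hypothetical helper, not a Windows function, and its result should be treated as a hint rather than proof of the encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8: correct lead/continuation byte
// structure, no overlong forms, no surrogates, nothing above U+10FFFF.
bool IsValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra;    // number of continuation bytes expected
        unsigned long cp;     // code point being assembled
        unsigned long minCp;  // smallest code point allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; minCp = 0x10000; }
        else return false;    // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;       // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += extra + 1;
    }
    return true;
}

// Usage example with the strings from the question.
void CheckExamples()
{
    assert(IsValidUtf8("\xC3\xA9\xC3\xA0" "a"));  // "éàa" as real UTF-8
    assert(!IsValidUtf8("\xE9\xE0" "a"));         // the same text as Latin-1 bytes
}
```

The second case fails because 0xE9 announces a three-byte sequence while the next byte, 0xE0, is not a continuation byte, which is consistent with the replacement characters (65533) reported in the question. The check can also give false positives: some ANSI byte sequences happen to be valid UTF-8 (for example, the two Windows-1252 characters "Ã©" are the bytes C3 A9), which is why knowing the real source encoding is preferable to guessing.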


In the end I used the following code to make sure that I store proper UTF-8 strings from what I receive (it is for an ISAPI extension, hence UNICODE is not used):
#ifndef _UNICODE
// Returns strData unchanged if it is already valid UTF-8; otherwise treats it
// as text in the system ANSI code page (CP_ACP) and re-encodes it as UTF-8.
CString CStaticTools::MakeUTF8Compatible(const CString & strData)
{
	// MB_ERR_INVALID_CHARS makes the call fail (return 0) on malformed input.
	int nSize = MultiByteToWideChar(CP_UTF8,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize != 0 )
		return strData;

	nSize = MultiByteToWideChar(CP_ACP,MB_ERR_INVALID_CHARS,strData,-1,NULL,0);
	if ( nSize == 0 )
		return strData;

	// ANSI -> UTF-16
	WCHAR * pBuffer= new WCHAR[nSize];
	MultiByteToWideChar(CP_ACP,0,strData,-1,pBuffer,nSize);

	// UTF-16 -> UTF-8: first query the required size, then convert
	int nUtfSize = WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,NULL,0,NULL,NULL);
	if ( nUtfSize == 0)
	{
		delete [] pBuffer;
		return strData;
	}

	char * pDest = new char[nUtfSize];
	WideCharToMultiByte(CP_UTF8,0,pBuffer,-1,pDest,nUtfSize,NULL,NULL);

	CString strResult = pDest;

	// arrays allocated with new[] must be released with delete []
	delete [] pBuffer;
	delete [] pDest;

	return strResult;
}
#endif





Please tell me if I have missed something.
Thanks Philippe (merci!) and nv3!

