Validate Unicode String and Escape if Unicode is Invalid (C/C++)
Question
I have a program that reads arbitrary data from a file system and outputs results in Unicode. The problem I am having is that sometimes filenames are valid Unicode and sometimes they aren't. So I want a function that can validate a string (in C or C++) and tell me whether it is a valid UTF-8 encoding. If it is not, I want the invalid bytes escaped so that the result is a valid UTF-8 encoding. This is different from escaping for XML, which I also need to do. But first I need to be sure that the Unicode is right.
I've seen some code from which I could hack this, but I would rather use some working code if it exists.
Recommended answer
The following code is based on an IRI library I have been working on for a while. Section 3.2 ("Converting URIs to IRIs") of RFC 3987 deals with converting invalid UTF-8 octets to valid UTF-8.
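#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

// Decodes UTF-8 from Data into UTF-32 code points.
// Returns the number of code points decoded, or -1 on the first invalid sequence.
// If Buffer is NULL, the input is only validated and counted.
// If Eaten is not NULL, it receives the number of bytes consumed (it stays 0 on error).
int UTF8BufferToUTF32Buffer(char *Data, int DataLen, unsigned long *Buffer, int BufLen, int *Eaten)
{
    if( Eaten )
    {
        *Eaten = 0;
    }

    int Result = 0;

    unsigned char b, b2;
    unsigned char *ptr = (unsigned char*) Data;
    unsigned long uc;

    int i = 0;
    int seqlen;

    while( i < DataLen )
    {
        // stop once the output buffer is full
        if( (Buffer) && (!BufLen) )
            break;

        // the lead byte determines the sequence length
        b = ptr[i];

        if( (b & 0x80) == 0 )
        {
            uc = (unsigned long)(b & 0x7F);
            seqlen = 1;
        }
        else if( (b & 0xE0) == 0xC0 )
        {
            uc = (unsigned long)(b & 0x1F);
            seqlen = 2;
        }
        else if( (b & 0xF0) == 0xE0 )
        {
            uc = (unsigned long)(b & 0x0F);
            seqlen = 3;
        }
        else if( (b & 0xF8) == 0xF0 )
        {
            uc = (unsigned long)(b & 0x07);
            seqlen = 4;
        }
        else
        {
            // stray continuation byte, or a lead byte of 0xF8 and above
            return -1;
        }

        // truncated sequence at the end of the input
        if( (i+seqlen) > DataLen )
        {
            return -1;
        }

        // every trailing byte must be a continuation byte (10xxxxxx)
        for(int j = 1; j < seqlen; ++j)
        {
            b = ptr[i+j];

            if( (b & 0xC0) != 0x80 )
            {
                return -1;
            }
        }

        // reject overlong forms, surrogates, and values above U+10FFFF
        switch( seqlen )
        {
            case 2:
            {
                b = ptr[i];

                // 0xC0 and 0xC1 would only encode overlong forms
                if( !IS_IN_RANGE(b, 0xC2, 0xDF) )
                {
                    return -1;
                }

                break;
            }

            case 3:
            {
                b = ptr[i];
                b2 = ptr[i+1];

                // 0xE0 needs a second byte of 0xA0..0xBF (no overlong forms);
                // 0xED needs 0x80..0x9F (no surrogates U+D800..U+DFFF);
                // every other lead byte in 0xE1..0xEF accepts any continuation byte
                if( ((b == 0xE0) && !IS_IN_RANGE(b2, 0xA0, 0xBF)) ||
                    ((b == 0xED) && !IS_IN_RANGE(b2, 0x80, 0x9F)) )
                {
                    return -1;
                }

                break;
            }

            case 4:
            {
                b = ptr[i];
                b2 = ptr[i+1];

                // 0xF0 needs a second byte of 0x90..0xBF (no overlong forms);
                // 0xF4 needs 0x80..0x8F (nothing above U+10FFFF);
                // lead bytes 0xF5..0xF7 are never valid
                if( ((b == 0xF0) && !IS_IN_RANGE(b2, 0x90, 0xBF)) ||
                    ((b == 0xF4) && !IS_IN_RANGE(b2, 0x80, 0x8F)) ||
                    !IS_IN_RANGE(b, 0xF0, 0xF4) )
                {
                    return -1;
                }

                break;
            }
        }

        // fold the payload bits of the continuation bytes into the code point
        for(int j = 1; j < seqlen; ++j)
        {
            uc = ((uc << 6) | (unsigned long)(ptr[i+j] & 0x3F));
        }

        if( Buffer )
        {
            *Buffer++ = uc;
            --BufLen;
        }

        ++Result;
        i += seqlen;
    }

    if( Eaten )
    {
        *Eaten = i;
    }

    return Result;
}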
{
    std::string filename = "...";
    unsigned long ch;
    int eaten;

    std::string::size_type i = 0;
    while( i < filename.length() )
    {
        if( UTF8BufferToUTF32Buffer(&filename[i], filename.length()-i, &ch, 1, &eaten) == 1 )
        {
            i += eaten;
        }
        else
        {
            // replace the character at filename[i] with your chosen
            // escaping, and then increment i by the number of
            // characters used...
        }
    }
}
In your case, all you have to do is decide what kind of escaping you want to use. URIs/IRIs use percent-encoding ("%NN", where "NN" is the 2-digit hex value of an octet).
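As a rough illustration of the percent-encoding option, here is a minimal, self-contained sketch. It is not part of the IRI library mentioned above: `ValidUtf8Length` and `EscapeInvalidUtf8` are hypothetical names, and the validator is a compact stand-in that I believe applies the same well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF). It copies valid sequences through unchanged and turns each invalid byte into "%NN".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length (1..4) of the well-formed UTF-8
// sequence starting at s[i], or 0 if the bytes there are not valid UTF-8.
static std::size_t ValidUtf8Length(const std::string &s, std::size_t i)
{
    const unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t len;
    unsigned char lo = 0x80, hi = 0xBF;     // allowed range for the second byte

    if (b < 0x80) {
        return 1;                           // ASCII
    } else if (b >= 0xC2 && b <= 0xDF) {
        len = 2;
    } else if (b >= 0xE0 && b <= 0xEF) {
        len = 3;
        if (b == 0xE0) lo = 0xA0;           // reject overlong forms
        if (b == 0xED) hi = 0x9F;           // reject surrogates
    } else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;           // reject overlong forms
        if (b == 0xF4) hi = 0x8F;           // reject values above U+10FFFF
    } else {
        return 0;                           // stray continuation or invalid lead byte
    }

    if (i + len > s.size()) {
        return 0;                           // truncated sequence
    }
    for (std::size_t j = 1; j < len; ++j) {
        const unsigned char c = static_cast<unsigned char>(s[i + j]);
        if (c < lo || c > hi) {
            return 0;
        }
        lo = 0x80;                          // later bytes use the plain range
        hi = 0xBF;
    }
    return len;
}

// Hypothetical helper: copies valid UTF-8 through unchanged and replaces
// each invalid byte with "%NN" (uppercase hex), so the result is valid UTF-8.
std::string EscapeInvalidUtf8(const std::string &in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    std::size_t i = 0;

    while (i < in.size()) {
        const std::size_t len = ValidUtf8Length(in, i);
        if (len > 0) {
            out.append(in, i, len);
            i += len;
        } else {
            const unsigned char b = static_cast<unsigned char>(in[i]);
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            ++i;                            // always advance, so the loop terminates
        }
    }
    return out;
}
```

Note that this sketch leaves a literal '%' in the input untouched, so the output cannot be decoded unambiguously. If you need to reverse the escaping later, one option is to also escape '%' itself as "%25".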