Convert a multi-byte unicode byte array into an NSString, using a partial buffer

Problem description

In Objective-C, is there a way to convert a multi-byte Unicode byte array into an NSString that lets the conversion succeed even if the array data is a partial buffer (one that does not end on a complete character boundary)?

The use case is receiving byte buffers from a stream and wanting to parse the string version of the current buffer while more data is still to come, so the buffer may end in the middle of a multi-byte Unicode character.

NSString's initWithData:encoding: method does not work for this purpose, as shown here...

Test code:

    - (void)test {
        char myArray[] = {'f', 'o', 'o', (char) 0xc3, (char) 0x97, 'b', 'a', 'r'};
        size_t sizeOfMyArray = sizeof(myArray);
        [self dump:myArray sizeOfMyArray:sizeOfMyArray];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 1];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 2];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 3];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 4];
        [self dump:myArray sizeOfMyArray:sizeOfMyArray - 5];
    }

    - (void)dump:(char[])myArray sizeOfMyArray:(size_t)sourceLength {
        NSData *data = [NSData dataWithBytes:myArray length:sourceLength];
        NSString *string = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        // string.length is an NSUInteger (a count of UTF-16 code units), so cast it for %lu.
        NSLog(@"sourceLength: %zu bytes, string.length: %lu bytes, string :'%@'", sourceLength, (unsigned long)string.length, string);
    }

Output:

sourceLength: 8 bytes, string.length: 7 bytes, string :'foo×bar'
sourceLength: 7 bytes, string.length: 6 bytes, string :'foo×ba'
sourceLength: 6 bytes, string.length: 5 bytes, string :'foo×b'
sourceLength: 5 bytes, string.length: 4 bytes, string :'foo×'
sourceLength: 4 bytes, string.length: 0 bytes, string :'(null)'
sourceLength: 3 bytes, string.length: 3 bytes, string :'foo'

As can be seen, converting the "sourceLength: 4 bytes" byte array fails and returns (null). This is because the UTF-8 encoded '×' character (0xc3 0x97) is only partially included: the buffer ends after its first byte, 0xc3.
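To make the failure concrete, here is a minimal sketch in plain C (which also compiles inside an Objective-C file) of how a two-byte UTF-8 sequence is decoded. The function name `decode_utf8_two_bytes` is hypothetical and not part of the question. It shows that the lead byte 0xc3 only carries the high bits of the code point, so without the continuation byte 0x97 there is nothing to decode.

```c
#include <assert.h>

/* Decode a two-byte UTF-8 sequence (lead byte 0xc2..0xdf followed by one
   continuation byte 0x80..0xbf) into its Unicode code point. */
unsigned decode_utf8_two_bytes(unsigned char lead, unsigned char cont)
{
    assert(lead >= 0xc2 && lead <= 0xdf);
    assert(cont >= 0x80 && cont <= 0xbf);
    /* Bit layout: 110xxxxx 10yyyyyy -> xxxxxyyyyyy */
    return ((unsigned)(lead & 0x1f) << 6) | (unsigned)(cont & 0x3f);
}
```

For the bytes in the test array, `decode_utf8_two_bytes(0xc3, 0x97)` yields 0xd7, which is U+00D7, the '×' character.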

Ideally there would be a function that I can use that would return the correct NSString, and tell me how many bytes are "left over".

Solution

I had this problem before and forgot about it for a while. This was an opportunity to solve it. The code below is based on information from the UTF-8 page on Wikipedia. It is a category on NSData.

It checks the data from the end, and only the last four bytes, because the OP said that it can be gigabytes of data. Otherwise, with UTF-8 it's simpler to run through the bytes from the beginning.
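For comparison, the "run through the bytes from the beginning" alternative could look roughly like the following plain C sketch. It is not the answer's code: the names `utf8_sequence_length` and `utf8_complete_prefix_forward` are hypothetical, and like the category it assumes the input is otherwise valid UTF-8 and does not check continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes a UTF-8 sequence occupies, judged
   from its first byte. Returns 1 for ASCII and for any byte that cannot
   start a sequence, so the scan always advances. */
size_t utf8_sequence_length(unsigned char lead)
{
    if (lead >= 0xc2 && lead <= 0xdf) return 2;
    if (lead >= 0xe0 && lead <= 0xef) return 3;
    if (lead >= 0xf0 && lead <= 0xf4) return 4;
    return 1;
}

/* Walk the buffer from the start and return how many leading bytes form
   complete characters. The cost grows with the buffer size, which is why
   the answer inspects only the last four bytes instead. */
size_t utf8_complete_prefix_forward(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        size_t n = utf8_sequence_length(bytes[i]);
        if (n > len - i) {
            break;              /* the last character is cut off */
        }
        i += n;
    }
    return i;
}
```

The number of "left over" bytes is then `len` minus the returned value.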

    /*
     Return the range of valid UTF-8 encoded text by
     removing a partial trailing multi-byte character.
     It assumes that all the bytes are valid UTF-8,
     e.g. it doesn't raise a flag if a continuation byte is preceded
     by a single-byte character.
     */
    - (NSRange)rangeOfUTF8WithoutPartialTrailingMultibytes
    {
        NSRange validRange = {0, 0};

        // A UTF-8 character is at most four bytes long, so only the tail matters.
        NSUInteger trailLength = MIN([self length], 4U);
        unsigned char trail[4];
        [self getBytes:trail
                 range:NSMakeRange([self length] - trailLength, trailLength)];

        // Bytes of the last (possibly partial) multi-byte character seen so far.
        unsigned multibyteCount = 0;

        for (NSInteger i = (NSInteger)trailLength - 1; i >= 0; i--) {
            if (isUTF8SingleByte(trail[i])) {
                // ASCII byte: the valid text ends right after it.
                validRange = NSMakeRange(0, [self length] - trailLength + i + 1);
                break;
            }

            if (isUTF8ContinuationByte(trail[i])) {
                multibyteCount++;
                continue;
            }

            if (isUTF8StartByte(trail[i])) {
                multibyteCount++;
                if (multibyteCount == lengthForUTF8StartByte(trail[i])) {
                    // The trailing character is complete: keep everything.
                    validRange = NSMakeRange(0, [self length] - trailLength + i + multibyteCount);
                }
                else {
                    // The trailing character is partial: stop before its start byte.
                    validRange = NSMakeRange(0, [self length] - trailLength + i);
                }
                break;
            }
        }
        return validRange;
    }

Here are the static functions used in the method:

    static BOOL isUTF8SingleByte(const unsigned char c)
    {
        return c <= 0x7f;
    }

    static BOOL isUTF8ContinuationByte(const unsigned char c)
    {
        return (c >= 0x80) && (c <= 0xbf);
    }

    static BOOL isUTF8StartByte(const unsigned char c)
    {
        return (c >= 0xc2) && (c <= 0xf4);
    }

    static BOOL isUTF8InvalidByte(const unsigned char c)
    {
        return (c == 0xc0) || (c == 0xc1) || (c > 0xf4);
    }

    static unsigned lengthForUTF8StartByte(const unsigned char c)
    {
        if ((c >= 0xc2) && (c <= 0xdf)) {
            return 2;
        }
        else if ((c >= 0xe0) && (c <= 0xef)) {
            return 3;
        }
        else if ((c >= 0xf0) && (c <= 0xf4)) {
            return 4;
        }
        return 1;
    }
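To connect this back to the streaming use case in the question, the following plain C sketch restates the backward check on a raw byte buffer and carries the leftover bytes over to the next chunk. It is an illustration under assumptions, not part of the answer: `utf8_complete_length`, `Utf8Carry` and `utf8_feed` are hypothetical names, and the input is assumed to be valid UTF-8 apart from being cut at chunk boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Inspect at most the last four bytes and return the length of the prefix
   that ends on a character boundary (same logic as the category method). */
size_t utf8_complete_length(const unsigned char *bytes, size_t len)
{
    size_t tail = len < 4 ? len : 4;
    size_t count = 0;               /* bytes of the trailing sequence so far */
    for (size_t k = 1; k <= tail; k++) {
        size_t pos = len - k;
        unsigned char c = bytes[pos];
        if (c <= 0x7f) {
            return pos + 1;         /* ASCII: boundary is right after it */
        }
        if (c >= 0x80 && c <= 0xbf) {
            count++;                /* continuation byte: keep looking */
            continue;
        }
        if (c >= 0xc2 && c <= 0xf4) {   /* start byte */
            size_t need = c <= 0xdf ? 2 : (c <= 0xef ? 3 : 4);
            count++;
            return count == need ? len : pos;
        }
    }
    return 0;                       /* no boundary found in the tail */
}

/* Hypothetical carry-over state: up to 3 bytes of an unfinished character. */
typedef struct {
    unsigned char pending[3];
    size_t pending_len;
} Utf8Carry;

/* Prepend the carried-over bytes to the new chunk, leave the decodable part
   in out (capacity of at least chunk_len + 3 bytes), and keep the rest for
   the next call. Returns the number of decodable bytes in out. */
size_t utf8_feed(Utf8Carry *carry, const unsigned char *chunk,
                 size_t chunk_len, unsigned char *out)
{
    assert(carry->pending_len <= sizeof carry->pending);
    size_t total = carry->pending_len + chunk_len;
    memcpy(out, carry->pending, carry->pending_len);
    memcpy(out + carry->pending_len, chunk, chunk_len);

    size_t complete = utf8_complete_length(out, total);
    size_t leftover = total - complete;
    if (leftover > sizeof carry->pending) {
        /* Not a truncated character but malformed input: pass everything
           through and let the string decoder report the error. */
        complete = total;
        leftover = 0;
    }
    memcpy(carry->pending, out + complete, leftover);
    carry->pending_len = leftover;
    return complete;
}
```

In Objective-C, the first `complete` bytes of `out` would be what gets passed to initWithData:encoding:, while `carry` holds the bytes the question calls "left over" until the next buffer arrives.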
