Decoding partial UTF-8 into NSString

Question

When fetching a UTF-8-encoded file over the network with the NSURLConnection class, there's a good chance the delegate's connection:didReceiveData: message will deliver an NSData that cuts the UTF-8 stream in the middle of a character: UTF-8 is a multi-byte encoding scheme, so a single character can be split across two separate NSData objects.

In other words, if I join all the data I get from connection:didReceiveData: I will have a valid UTF-8 file, but each separate chunk is not necessarily valid UTF-8 on its own.

I do not want to keep the entire downloaded file in memory.

What I want is: given an NSData, decode whatever can be decoded into an NSString. If the last few bytes of the NSData are an incomplete multi-byte sequence, tell me, so I can save them for the next NSData.

One obvious solution is repeatedly trying to decode using initWithData:encoding:, each time truncating the last byte, until success. This, unfortunately, can be very wasteful.
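
To make the cost concrete, here is a minimal sketch of that truncate-and-retry loop in plain C (which also compiles as Objective-C). The names utf8_is_well_formed and naive_decodable_length are hypothetical, and the validity check is a simplified structural one standing in for initWithData:encoding:; it does not reject overlong forms or surrogates.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified structural check (an assumption, not a full validator):
 * every lead byte must be followed by the right number of
 * continuation bytes. */
static bool utf8_is_well_formed(const unsigned char *p, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char b = p[i];
        size_t need;                       /* continuation bytes expected */
        if ((b & 0x80) == 0x00)      need = 0;   /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0) need = 1;   /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) need = 2;   /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) need = 3;   /* 11110xxx */
        else return false;                 /* stray continuation or bad lead */
        if (len - i - 1 < need) return false;    /* sequence cut short */
        for (size_t k = 1; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
        }
        i += need + 1;
    }
    return true;
}

/* Naive approach: drop one byte from the end until the rest validates.
 * *attempts reports how many full scans were needed. */
static size_t naive_decodable_length(const unsigned char *p, size_t len,
                                     size_t *attempts) {
    size_t n = len;
    *attempts = 0;
    for (;;) {
        (*attempts)++;
        if (utf8_is_well_formed(p, n)) return n;  /* n == 0 always passes */
        n--;
    }
}
```

For a buffer holding 'a' followed by the first two bytes of a three-byte character, this scans the buffer three times before settling on a length of 1. Every attempt rescans from the start, so a large chunk is scanned repeatedly just to trim a tail of at most 3 bytes.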

Recommended answer

If you want to make sure that you don't stop in the middle of a UTF-8 multi-byte sequence, you're going to need to look at the last byte of the byte array and check its top 2 bits:

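  1. If the top bit is 0, it is a single-byte ASCII character, and you're done.
  2. If the top bit is 1 and the second bit is 0, it is a continuation byte of a multi-byte sequence and may be the last byte of that sequence, so you need to buffer it for later and then look at the preceding byte.
  3. If the top bit is 1 and the second bit is also 1, it is the start of a multi-byte sequence, and you need to determine how many bytes the sequence has by looking for the first 0 bit.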

Look at the multi-byte table in the Wikipedia entry: http://en.wikipedia.org/wiki/UTF-8
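
The three cases can be written down as a small classifier. This is a sketch in plain C (usable from Objective-C as-is); the type and function names are made up for illustration. It looks only at the bit patterns, so it does not reject lead bytes that are never valid, such as 0xC0, 0xC1, or 0xF5 and above. Modern UTF-8 (RFC 3629) stops at 4-byte sequences, so longer patterns are reported as length 0 here.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    UTF8_ASCII,         /* 0xxxxxxx: a complete one-byte character   */
    UTF8_CONTINUATION,  /* 10xxxxxx: second or later byte of a sequence */
    UTF8_LEAD           /* 11xxxxxx: first byte of a multi-byte sequence */
} utf8_byte_kind;

/* Classify a byte by its top two bits. */
static utf8_byte_kind utf8_classify(unsigned char b) {
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    return UTF8_LEAD;
}

/* Total length in bytes of the sequence started by `lead`,
 * found from the position of the first 0 bit.
 * Returns 0 if `lead` cannot start a sequence of 1 to 4 bytes. */
static size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid lead */
}
```

Note the parentheses around each mask: in C and Objective-C, == binds tighter than &, so an unparenthesized `b & 0x80 == 0` is evaluated as `b & (0x80 == 0)`.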

// assumes that receivedData contains both the leftovers and the new data

const unsigned char *data = [receivedData bytes];
NSUInteger byteCount = [receivedData length];

if (byteCount<1)
    return nil;  // or @"";

unsigned char lastByte = data[byteCount-1];
if ((lastByte & 0x80) == 0) {
    // the last byte is plain ASCII, so the buffer ends on a character boundary
    NSString *newString = [[NSString alloc] initWithBytes: data length: byteCount 
                                             encoding: NSUTF8StringEncoding];
    // verify success
    // remove bytes from mutable receivedData, or set overflow to empty
    return newString;
}

// now eat all of the continuation bytes
NSUInteger backCount = 0;
while ((byteCount > 0) && ((lastByte & 0xc0) == 0x80)) {
    backCount++;
    byteCount--;
    if (byteCount > 0) {
        // only read the previous byte if there is one
        lastByte = data[byteCount-1];
    }
}
// at this point, either we have exhausted byteCount or we have the initial character
// if we exhaust the byte count we're probably in an illegal sequence, as we should 
// always have the initial character in the receivedData

if (byteCount<1) {
    // error!
    return nil;
}

// at this point, you can either use just byteCount, or you can compute the 
// length of the sequence from the lastByte in order
// to determine if you have exactly the right number of characters to decode UTF-8.

NSUInteger requiredBytes = 0;  // continuation bytes that must follow the lead byte
if ((lastByte & 0xe0) == 0xc0) {  // 110xxxxx
    // 2 byte sequence
    requiredBytes = 1;
} else if ((lastByte & 0xf0) == 0xe0) {   // 1110xxxx
    // 3 byte sequence
    requiredBytes = 2;
} else if ((lastByte & 0xf8) == 0xf0) {   // 11110xxx
    // 4 byte sequence
    requiredBytes = 3;
} else if ((lastByte & 0xfc) == 0xf8) {   // 111110xx
    // 5 byte sequence (legacy form, not valid in modern UTF-8)
    requiredBytes = 4;
} else if ((lastByte & 0xfe) == 0xfc) {   // 1111110x
    // 6 byte sequence (legacy form, not valid in modern UTF-8)
    requiredBytes = 5;
} else {
    // shouldn't happen, illegal UTF8 seq
}

 // now we know how many continuation bytes we need (requiredBytes) and how many
 // we have (backCount), so either use them, or take the
 // lead byte away.
 if (requiredBytes==backCount) {
     // we have the right number of bytes
     byteCount += backCount;
 } else { 
     // we don't have the right number of bytes, so remove the intro character 
     byteCount -= 1;   
 }

 NSString *newString = [[NSString alloc] initWithBytes: data length: byteCount 
                                          encoding: NSUTF8StringEncoding];
 // verify success
 // remove byteCount bytes from mutable receivedData, or set overflow to the 
 // bytes between byteCount and [receivedData length]
 return newString;
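
The same idea can be packaged as one plain C helper that only reports where the last complete character ends, leaving the decoding to NSString. This is a sketch under stated assumptions rather than the answer's exact code: the name utf8_complete_prefix_length is hypothetical, it handles only the 1 to 4 byte forms of modern UTF-8, and it never tries to repair malformed input (in that case it returns the full length and lets the decoder report the failure).

```c
#include <assert.h>
#include <stddef.h>

/* Returns how many leading bytes of `data` can be decoded now without
 * cutting a multi-byte sequence in half. The remaining
 * (len - result) bytes, at most 3, should be kept for the next chunk. */
static size_t utf8_complete_prefix_length(const unsigned char *data,
                                          size_t len) {
    assert(data != NULL || len == 0);

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    size_t i = len;
    size_t continuation = 0;
    while (i > 0 && continuation < 3 && (data[i - 1] & 0xC0) == 0x80) {
        i--;
        continuation++;
    }
    if (i == 0) {
        return len;  /* empty, or no lead byte in sight: nothing to hold back */
    }

    /* data[i - 1] is the candidate lead byte. */
    unsigned char lead = data[i - 1];
    size_t needed;   /* continuation bytes this lead byte requires */
    if ((lead & 0xE0) == 0xC0)      needed = 1;  /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) needed = 2;  /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) needed = 3;  /* 11110xxx */
    else return len;  /* ASCII or malformed: not a truncated sequence */

    if (continuation >= needed) {
        return len;   /* the final sequence is complete */
    }
    return i - 1;     /* hold back the lead byte and what follows it */
}
```

From Objective-C, one would pass `[receivedData bytes]` and `[receivedData length]`, decode the returned number of bytes with initWithBytes:length:encoding:, and carry the remaining bytes over to the front of the next chunk. The scan touches at most four bytes per chunk, regardless of chunk size.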
