Guess encoding when creating an NSString from NSData

Question

When reading an NSString from a file I can use initWithContentsOfFile:usedEncoding:error: and it will guess the encoding of the file.

When I create it from an NSData though my only option is initWithData:encoding: where I have to explicitly pass the encoding. How can I reliably guess the encoding when I work with NSData instead of files?

Recommended answer

In general, you can’t. However, you can quite reliably identify UTF-8 files – if a file is valid UTF-8, it’s not very likely that it’s supposed to be any other encoding (except if all the bytes are in the ASCII range, in which case any "extended ASCII" encoding, including UTF-8, will give you the same result). All Unicode encodings also have an optional BOM which identifies them. So a reasonable approach would be:


  • Look for a valid BOM. If there is one, use the appropriate encoding.
  • Otherwise, try to interpret it as UTF-8. You can do this by calling initWithData:encoding: with NSUTF8StringEncoding and checking if the result is non-nil.
  • If that fails, use a default 8-bit encoding, such as +[NSString defaultCStringEncoding] (which provides a locale-appropriate guess).
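
The three steps above can be sketched roughly as follows. This is a minimal illustration and not a definitive implementation: GuessedStringFromData is a hypothetical helper name (not a Foundation API), the code assumes ARC, and the BOM table covers only UTF-8, UTF-16 and UTF-32.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): applies the BOM / UTF-8 /
// default-encoding steps and reports the encoding it settled on.
static NSString *GuessedStringFromData(NSData *data, NSStringEncoding *outEncoding) {
    const uint8_t *b = data.bytes;
    NSUInteger n = data.length;
    NSStringEncoding enc = 0;   // 0 means "no BOM found"
    NSUInteger bomLength = 0;

    // Step 1: look for a BOM. UTF-32 is checked first because the UTF-32LE
    // BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE). Note that
    // FF FE 00 00 could also be UTF-16LE text beginning with U+0000.
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) {
        enc = NSUTF32BigEndianStringEncoding; bomLength = 4;
    } else if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) {
        enc = NSUTF32LittleEndianStringEncoding; bomLength = 4;
    } else if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        enc = NSUTF8StringEncoding; bomLength = 3;
    } else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        enc = NSUTF16BigEndianStringEncoding; bomLength = 2;
    } else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        enc = NSUTF16LittleEndianStringEncoding; bomLength = 2;
    }

    if (enc != 0) {
        // Drop the BOM so it does not end up as a character in the string.
        NSData *body = [data subdataWithRange:NSMakeRange(bomLength, n - bomLength)];
        NSString *s = [[NSString alloc] initWithData:body encoding:enc];
        if (s != nil) {
            if (outEncoding) *outEncoding = enc;
            return s;
        }
    }

    // Step 2: no usable BOM, so try UTF-8; nil means the bytes are not valid UTF-8.
    NSString *utf8 = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
    if (utf8 != nil) {
        if (outEncoding) *outEncoding = NSUTF8StringEncoding;
        return utf8;
    }

    // Step 3: fall back to the locale-dependent default encoding.
    // This can still return nil if that encoding rejects the bytes.
    NSStringEncoding fallback = [NSString defaultCStringEncoding];
    NSString *s = [[NSString alloc] initWithData:data encoding:fallback];
    if (s != nil && outEncoding) *outEncoding = fallback;
    return s;
}
```

A caller would pass the NSData and the address of an NSStringEncoding variable, then check the returned string for nil before trusting the reported encoding.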

It is possible to try to improve the guess in the last step by trying various different encodings and choosing the one which has fewest sequences of letters with junk in the middle, where "junk" is any character that’s not a letter, space or common punctuation mark. This would significantly increase complexity while not actually being reliable.
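
As a rough illustration of that heuristic, the sketch below decodes the data with a few candidate encodings and keeps the one with the lowest "junk" score. JunkScore and BestGuessEncoding are hypothetical names, the candidate list is an arbitrary assumption, and the scoring is simplified to counting junk characters that sit directly between two letters. Because most high bytes map to letters in 8-bit encodings, ties are common, which is part of why this is not reliable.

```objc
#import <Foundation/Foundation.h>

// Hypothetical scoring helper: counts characters that are not letters,
// whitespace or punctuation and that sit directly between two letters.
static NSUInteger JunkScore(NSString *s) {
    NSMutableCharacterSet *ok = [NSMutableCharacterSet letterCharacterSet];
    [ok formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
    // Broader than "common punctuation"; used here for simplicity.
    [ok formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];
    NSCharacterSet *letters = [NSCharacterSet letterCharacterSet];

    NSUInteger score = 0;
    for (NSUInteger i = 1; i + 1 < s.length; i++) {
        if (![ok characterIsMember:[s characterAtIndex:i]] &&
            [letters characterIsMember:[s characterAtIndex:i - 1]] &&
            [letters characterIsMember:[s characterAtIndex:i + 1]]) {
            score++;
        }
    }
    return score;
}

// Hypothetical driver: returns the candidate with the lowest score,
// or 0 if none of the candidates could decode the data.
static NSStringEncoding BestGuessEncoding(NSData *data) {
    // Arbitrary example candidates; a real list would depend on the locale.
    const NSStringEncoding candidates[] = {
        NSWindowsCP1252StringEncoding,
        NSMacOSRomanStringEncoding,
        NSISOLatin2StringEncoding,
    };
    NSStringEncoding best = 0;
    NSUInteger bestScore = NSUIntegerMax;
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        NSString *s = [[NSString alloc] initWithData:data encoding:candidates[i]];
        if (s == nil) continue;             // this encoding rejected the bytes
        NSUInteger score = JunkScore(s);
        if (score < bestScore) {            // earlier candidates win ties
            bestScore = score;
            best = candidates[i];
        }
    }
    return best;
}
```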

In short, to be able to handle all available encodings you need to do what TextEdit does: shunt the decision over to the user.

Oh, one more thing: as of Mac OS X 10.5, the encoding is often stored with a file in the undocumented com.apple.TextEncoding extended attribute. If you open a file with +[NSString stringWithContentsOfFile:] or similar, this will automatically be used if present.
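
For the curious, the raw attribute can be inspected with getxattr from <sys/xattr.h>. The sketch below is macOS-only, uses a hypothetical helper name (RawTextEncodingAttribute), and makes no attempt to parse the value: the attribute is undocumented, so its format should be treated as an unknown, and reading it as UTF-8 text is itself an assumption.

```objc
#import <Foundation/Foundation.h>
#include <sys/xattr.h>

// Hypothetical helper: returns the raw com.apple.TextEncoding value as text,
// or nil if the attribute is missing or cannot be read.
static NSString *RawTextEncodingAttribute(NSString *path) {
    const char *fsPath = path.fileSystemRepresentation;
    const char *name = "com.apple.TextEncoding";

    // First call with a NULL buffer asks only for the attribute's size.
    ssize_t size = getxattr(fsPath, name, NULL, 0, 0, 0);
    if (size <= 0) return nil;

    NSMutableData *buffer = [NSMutableData dataWithLength:(NSUInteger)size];
    ssize_t got = getxattr(fsPath, name, buffer.mutableBytes, buffer.length, 0, 0);
    if (got < 0) return nil;
    buffer.length = (NSUInteger)got;

    // Assumption: the value is readable as UTF-8 text.
    return [[NSString alloc] initWithData:buffer encoding:NSUTF8StringEncoding];
}
```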
