Is there any reasonable way to access the contents of a CharacterSet?

Question

For a random string generator, I thought it would be nice to use CharacterSet as the input type for the alphabet, since the pre-defined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain more diverse characters than you'd expect).

However, apparently you can only query character sets for membership, but not enumerate them, let alone index them. All we get is _.bitmapRepresentation, an 8 KB chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. That means the format is somewhat obscure, and, of course, it's not documented.

Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.

Am I missing something here, or is CharacterSet just another dead end in the Foundation API?

Recommended answer

By your definition, no, there is no "reasonable" way. That's just how NSCharacterSet stores it. It's optimized for testing membership, not enumerating all members.

Your loop can increment a counter over the codepoints, or it can shift the bits (one per codepoint), but either way you have to loop and test. The highest "Ll" character on my Mac is U+1D7CB (#120,779), so if you want to compute this list of characters at runtime, your code will have to loop at least that many times. See the Objective-C version of the documentation for details on how the bit vector is organized.
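Both variants of that loop can be sketched roughly as follows. The function names scalars(in:) and bmpScalars(in:) are made up for this illustration and are not part of Foundation. The bitmap variant assumes the layout given in the NSCharacterSet documentation (bit n & 7 of byte n >> 3 stands for code point n) and, to keep the sketch short, reads only the first 8192 bytes, which cover U+0000 through U+FFFF; anything stored after that for higher planes is ignored here.

```swift
import Foundation

/// Hypothetical helper: counts over every code point and tests membership.
func scalars(in set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for value in UInt32(0)...UInt32(0x10FFFF) {
        // Unicode.Scalar(_:) returns nil for surrogate code points, which are skipped.
        guard let scalar = Unicode.Scalar(value), set.contains(scalar) else { continue }
        result.append(scalar)
    }
    return result
}

/// Hypothetical helper: walks the raw bitmap instead, one bit per code point.
/// Assumption: only the first 8192 bytes (plane 0) are inspected.
func bmpScalars(in set: CharacterSet) -> [Unicode.Scalar] {
    let bitmap = [UInt8](set.bitmapRepresentation.prefix(8192))
    var result: [Unicode.Scalar] = []
    for (byteIndex, byte) in bitmap.enumerated() where byte != 0 {
        for bit in 0..<8 {
            // Least significant bit first: bit (n & 7) of byte (n >> 3).
            let mask = UInt8(1) << UInt8(bit)
            guard byte & mask != 0 else { continue }
            if let scalar = Unicode.Scalar(UInt32(byteIndex * 8 + bit)) {
                result.append(scalar)
            }
        }
    }
    return result
}

// Usage: a three-letter set still costs a full pass over the code space.
let letters = scalars(in: CharacterSet(charactersIn: "abc"))
print(String(letters.map { Character($0) }))  // prints "abc"
```

Either way the work is proportional to the size of the code space rather than to the size of the set, which is the cost the question objects to.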

The good news is that this is fast. With unoptimized code on my 10-year-old Mac, it takes less than 1/10th of a second to find all 1,841 lowercaseLetters. If that's still not fast enough, it's easy to hide the cost by doing it once, in the background, at startup time.
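One way to pay that cost only once, sketched under the assumption that the alphabet is fixed for the lifetime of the process: enumerate the set into an array a single time and let the random string generator from the question draw from the array. Alphabet, Alphabets, and warmUpAlphabets are hypothetical names for this example, not Foundation API.

```swift
import Foundation

/// Hypothetical wrapper: enumerates a CharacterSet once and keeps the members.
struct Alphabet {
    let characters: [Character]

    init(_ set: CharacterSet) {
        var members: [Character] = []
        for value in UInt32(0)...UInt32(0x10FFFF) {
            // Surrogate code points yield nil and are skipped.
            guard let scalar = Unicode.Scalar(value), set.contains(scalar) else { continue }
            members.append(Character(scalar))
        }
        characters = members
    }

    /// Draws `length` characters uniformly at random; empty alphabets give "".
    func randomString(length: Int) -> String {
        guard !characters.isEmpty else { return "" }
        return String((0..<length).map { _ in characters.randomElement()! })
    }
}

enum Alphabets {
    // A static stored property is initialized lazily, exactly once,
    // so the enumeration runs on first access only.
    static let lowercase = Alphabet(.lowercaseLetters)
}

/// Touches the cached alphabet on a background queue so the first real
/// use does not have to wait for the enumeration.
func warmUpAlphabets() {
    DispatchQueue.global(qos: .utility).async {
        _ = Alphabets.lowercase
    }
}
```

Calling warmUpAlphabets() early at launch and later Alphabets.lowercase.randomString(length: 12) would be the intended use; indexing and randomElement() on the cached array are then constant-time.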
