Mapping of character encodings to maximum bytes per character

Question

I'm looking for a table that maps a given character encoding to the (maximum, in the case of variable length encodings) bytes per character. For fixed-width encodings this is easy enough, though I don't know, in the case of some of the more esoteric encodings, what that width is. For UTF-8 and the like it would also be nice to determine the maximum bytes per character depending on the highest codepoint in a string, but this is less pressing.
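For illustration only, here is a hand-built sketch of the kind of table I mean, limited to a few encodings whose worst-case widths I'm sure of (this is not something I already have):

MAX_BYTES_PER_CHAR = {
    'ascii': 1,    # fixed-width, single byte
    'latin-1': 1,  # fixed-width, single byte
    'utf-8': 4,    # 1-4 bytes depending on the codepoint
    'utf-16': 4,   # 2 bytes, or 4 for a surrogate pair
    'utf-32': 4,   # fixed-width, 4 bytes
}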

For some background (which you can ignore if you're not familiar with Numpy): I'm working on a prototype for an ndarray subclass that can, with some transparency, represent arrays of encoded bytes (including plain ASCII) as arrays of unicode strings without actually converting the entire array to UCS4 at once. The idea is that the underlying dtype is still an S<N> dtype, where <N> is the (maximum) number of bytes per string in the array, but item lookups and string methods decode the strings on the fly using the correct encoding. A very rough prototype can be seen here, though eventually parts of this will likely be implemented in C. The most important thing for my use case is efficient use of memory, while repeated decoding and re-encoding of strings is acceptable overhead.
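To sketch the idea, a heavily simplified, hypothetical illustration (the class name and details are not the actual prototype):

import numpy as np

class EncodedStringArray(np.ndarray):
    # Hypothetical sketch: keep the raw bytes in an S<N> dtype and decode
    # individual items on access, instead of converting the array to UCS4.

    def __new__(cls, data, encoding='utf-8'):
        obj = np.asarray(data, dtype=bytes).view(cls)
        obj.encoding = encoding
        return obj

    def __array_finalize__(self, obj):
        # Propagate the encoding through views and slices.
        self.encoding = getattr(obj, 'encoding', 'utf-8')

    def __getitem__(self, index):
        item = super().__getitem__(index)
        if isinstance(item, bytes):
            # Decode a single element on the fly with the stored encoding.
            return item.decode(self.encoding)
        return item

>>> a = EncodedStringArray(['año'.encode('utf-8'), b'year'])
>>> a.dtype
dtype('S4')
>>> a[0]
'año'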

Anyway, because the underlying dtype is in bytes, it does not tell users anything useful about the lengths of strings that can be written to a given encoded text array. So having such a map for arbitrary encodings would be very useful for improving the user interface, if nothing else.
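As a hypothetical illustration of what the interface could do with such a map: given the dtype's byte width and the worst-case bytes per character for the encoding, it could at least report a guaranteed character capacity.

def guaranteed_chars(itemsize, max_bytes_per_char):
    # Worst case: every character takes the maximum number of bytes,
    # so this many characters are always guaranteed to fit.
    return itemsize // max_bytes_per_char

>>> guaranteed_chars(10, 4)  # an S10 array of UTF-8 text
2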

Note: I found an answer to basically the same question that is specific to Java here: How can I programmatically determine the maximum size in bytes of a character in a specific charset? However, I haven't been able to find any equivalent in Python, nor a useful database of information from which I could implement my own.

Answer

The brute-force approach. Iterate over all possible Unicode characters and track the greatest number of bytes used.

def max_bytes_per_char(encoding):
    """Brute-force the largest number of bytes a single character can
    occupy in the given encoding."""
    max_bytes = 0
    for codepoint in range(0x110000):  # every Unicode codepoint
        try:
            encoded = chr(codepoint).encode(encoding)
            max_bytes = max(max_bytes, len(encoded))
        except UnicodeError:
            # Skip codepoints that this encoding cannot represent.
            pass
    return max_bytes


>>> max_bytes_per_char('UTF-8')
4
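
For the codepoint-dependent part of the question, the same brute-force idea extends naturally. A possible sketch (my own addition, not part of the original answer):

def max_bytes_per_char_upto(encoding, highest_codepoint):
    # Same brute force, restricted to codepoints no higher than the
    # largest one actually present in the string.
    max_bytes = 0
    for codepoint in range(highest_codepoint + 1):
        try:
            max_bytes = max(max_bytes, len(chr(codepoint).encode(encoding)))
        except UnicodeError:
            pass
    return max_bytes


>>> max_bytes_per_char_upto('UTF-8', max(map(ord, 'Grüße')))
2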
