Split unicode string into 300 byte chunks without destroying characters

Problem description

I want to split u"an arbitrary unicode string" into chunks of say 300 bytes without destroying any characters. The strings will be written to a socket that expects utf8 using unicode_string.encode("utf8"). I don't want to destroy any characters. How would I do this?

Recommended answer

UTF-8 is designed for this.

def split_utf8(s, n):
    """Split UTF-8 encoded bytes s into chunks of at most n bytes."""
    while len(s) > n:
        k = n
        # Bytes of the form 0b10xxxxxx are UTF-8 continuation bytes;
        # back up until k points at the first byte of a character.
        while (s[k] & 0xc0) == 0x80:  # on a Python 2 str, use ord(s[k])
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s

Not tested. But you find a place to split, then backtrack until you reach the beginning of a character.
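A quick check of that idea (the generator is repeated here so the snippet runs standalone; the 5-byte limit stands in for 300):

```python
def split_utf8(s, n):
    """Split UTF-8 encoded bytes s into chunks of at most n bytes."""
    while len(s) > n:
        k = n
        # Back up past UTF-8 continuation bytes (0b10xxxxxx) so we
        # never cut in the middle of a multi-byte character.
        while (s[k] & 0xc0) == 0x80:
            k -= 1
        yield s[:k]
        s = s[k:]
    yield s

data = "héllo wörld".encode("utf8")  # é and ö each encode to 2 bytes
chunks = list(split_utf8(data, 5))
# Each chunk is at most 5 bytes, decodes cleanly, and the chunks
# concatenate back to the original bytes.
assert all(len(c) <= 5 for c in chunks)
assert b"".join(chunks) == data
for c in chunks:
    c.decode("utf8")  # would raise UnicodeDecodeError if a char were split
```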

However, if a user might ever want to see an individual chunk, you may want to split on grapheme cluster boundaries instead. This is significantly more complicated, but not intractable. For example, in "é", you might not want to split apart the "e" and the "´". Or you might not care, as long as they get stuck together again in the end.
