Python truncating international string


Question

I've been trying to debug this for far too long and I clearly don't understand what is going on, so hopefully someone can help. I'm not even sure what I should be asking, but here goes:

I'm trying to send Apple Push Notifications, which have a payload size limit of 256 bytes. After subtracting some overhead, I'm left with about 100 English characters for the main message content.

So if a message is longer than the max, I truncate it:

MAX_PUSH_LENGTH = 100
body = (body[:MAX_PUSH_LENGTH]) if len(body) > MAX_PUSH_LENGTH else body

That works fine: no matter how long the message is (in English), the push notification sends successfully. However, now I have an Arabic string:

str = "هيك بنكون 
عيش بجنون تون تون تون هيك بنكون 
عيش بجنون تون تون تون 
أوكي أ"

>>> print len(str)
109

So that should be truncated. But I always get an invalid payload size error! Curious, I kept lowering the MAX_PUSH_LENGTH threshold to see what it would take to succeed, and it wasn't until I set the limit to around 60 that the push notification went through.

I'm not sure whether this has something to do with the byte size of languages other than English. My understanding is that an English character takes one byte, so does an Arabic character take 2 bytes? Might that have something to do with it?

Also, the string is JSON-encoded before it is sent off, so it ends up looking something like this: \u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646 \n\u0639\u064a\u0634 ... Could it be that it is being interpreted as a raw string, so that just u0647 is 5 bytes?

What should I be doing here? Are there any obvious errors, or am I not asking the right question?

Recommended answer

You need to cut to a length in bytes, so first .encode('utf-8') your string and then cut it at a code point boundary.

In UTF-8, ASCII characters (<= 127) take 1 byte. Bytes with the two or more most significant bits set (>= 192) are character-starting bytes; the number of leading set bits gives the total length of the sequence in bytes, so the number of bytes that follow is one less. Everything else (128 to 191) is a continuation byte.
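To make the three byte classes concrete, here is a small Python 3 sketch that labels each byte of an encoded string. The helper name `classify_byte` is made up for this illustration and is not part of the answer's code.

```python
def classify_byte(b):
    """Label one UTF-8 byte (an int in 0..255) by its high bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx starts a multi-byte sequence


text = "a\u0647"  # Latin 'a' followed by ARABIC LETTER HEH (U+0647)
encoded = text.encode("utf-8")
labels = [classify_byte(b) for b in encoded]
print(len(text), len(encoded), labels)
```

The string has 2 characters but 3 bytes: one ASCII byte, then a lead byte and a continuation byte for the Arabic letter.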

A problem arises if you cut in the middle of a multi-byte sequence: if a character does not fit entirely, it should be dropped completely, back to and including its starting byte.
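As a side note, a shorter way to get the same effect in Python 3 is to slice the encoded bytes and let the decoder drop whatever incomplete sequence is left at the end. This is an alternative sketch, not the approach the answer takes, and `truncate_utf8` is a hypothetical name. It relies on the input being freshly encoded, so the only bytes that can be invalid are the ones at the cut.

```python
def truncate_utf8(text, byte_limit):
    """Return the longest prefix of text whose UTF-8 form fits in byte_limit bytes."""
    clipped = text.encode("utf-8")[:byte_limit]
    # errors="ignore" discards the incomplete sequence (if any) left at the end
    return clipped.decode("utf-8", errors="ignore")


sample = "\u0647\u064a\u0643"  # three Arabic letters, 2 bytes each in UTF-8
print(len(truncate_utf8(sample, 5)))  # 5 bytes hold two whole letters
```

Like the byte-walking approach, this cuts at code point boundaries only; it does not keep combining marks together with their base letter.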

Here is some working code:

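# (lead-byte mask, total code point length in bytes), longest prefix first:
# a 3-byte lead such as 0xE0 also matches the 2-byte mask 0xC0, so the
# more specific masks have to be tested before the shorter ones.
# UTF-8 as produced by .encode() never uses more than 4 bytes.
LENGTH_BY_PREFIX = [
    (0xF0, 4),
    (0xE0, 3),
    (0xC0, 2),
]

def codepoint_length(first_byte):
    """Return the byte length of the sequence that starts with first_byte (an int)."""
    if first_byte < 128:
        return 1  # ASCII
    for mask, length in LENGTH_BY_PREFIX:
        if first_byte & mask == mask:
            return length
    # 0x80..0xBF is a continuation byte and cannot start a code point
    raise ValueError('Invalid byte %r' % first_byte)

def cut_to_bytes_length(unicode_text, byte_limit):
    """Encode to UTF-8 and cut to at most byte_limit bytes without splitting a code point."""
    # bytearray yields ints when indexed, on both Python 2 and Python 3
    utf8_bytes = bytearray(unicode_text.encode('UTF-8'))
    cut_index = 0
    while cut_index < len(utf8_bytes):
        step = codepoint_length(utf8_bytes[cut_index])
        if cut_index + step > byte_limit:
            # can't go a whole code point further, time to cut
            return bytes(utf8_bytes[:cut_index])
        cut_index += step
    # the limit is at least as long as our byte string, so no cutting
    return bytes(utf8_bytes)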

Now test. If .decode() succeeds, we have made a correct cut.

unicode_text = u"هيك بنكون" # note that the literal here is Unicode

print(cut_to_bytes_length(unicode_text, 100).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 10).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 5).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 4).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 3).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 2).decode('UTF-8'))

# This prints an empty string, because an Arabic letter
# requires 2 bytes to represent in UTF-8.
print(cut_to_bytes_length(unicode_text, 1).decode('UTF-8'))

You can check that the code works with ASCII as well.
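On the JSON part of the question: how many bytes an Arabic letter costs in the final payload depends on how the JSON is serialized. The Python 3 sketch below only measures one letter under the standard `json` module's two modes. Which mode the asker's push library uses is not stated in the question, so treat this as an illustration rather than a diagnosis.

```python
import json

letter = "\u0647"  # ARABIC LETTER HEH

# Raw UTF-8 size of the letter itself.
utf8_size = len(letter.encode("utf-8"))
# json.dumps escapes non-ASCII by default: six characters for the \uXXXX
# escape plus the two surrounding quotes.
escaped_size = len(json.dumps(letter).encode("utf-8"))
# With ensure_ascii=False the letter is emitted as-is inside the quotes.
raw_size = len(json.dumps(letter, ensure_ascii=False).encode("utf-8"))

print(utf8_size, escaped_size, raw_size)
```

Either way, the safest check is to measure the fully serialized payload in bytes before sending, rather than the character count of the message body.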
