Python truncating international string

Question

I've been trying to debug this for far too long, and I obviously have no idea what I'm doing, so hopefully someone can help. I'm not even sure what I should be asking, but here it goes:

I'm trying to send Apple Push Notifications, and they have a payload size limit of 256 bytes. So subtract some overhead stuff, and I'm left with about 100 English characters of main message content.

So if a message is longer than the max, I truncate it:

MAX_PUSH_LENGTH = 100
body = (body[:MAX_PUSH_LENGTH]) if len(body) > MAX_PUSH_LENGTH else body

So that's fine and dandy, and no matter how long a message I have (in English), the push notification sends successfully. However, now I have an Arabic string:

str = "هيك بنكون 
عيش بجنون تون تون تون هيك بنكون 
عيش بجنون تون تون تون 
أوكي أ"

>>> print len(str)
109

So that should truncate. But, I always get an invalid payload size error! Curious, I kept lowering the MAX_PUSH_LENGTH threshold to see what it would take for it to succeed, and it's not until I set the limit to around 60 that the push notification succeeded.

I'm not exactly sure if this has something to do with the byte size of languages other than english. It is my understanding that an English character takes one byte, so does an Arabic character take 2 bytes? Might this have something to do with it?

Also, the string is JSON encoded before it is sent off, so it ends up looking something like this: \u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646 \n\u0639\u064a\u0634 ... Could it be that it is being interpreted as a raw string, and just u0647 is 5 bytes?

What should I be doing here? Are there any obvious errors or am I not asking the right question?

Recommended answer

You need to cut by byte length, not character count: first .encode('utf-8') your string, then cut the resulting bytes at a code point boundary.
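
To see why a limit of 100 characters is not a limit of 100 bytes, compare the two counts for the opening words of the Arabic message. This is only a minimal sketch; the names `sample`, `char_count` and `byte_count` are illustrative and not part of the original code:

```python
# The first two words of the message, written with escapes so the
# source file stays pure ASCII (same code points as in the question).
sample = u"\u0647\u064a\u0643 \u0628\u0646\u0643\u0648\u0646"

char_count = len(sample)                   # counts code points
byte_count = len(sample.encode("utf-8"))   # counts UTF-8 bytes

print(char_count)  # 9
print(byte_count)  # 17: eight Arabic letters at 2 bytes each, plus 1 space
```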

In UTF-8, ASCII characters (byte values <= 127) take 1 byte each. A byte with its two or more most significant bits set (>= 192) starts a multi-byte character, and the number of leading set bits gives the total length of that sequence in bytes. Everything else (128 to 191) is a continuation byte.
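
The three byte classes can be sketched as a small helper. `classify_utf8_byte` is a hypothetical name introduced here for illustration, and it deliberately ignores the fact that values 0xF8 and above never occur in valid UTF-8:

```python
def classify_utf8_byte(byte_value):
    """Name the role of one UTF-8 byte, given as an integer 0..255."""
    if byte_value < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte_value < 0xC0:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx, starts a multi-byte character

# 'a' followed by the Arabic letter U+0647, which encodes as 0xD9 0x87.
# bytearray yields integers on both Python 2 and Python 3.
encoded = bytearray(u"a\u0647".encode("utf-8"))
roles = [classify_utf8_byte(b) for b in encoded]
print(roles)  # ['ascii', 'lead', 'continuation']
```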

Cutting in the middle of a multi-byte sequence leaves bytes that cannot be decoded; if a character does not fit within the limit, it should be dropped entirely, starting byte included.
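
As an aside, a compact alternative to walking the bytes by hand is to slice the encoded bytes and let the decoder discard a dangling partial sequence. This is a different technique from the byte walk used in this answer, and `truncate_utf8` is a hypothetical helper name. It relies on the input having just been produced by `.encode()`, so the only invalid bytes can be at the cut:

```python
def truncate_utf8(text, byte_limit):
    """Return text cut so that its UTF-8 encoding fits in byte_limit bytes."""
    clipped = text.encode("utf-8")[:byte_limit]
    # errors="ignore" drops a multi-byte sequence that was cut off at the end
    return clipped.decode("utf-8", "ignore")

sample = u"\u0647\u064a\u0643"   # three Arabic letters, 6 bytes in UTF-8
shortened = truncate_utf8(sample, 5)
print(len(shortened))                   # 2
print(len(shortened.encode("utf-8")))   # 4
```

Unlike the function that returns bytes, this sketch returns text, so encode it again before measuring or sending.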

Here's some working code:

# (mask, expected value, total sequence length) for UTF-8 lead bytes.
# Each entry checks the exact bit pattern, so a 3-byte lead such as 0xE0
# is not mistaken for a 2-byte lead. Valid UTF-8 stops at 4 bytes.
LENGTH_BY_PREFIX = [
    (0xE0, 0xC0, 2),  # 110xxxxx
    (0xF0, 0xE0, 3),  # 1110xxxx
    (0xF8, 0xF0, 4),  # 11110xxx
]

def codepoint_length(first_byte):
    if first_byte < 128:
        return 1  # ASCII
    for mask, value, length in LENGTH_BY_PREFIX:
        if (first_byte & mask) == value:
            return length
    raise ValueError('Invalid UTF-8 lead byte %r' % first_byte)

def cut_to_bytes_length(unicode_text, byte_limit):
    utf8_bytes = unicode_text.encode('UTF-8')
    # bytearray yields integers on both Python 2 and Python 3
    byte_values = bytearray(utf8_bytes)
    cut_index = 0
    while cut_index < len(byte_values):
        step = codepoint_length(byte_values[cut_index])
        if cut_index + step > byte_limit:
            # can't go a whole code point further, time to cut
            return utf8_bytes[:cut_index]
        cut_index += step
    # the limit is at least as long as our byte string, so no cutting
    return utf8_bytes

Now test. If .decode() succeeds, we have made a correct cut.

unicode_text = u"هيك بنكون" # note that the literal here is Unicode

# print(...) with a single argument works on both Python 2 and Python 3
print(cut_to_bytes_length(unicode_text, 100).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 10).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 5).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 4).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 3).decode('UTF-8'))
print(cut_to_bytes_length(unicode_text, 2).decode('UTF-8'))

# This returns an empty string, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print(cut_to_bytes_length(unicode_text, 1).decode('UTF-8'))

You can test that the code works with ASCII as well.
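
On the JSON point raised in the question: Python's `json.dumps` escapes non-ASCII characters as `\uXXXX` by default, and each such escape is 6 bytes of ASCII rather than 2 bytes of UTF-8. The sketch below only measures the two serialized forms. Whether the payload limit is checked against the escaped form depends on how the notification library serializes it, which is an assumption here:

```python
import json

text = u"\u0647\u064a\u0643"   # three Arabic letters

escaped = json.dumps(text)                    # ensure_ascii=True is the default
raw = json.dumps(text, ensure_ascii=False)    # keeps the letters unescaped

escaped_size = len(escaped.encode("utf-8"))
raw_size = len(raw.encode("utf-8"))

print(escaped_size)  # 20: 3 escapes of 6 bytes each, plus 2 quote characters
print(raw_size)      # 8: 3 letters of 2 bytes each, plus 2 quote characters
```

If the escaped form is what gets sent, the byte budget has to be measured against the serialized payload rather than against the UTF-8 bytes of the message alone.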
