How to remove the last UTF-8 character of a Python string

Question

I have a string containing UTF-8 encoded text. I need to remove the last UTF-8 character.

So far I have:

msg = msg[:-1]

but this only removes the last byte. It works as long as the last character is ASCII, but not when the last character is a multibyte character.

Recommended answer

The simplest way is to decode your UTF-8 bytes to Unicode text and slice off the last character:

without_last = msg.decode('utf8')[:-1]

You can always encode it again if you need UTF-8 bytes.
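As a minimal sketch of the full round trip, assuming Python 3 and that `msg` is a `bytes` object holding valid UTF-8 (the helper name `drop_last_char` is hypothetical, not part of the original answer):

```python
def drop_last_char(msg: bytes) -> bytes:
    """Remove the last code point from UTF-8 encoded bytes."""
    # bytes -> str; raises UnicodeDecodeError if msg is not valid UTF-8
    text = msg.decode("utf-8")
    # Drop one code point, then encode back to UTF-8 bytes
    return text[:-1].encode("utf-8")


print(drop_last_char("caf\u00e9".encode("utf-8")))  # b'caf'
```

Note that this removes one Unicode code point. A character built from a base letter plus combining marks spans several code points, so only the final mark would be removed in that case.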

Alternatively, you can search backwards for a UTF-8 start byte. A UTF-8 sequence always starts with a byte whose most significant bit is 0 (a single-byte ASCII character) or whose two most significant bits are 11 (the lead byte of a multibyte sequence), while continuation bytes always start with the bits 10:

# Find the start byte of the last code point (msg is a bytes object).
pos = len(msg) - 1
while pos > 0 and msg[pos] & 0xC0 == 0x80:
    # Byte at pos is a continuation byte (bit 7 set, bit 6 clear).
    pos -= 1
# On Python 2, where msg is a str, use ord(msg[pos]) in the test above.
msg = msg[:pos]
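The byte scan can also be wrapped in a reusable function. The sketch below assumes Python 3, where indexing a `bytes` object yields an integer, and uses a hypothetical name, `strip_last_codepoint`. It does not validate the input, so malformed UTF-8 is trimmed on a best-effort basis:

```python
def strip_last_codepoint(msg: bytes) -> bytes:
    """Drop the final UTF-8 code point without decoding the whole string."""
    if not msg:
        return msg
    pos = len(msg) - 1
    # Walk back over continuation bytes (bit pattern 10xxxxxx).
    while pos > 0 and msg[pos] & 0xC0 == 0x80:
        pos -= 1
    # pos now indexes the start byte of the last code point.
    return msg[:pos]


print(strip_last_codepoint("a\u20ac".encode("utf-8")))  # b'a'
```

Unlike the decode approach, this only inspects the last few bytes, which may matter for very long strings.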
