unicode, bytes redux


Question



(beating a dead horse)

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"

u = buf.decode('UTF-8')

# ... later ...

u.bytes()  # -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)
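For reference, a minimal Python 2 sketch of what is possible today, without the proposed u.bytes() method: re-encode and take len(), at the cost of building the encoded string in memory.

# Status quo in Python 2: re-encode and count the bytes.
buf = "\xE2\x9C\x8C"            # UTF-8 bytes for U+270C
u = buf.decode('UTF-8')

print len(u)                    # 1 code point
print len(u.encode('UTF-8'))    # 3 bytes in UTF-8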

Solution

willie <wi****@jamots.com> wrote:

Is it too ridiculous to suggest that it'd be nice
if the unicode object were to remember the
encoding of the string it was decoded from?
So that it's feasible to calculate the number
of bytes that make up the unicode code points.

So what sort of output do you expect from this:

>>> a = '\xc9'.decode('latin1')
>>> b = '\xc3\x89'.decode('utf8')
>>> print (a+b).bytes()

???

And if you say that's an unfair question because you expected all the byte
strings to be using the same encoding then there's no point storing it on
every unicode object; you might as well store it once globally.
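To make the objection concrete, here is a runnable Python 2 illustration: a and b come from byte strings of different lengths and encodings, yet the decoded unicode objects are identical, so (a + b) has no single source encoding for a hypothetical .bytes() to report.

a = '\xc9'.decode('latin1')          # 1 source byte  -> u'\xc9' (U+00C9)
b = '\xc3\x89'.decode('utf8')        # 2 source bytes -> u'\xc9' (same code point)

print a == b                         # True: the source encoding is lost
print len((a + b).encode('latin1'))  # 2 bytes if you pick latin1
print len((a + b).encode('utf8'))    # 4 bytes if you pick utf8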


willie <wi****@jamots.com> writes:

# U+270C
# 11100010 10011100 10001100
buf = "\xE2\x9C\x8C"
u = buf.decode('UTF-8')
# ... later ...
u.bytes()  # -> 3

(goes through each code point and calculates
the number of bytes that make up the character
according to the encoding)

Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8')  # -> 3
u.bytes('UCS-4')  # -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.
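A hedged sketch of that proposal as a free function (bytes_len and utf8_width are hypothetical helpers, not real unicode methods, and it assumes a wide Python 2 build where each character is one code point). A fixed-width encoding like UCS-4 needs no scan at all; UTF-8 needs one pass over the code points, with no encoded string materialized:

def utf8_width(cp):
    # Number of bytes a single code point occupies in UTF-8.
    if cp < 0x80:
        return 1
    elif cp < 0x800:
        return 2
    elif cp < 0x10000:
        return 3
    return 4

def bytes_len(u, encoding):
    # Hypothetical stand-in for the proposed u.bytes(encoding).
    if encoding == 'UCS-4':
        return 4 * len(u)                          # fixed width: no scan
    if encoding == 'UTF-8':
        return sum(utf8_width(ord(c)) for c in u)  # one pass, no new string
    return len(u.encode(encoding))                 # fallback: encode and count

u = "\xE2\x9C\x8C".decode('UTF-8')
print bytes_len(u, 'UTF-8')   # 3
print bytes_len(u, 'UCS-4')   # 4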


Paul Rubin wrote:

Duncan Booth explains why that doesn't work. But I don't see any big
problem with a byte count function that lets you specify an encoding:

u = buf.decode('UTF-8')
# ... later ...
u.bytes('UTF-8')  # -> 3
u.bytes('UCS-4')  # -> 4

That avoids creating a new encoded string in memory, and for some
encodings, avoids having to scan the unicode string to add up the
lengths.

It requires a fairly large change to code and API for a relatively
uncommon problem. How often do you need to know how many bytes an
encoded Unicode string takes up without needing the encoded string itself?

