Some questions about decode/encode


Problem description


I use Chinese characters as an example here.

>>> s1 = '你好吗'
>>> repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>> b1 = s1.decode('GBK')

My first question is: what strategy does 'decode' use to tell how
to separate the words? I mean, since s1 is a multi-byte-character string,
how does it determine whether to split the string every 2 bytes or every 1 byte?

My second question is: has anyone tested decoding a very long mbcs
string? Yesterday I tried to decode a long (20+ MB) XML file, and the result
was very strange and caused SAX to fail to parse the decoded string.
However, when I used another text editor to convert the file to UTF-8,
SAX parsed the content successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?

Solution

glacier <ro*******@gmail.com> writes:

> I use Chinese characters as an example here.
>
> >>> s1 = '你好吗'
> >>> repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>> b1 = s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to tell how
> to separate the words? I mean, since s1 is a multi-byte-character string,
> how does it determine whether to split the string every 2 bytes or every 1 byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.

--
\ "When I get new information, I change my position. What, sir, |
`\ do you do with new information?" -- John Maynard Keynes |
_o__) |
Ben Finney


Ben Finney <bi****************@benfinney.id.au> writes:

> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney
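
As an illustration of how a codec can tell one-byte from two-byte characters: in GBK, a byte below 0x80 is a single ASCII character, while a byte in the range 0x81-0xFE is a lead byte that starts a two-byte character. The following is a minimal Python 3 sketch (the thread itself used Python 2.5). `split_gbk` is a hypothetical helper written for this example, and its rule is simplified: the real codec also validates the trail byte.

```python
# Python 3 sketch; the thread used Python 2.5, where s1 was a byte string.
data = b"\xc4\xe3\xba\xc3\xc2\xf0"  # the bytes shown by repr(s1) in the question


def split_gbk(raw):
    """Split GBK bytes into per-character byte sequences (simplified rule)."""
    units = []
    i = 0
    while i < len(raw):
        # A byte in 0x81-0xFE is a lead byte and starts a two-byte character;
        # anything else is treated here as a single-byte character.
        width = 2 if 0x81 <= raw[i] <= 0xFE else 1
        units.append(raw[i:i + width])
        i += width
    return units


units = split_gbk(data)
text = data.decode("gbk")
print(units)
print(len(data), "bytes ->", len(text), "characters")
```

The decoder never needs to know about "words": it reads the byte stream from the start, and each byte's value determines how many bytes the current character occupies.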


On Jan 24, at 1:41, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:

> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
Given your reply, what will happen if I try to decode a long
string in separate pieces?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?
Thanks :)
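
A short experiment suggests the answer is yes: 1023 is an odd chunk size, so chunk boundaries fall in the middle of two-byte characters, and every slice decoded on its own starts or ends with half a character. The sketch below is Python 3 and uses the `gbk` codec as a stand-in for the Windows-only `mbcs` codec, which is an assumption about the poster's system locale. It compares slice-by-slice decoding with an incremental decoder from the standard `codecs` module, which buffers an incomplete character until the next chunk arrives. The function names are hypothetical.

```python
import codecs

# Python 3 sketch; "gbk" stands in for the Windows-only "mbcs" codec (assumption).
a = b"\xc4\xe3\xba\xc3\xc2\xf0" * 100000  # 600000 bytes, two bytes per character
CHUNK = 1023  # odd size, so boundaries land in the middle of characters


def decode_naive(raw, size):
    """Decode each slice independently; undecodable bytes become U+FFFD."""
    return "".join(
        raw[i:i + size].decode("gbk", errors="replace")
        for i in range(0, len(raw), size)
    )


def decode_incremental(raw, size):
    """Feed slices to a single incremental decoder that keeps state between calls."""
    decoder = codecs.getincrementaldecoder("gbk")()
    parts = [decoder.decode(raw[i:i + size]) for i in range(0, len(raw), size)]
    parts.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
    return "".join(parts)


expected = a.decode("gbk")
print("naive matches:", decode_naive(a, CHUNK) == expected)
print("incremental matches:", decode_incremental(a, CHUNK) == expected)
```

With the default `errors="strict"`, the naive version would instead raise `UnicodeDecodeError` at the first slice that ends in an incomplete character. Choosing a chunk size that happens to be even only hides the problem for input made entirely of two-byte characters; a single ASCII byte in the stream shifts the alignment again, so an incremental decoder is the more robust choice.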


