Some questions about decode/encode
I use Chinese characters as an example here.

>>> s1 = '你好吗'
>>> repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>> b1 = s1.decode('GBK')

My first question is: what strategy does 'decode' use to decide how to
separate the words? I mean, since s1 is a multi-byte-character string,
how does it determine whether to split the string every 2 bytes or every 1 byte?

My second question is: has anyone tested a very long mbcs decode? I
tried to decode a long (20+ MB) XML file yesterday, and the result was
very strange and caused SAX to fail to parse the decoded string.
However, when I used another text editor to convert the file to UTF-8,
SAX parsed the content successfully.

I'm not sure whether some special byte sequence or the sheer length of the
text caused this problem. Or maybe that's a bug in Python 2.5?
glacier <ro*******@gmail.com> writes:

> I use Chinese characters as an example here.
>
> >>> s1 = '你好吗'
> >>> repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>> b1 = s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to decide how to
> separate the words? I mean, since s1 is a multi-byte-character string,
> how does it determine whether to split the string every 2 bytes or every 1 byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.
--
\ "When I get new information, I change my position. What, sir, |
`\ do you do with new information?" -- John Maynard Keynes |
_o__) |
Ben Finney
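
As an illustration of the character-to-byte mapping described above, here is a minimal sketch of how a GBK decoder can tell one-byte characters from two-byte ones. It is written for Python 3 (the thread itself uses Python 2.5), and `split_gbk_chars` is a hypothetical helper, not part of any library. It assumes that bytes below 0x80 stand alone and that a byte in the range 0x81..0xFE is a lead byte paired with the byte that follows. It does not validate trail bytes, which a real codec would do.

```python
def split_gbk_chars(data):
    """Split GBK-encoded bytes into per-character byte groups (illustrative only)."""
    groups = []
    i = 0
    while i < len(data):
        # A lead byte in 0x81..0xFE announces a two-byte character;
        # anything else is treated here as a single-byte character.
        width = 2 if 0x81 <= data[i] <= 0xFE else 1
        groups.append(data[i:i + width])
        i += width
    return groups


raw = b"\xc4\xe3\xba\xc3\xc2\xf0"   # the bytes shown by repr(s1) in the question
mixed = "a你b".encode("gbk")         # ASCII and Chinese characters mixed

print(split_gbk_chars(raw))
print([group.decode("gbk") for group in split_gbk_chars(raw)])
print(split_gbk_chars(mixed))
```

So the decoder does not pick a fixed width for the whole string; it decides character by character from the value of the byte it is looking at.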
Ben Finney <bi****************@benfinney.id.au> writes:

> glacier <ro*******@gmail.com> writes:
>
> > I use Chinese characters as an example here.
> >
> > >>> s1 = '你好吗'
> > >>> repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>> b1 = s1.decode('GBK')
> >
> > My first question is: what strategy does 'decode' use to decide how
> > to separate the words? I mean, since s1 is a multi-byte-character
> > string, how does it determine whether to split the string every
> > 2 bytes or every 1 byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.
--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney
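
To make "algorithmic" concrete, the sketch below uses UTF-8 rather than GBK. That is a substitution made purely for illustration, because UTF-8's rule is compact enough to write out in full. The bytes are computed from the code point with bit arithmetic instead of being looked up in a table. `utf8_encode_code_point` is a hypothetical helper written for Python 3; it skips the surrogate and range checks a real codec performs.

```python
def utf8_encode_code_point(cp):
    """Encode a single Unicode code point as UTF-8 using bit arithmetic."""
    if cp < 0x80:
        # 7 bits: one byte, 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare the hand-written rule with the built-in codec.
for ch in "a\u00e9\u4f60\U0001F600":
    print(repr(ch), utf8_encode_code_point(ord(ch)) == ch.encode("utf-8"))
```

The mapping is still exact in both directions; it is simply expressed as a rule rather than as a lookup table.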
On Jan 24, 1:41, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:

> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.

Thanks for your response :)
When I mentioned 'word' in the previous post, I meant character.
Given your reply, what will happen if I decode a long string in
separate pieces? I mean:

######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    # Size of the next chunk: at most 1023 bytes
    d = min(len(a) - cur, 1023)
    # Decode this chunk on its own and append the result
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?

Thanks :)
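
The question can be tried out directly. Below is a minimal sketch in Python 3 that compares decoding each chunk on its own with decoding through an incremental decoder, which carries an incomplete character over to the next chunk. It assumes `gbk` in place of `mbcs`, because `mbcs` exists only on Windows and depends on the system code page. The function names `decode_naively` and `decode_incrementally` are hypothetical.

```python
import codecs


def decode_naively(data, encoding, chunk_size):
    """Decode every chunk independently; a boundary may fall inside a character."""
    return "".join(
        data[i:i + chunk_size].decode(encoding, errors="replace")
        for i in range(0, len(data), chunk_size)
    )


def decode_incrementally(data, encoding, chunk_size):
    """Decode chunk by chunk, keeping incomplete characters for the next chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for i in range(0, len(data), chunk_size):
        parts.append(decoder.decode(data[i:i + chunk_size]))
    # Flush the decoder; with the default strict error handling this raises
    # if the input ended in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "你好吗" * 1000
data = text.encode("gbk")   # every character here takes two bytes

print(decode_naively(data, "gbk", 1023) == text)
print(decode_incrementally(data, "gbk", 1023) == text)
```

Because 1023 is odd and every character in this sample is two bytes long, the first chunk already ends in the middle of a character, so the chunk-by-chunk version does not reproduce the original text. The incremental decoder does, whatever chunk size is used.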