Some questions about decode/encode

Views: 86 Posted: 2019/6/5 9:09:42 python


Problem description

I use Chinese characters as an example here.

>>> s1 = '你好吗'
>>> repr(s1)
"'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
>>> b1 = s1.decode('GBK')

My first question is: what strategy does 'decode' use to tell how to
separate the words? I mean, since s1 is a multi-byte-character string,
how did it determine whether to split the string every 2 bytes or every 1 byte?

My second question is: has anyone tested decoding a very long mbcs
string? Yesterday I tried to decode a long (20+ MB) XML file; the result
was very strange and caused SAX to fail to parse the decoded string.
However, when I used another text editor to convert the file to UTF-8,
SAX parsed the content successfully.

I'm not sure whether some special byte sequence or the sheer length of
the text caused this problem. Or maybe that's a bug in Python 2.5?

Solution

glacier <ro*******@gmail.com> writes:

> I use Chinese characters as an example here.
>
> >>> s1 = '你好吗'
> >>> repr(s1)
> "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> >>> b1 = s1.decode('GBK')
>
> My first question is: what strategy does 'decode' use to tell how to
> separate the words? I mean, since s1 is a multi-byte-character string,
> how did it determine whether to split the string every 2 bytes or every 1 byte?

The codec you specified ("GBK") is, like any character-encoding codec,
a precise mapping between characters and bytes. It's almost certainly
not aware of "words", only character-to-byte mappings.

--
\ "When I get new information, I change my position. What, sir, |
`\ do you do with new information?" -- John Maynard Keynes |
_o__) |
Ben Finney
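
As a rough illustration of the "precise mapping" described in the reply above, the following sketch decodes the six bytes from repr(s1) and shows which bytes each resulting character maps back to. It is written for Python 3, where bytes.decode plays the role that str.decode played in the Python 2.5 session of the original post; the variable names raw, text, per_char and mixed are made up for this example.

```python
# Python 3 sketch; the thread itself uses Python 2, where str.decode does this job.
raw = b"\xc4\xe3\xba\xc3\xc2\xf0"    # the six bytes shown by repr(s1) in the post

text = raw.decode("gbk")              # bytes -> characters
print(len(raw), len(text))            # six bytes become three characters

# Each character maps back to exactly the bytes it was decoded from.
per_char = [(ch, ch.encode("gbk")) for ch in text]
for ch, encoded in per_char:
    print("U+%04X" % ord(ch), encoded)

# ASCII characters stay one byte wide in GBK, so widths can be mixed in one string.
mixed = ("a" + text[0] + "b").encode("gbk")
print(mixed, len(mixed))
```

The codec works purely at the level of characters and bytes: nothing in the round trip depends on where words begin or end.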

Ben Finney <bi****************@benfinney.id.au> writes:

> glacier <ro*******@gmail.com> writes:
>
> > I use Chinese characters as an example here.
> >
> > >>> s1 = '你好吗'
> > >>> repr(s1)
> > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > >>> b1 = s1.decode('GBK')
> >
> > My first question is: what strategy does 'decode' use to tell how
> > to separate the words? I mean, since s1 is a multi-byte-character
> > string, how did it determine whether to split the string every
> > 2 bytes or every 1 byte?
>
> The codec you specified ("GBK") is, like any character-encoding
> codec, a precise mapping between characters and bytes. It's almost
> certainly not aware of "words", only character-to-byte mappings.

To be clear, I should point out that I didn't mean to imply static
tabular mappings only. The mappings in a character encoding are often
more complex and algorithmic.

That doesn't make them any less precise, of course; and the core point
is that a character-mapping codec is *only* about getting between
characters and bytes, nothing else.

--
\ "He who laughs last, thinks slowest." -- Anonymous |
`\ |
_o__) |
Ben Finney
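
To make the "algorithmic" part concrete for the original 1-byte-versus-2-byte question: in GBK, bytes 0x00-0x7F are single-byte (ASCII) characters, while a lead byte in the range 0x81-0xFE announces a two-byte character. The sketch below is a simplified, hand-written splitter based on that rule. The function name split_gbk is hypothetical, the code does not validate trail bytes, and it is not how CPython's codec is actually implemented; it is only meant to show how a decoder can tell character boundaries from the bytes alone (Python 3).

```python
def split_gbk(raw):
    """Split GBK-encoded bytes into one byte group per character.

    Simplified illustration: only the lead byte is inspected and trail
    bytes are not validated.
    """
    groups = []
    i = 0
    while i < len(raw):
        # 0x00-0x7F: single-byte character (ASCII).
        # 0x81-0xFE: lead byte of a two-byte character.
        width = 2 if 0x81 <= raw[i] <= 0xFE else 1
        groups.append(raw[i:i + width])
        i += width
    return groups


raw = b"A\xc4\xe3\xba\xc3B\xc2\xf0"   # ASCII letters mixed with the bytes from the post
groups = split_gbk(raw)
print(groups)

# Decoding group by group gives the same text as decoding everything at once.
decoded = "".join(g.decode("gbk") for g in groups)
print(decoded == raw.decode("gbk"))
```

The key point is that the boundary decision is made while walking the bytes from the start, which is why the position at which decoding begins matters.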

On Jan 24, 1:41, Ben Finney <bignose+hates-s....@benfinney.id.au>
wrote:
> Ben Finney <bignose+hates-s...@benfinney.id.au> writes:
> > glacier <rong.x...@gmail.com> writes:
> >
> > > I use Chinese characters as an example here.
> > >
> > > >>> s1 = '你好吗'
> > > >>> repr(s1)
> > > "'\\xc4\\xe3\\xba\\xc3\\xc2\\xf0'"
> > > >>> b1 = s1.decode('GBK')
> > >
> > > My first question is: what strategy does 'decode' use to tell how
> > > to separate the words? I mean, since s1 is a multi-byte-character
> > > string, how did it determine whether to split the string every
> > > 2 bytes or every 1 byte?
> >
> > The codec you specified ("GBK") is, like any character-encoding
> > codec, a precise mapping between characters and bytes. It's almost
> > certainly not aware of "words", only character-to-byte mappings.
>
> To be clear, I should point out that I didn't mean to imply static
> tabular mappings only. The mappings in a character encoding are often
> more complex and algorithmic.
>
> That doesn't make them any less precise, of course; and the core point
> is that a character-mapping codec is *only* about getting between
> characters and bytes, nothing else.
>
> --
> \ "He who laughs last, thinks slowest." -- Anonymous |
> `\ |
> _o__) |
> Ben Finney

Thanks for your response :)

When I mentioned 'word' in the previous post, I meant character.
Going by your reply, what will happen if I try to decode a long
string piece by piece?
I mean:
######################################
a = '你好吗' * 100000
s1 = u''
cur = 0
while cur < len(a):
    # bytes left to decode, capped at 1023 per chunk
    d = min(len(a) - cur, 1023)
    s1 += a[cur:cur+d].decode('mbcs')
    cur += d
######################################

Could the code above produce any bogus characters in s1?
Thanks :)
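
A minimal sketch of what the chunked loop risks, under some stated assumptions: it runs on Python 3, it uses the gbk codec as a stand-in for mbcs (which is only available on Windows), and the helper names naive_chunks and incremental_chunks are invented for this example. Because every character in the test string is two bytes wide and 1023 is odd, some slices end in the middle of a character. Decoding each slice on its own then misreads the bytes around that boundary, whereas an incremental decoder from the standard codecs module carries the dangling byte over to the next call.

```python
import codecs

raw = b"\xc4\xe3\xba\xc3\xc2\xf0" * 1000   # 6000 bytes, 3000 two-byte characters
CHUNK = 1023                                # odd chunk size, as in the post


def naive_chunks(data, size=CHUNK):
    """Decode every slice independently; a slice may end mid-character."""
    out = []
    for start in range(0, len(data), size):
        # errors="replace" makes the damage visible instead of raising an error
        out.append(data[start:start + size].decode("gbk", errors="replace"))
    return "".join(out)


def incremental_chunks(data, size=CHUNK):
    """Decode slice by slice, keeping incomplete characters between calls."""
    decoder = codecs.getincrementaldecoder("gbk")()
    out = []
    for start in range(0, len(data), size):
        out.append(decoder.decode(data[start:start + size]))
    out.append(decoder.decode(b"", final=True))   # flush; fails if bytes are left over
    return "".join(out)


expected = raw.decode("gbk")
naive = naive_chunks(raw)
safe = incremental_chunks(raw)
print("naive matches: ", naive == expected)
print("incremental matches:", safe == expected)
```

With the default strict error handling, the naive version would be expected to raise UnicodeDecodeError on the first split character rather than silently produce wrong text. Either way, slicing raw bytes at arbitrary offsets and decoding each slice separately is not safe for a multi-byte encoding.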
