How do I write raw binary data in Python?


Question

I've got a Python program that stores and writes data to a file. The data is raw binary data, stored internally as str. I'm writing it out through a utf-8 codec. However, I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 25: character maps to <undefined> in the cp1252.py file.

This looks to me like Python is trying to interpret the data using the default code page. But it doesn't have a default code page. That's why I'm using str, not unicode.

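I guess my questions are:

  • How do I represent raw binary data in memory, in Python?

  • When I'm writing raw binary data out through a codec, how do I encode/unencode it?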

Recommended answer

NOTE: this was written for Python 2.x. Not sure if it is applicable to 3.x.

Your use of str for raw binary data in memory is correct.
[If you're using Python 2.6+, it's even better to use bytes which in 2.6+ is just an alias to str but expresses your intention better, and will help if one day you port the code to Python 3.]

As others note, writing binary data through a codec is strange. A write codec takes unicode and outputs bytes into the file. You're trying to do it backwards, hence our confusion about your intentions...

[And your diagnosis of the error looks correct: since the codec expects unicode, Python is decoding your str into unicode with the system's default encoding, which chokes.]
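The decoding failure can be reproduced explicitly. This is only a sketch: it assumes the default code page involved is cp1252 (inferred from the `cp1252.py` mentioned in the error), and the sample bytes are made up so that 0x8d lands at position 25.

```python
# Hypothetical payload: 25 ASCII bytes followed by 0x8d, which cp1252 leaves undefined.
data = b"A" * 25 + b"\x8d"

error = None
try:
    # Python 2 performs a decode like this implicitly when a str reaches a
    # writer that expects unicode; here it is done by hand.
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    error = exc

message = str(error)
print(message)
```

The point is that the failure happens while turning bytes into unicode, before the UTF-8 codec ever gets to encode anything.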

What do you want to see in the output file?


  • If the file should contain the binary data as-is:

Then you must not send it through a codec; you must write it directly to the file. A codec encodes everything and can only emit valid encodings of unicode (in your case, valid UTF-8). There is no input you can give it to make it emit arbitrary byte sequences!
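A minimal sketch of writing the bytes directly, with no codec involved. The file name and payload are hypothetical; `io.open` in binary mode is used because it should behave the same way on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

# Hypothetical payload: arbitrary bytes, including the 0x8d from the error message.
data = b"\x00\x8d\xff\x10binary"

path = os.path.join(tempfile.mkdtemp(), "raw.bin")  # hypothetical file name

# "wb" opens the file in binary mode: no codec and no newline translation.
with io.open(path, "wb") as f:
    f.write(data)

# Reading back in binary mode returns exactly the bytes that were written.
with io.open(path, "rb") as f:
    round_trip = f.read()
```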


  • If you require a mixture of UTF-8 and raw binary data, you should open the file directly, and intermix writes of some_data with some_text.encode('utf8')...

Note, however, that mixing UTF-8 with raw arbitrary data is very bad design, because such files are very inconvenient to deal with! Tools that understand unicode will choke on the binary data, leaving you with no convenient way to even view (let alone modify) the file.
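For illustration, intermixed writes might look like the sketch below. The names `some_text` and `some_data` come from the answer; their values and the file name are made up. The last step shows the drawback just described: the file as a whole is no longer valid UTF-8.

```python
import io
import os
import tempfile

some_text = u"caf\u00e9 header\n"  # hypothetical text field
some_data = b"\x8d\x00\xfe\xff"    # hypothetical binary field

path = os.path.join(tempfile.mkdtemp(), "mixed.bin")  # hypothetical file name

with io.open(path, "wb") as f:
    f.write(some_text.encode("utf8"))  # text is encoded explicitly
    f.write(some_data)                 # raw bytes are written untouched

with io.open(path, "rb") as f:
    contents = f.read()

# A unicode-aware tool would try something like this and fail on the binary part.
try:
    contents.decode("utf8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```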

  • If you want a friendly representation of arbitrary bytes within unicode:

Pass data.encode('base64') to the codec. Base64 produces only clean ascii (letters, numbers, and a little punctuation) so it can be clearly embedded in anything, it clearly looks to people as binary data, and it's reasonably compact (slightly over 33% overhead).
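A small sketch of the Base64 route. It uses `base64.b64encode` from the standard library instead of the Python 2-only `data.encode('base64')` spelling, which is a substitution on my part so the example also runs on Python 3. The sample data is hypothetical.

```python
import base64

# Hypothetical payload: every possible byte value exactly once.
data = bytes(bytearray(range(256)))

encoded = base64.b64encode(data)  # ASCII-only bytes
text = encoded.decode("ascii")    # unicode text, safe to pass to a UTF-8 writer

# Reversing the process recovers the original bytes.
decoded = base64.b64decode(text.encode("ascii"))

# Size cost relative to the raw data.
overhead = len(encoded) / float(len(data)) - 1
print(len(data), len(encoded), overhead)
```

Unlike the `'base64'` codec, `b64encode` does not insert line breaks, so the measured overhead here comes only from the 4-for-3 expansion and padding.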

P.S. you may note that data.encode('base64') is strange.


  • .encode() is supposed to take unicode but I'm giving it a string?! Python has several pseudo-codecs that convert str->str such as 'base64' and 'zlib'.

.encode() always returns an str but you'll feed it into a codec expecting unicode?! In this case it will only contain clean ascii, so it doesn't matter. You may write explicitly data.encode('base64').encode('utf8') if it makes you feel better.
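The bytes-to-bytes pseudo-codecs can also be reached through `codecs.encode` and `codecs.decode`. The sketch below assumes Python 2 or Python 3.4+, where the `'base64'` and `'zlib'` aliases are available to these functions; the payload is hypothetical.

```python
import codecs

# Hypothetical payload, repetitive so that zlib has something to compress.
data = b"raw \x8d bytes " * 20

b64 = codecs.encode(data, "base64")   # bytes -> bytes, ASCII output with line breaks
packed = codecs.encode(data, "zlib")  # bytes -> bytes, compressed

# Both transforms are reversible.
restored_b64 = codecs.decode(b64, "base64")
restored_zlib = codecs.decode(packed, "zlib")
```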

  • If you require a 1:1 mapping from arbitrary bytes to unicode:

Pass data.decode('latin1') to the codec. latin1 maps bytes 0-255 to unicode characters 0-255, which is kinda elegant.

The codec will, of course, encode your characters: 128-255 are each encoded as 2 bytes in UTF-8 (surprisingly, for evenly distributed byte values the average overhead is 50%, more than base64!). This quite kills the "elegance" of having a 1:1 mapping.
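The mapping and its size cost can be checked with a short sketch. The payload (each byte value once) is a hypothetical stand-in for evenly distributed binary data.

```python
# Hypothetical payload: every possible byte value exactly once.
data = bytes(bytearray(range(256)))

as_text = data.decode("latin1")  # 256 characters, U+0000 through U+00FF
utf8 = as_text.encode("utf8")    # what a UTF-8 writer would put in the file

# Each byte value maps to the code point with the same number.
code_points = [ord(ch) for ch in as_text]

# Bytes 0-127 stay 1 byte each; bytes 128-255 become 2 bytes each.
overhead = len(utf8) / float(len(data)) - 1

# Reading the file back reverses both steps.
restored = utf8.decode("utf8").encode("latin1")
print(len(data), len(utf8), overhead)
```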

Note also that unicode characters 0-255 include nasty invisible/control characters (newline, formfeed, soft hyphen, etc.) making your binary data annoying to view in text editors.

Considering these drawbacks, I do not recommend latin1 unless you understand exactly why you want it.
I'm just mentioning it as the other "natural" encoding that springs to mind.
