Python3 - Cannot read docx, odt file - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte
Problem description
I am trying to split a large docx file into smaller files. To do that, I read the file in Python 3.6 with the following code:
with open('h.docx', 'r') as f:
    a = f.read()
It throws this error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte
h.docx was created using LibreOffice Calc with just 'hello world' in it as content. I can read it successfully in Python 2.7 without any errors.
I tried:
with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()
With this I can read the file without any errors. But when the result is written to another file, the original contents are lost.
I also tried errors='surrogateescape', but when the result is written to another file the original contents are lost as well.
Recommended answer
Not really an answer, but too long for a comment. What you are doing is just nonsense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly others...) are bundled into a single zip file. You should not even contemplate processing such a file by hand unless:
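- you only need small changes, such as replacing one word with another
- you only need read-only operations, such as searching for a specific string
- you want to write a docx processing package (good luck with that)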
Even those would not be simple operations.
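To illustrate why text-mode reading fails and what a read-only search could look like, here is a minimal sketch using only the standard-library zipfile module. It builds a tiny stand-in archive in memory instead of opening a real document; the helper names make_fake_docx and xml_part_contains are hypothetical, and the archive is not a valid Word file.

```python
import io
import zipfile


def make_fake_docx() -> bytes:
    """Build a tiny zip that mimics the layout of a .docx.

    Hypothetical stand-in for illustration only: a real .docx contains
    many more parts and this archive would not open in Word.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(
            "word/document.xml",
            "<w:document><w:t>hello world</w:t></w:document>",
        )
    return buf.getvalue()


def xml_part_contains(data: bytes, needle: str,
                      member: str = "word/document.xml") -> bool:
    """Read-only search: decode one XML member and look for a substring."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read(member).decode("utf-8")
    return needle in xml


data = make_fake_docx()

# The raw bytes are a zip archive (they start with "PK"), not UTF-8 text,
# which is why open(..., 'r') cannot decode a real .docx.
starts_with_zip_magic = data[:2] == b"PK"

found = xml_part_contains(data, "hello world")
```

In a real document the visible text is often split across several XML elements, so a plain substring search like this can miss matches; that is part of why even the "simple" cases are not simple.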
What you can do:
- process the file as a binary file when you only treat it as opaque content, for example to send it over a network connection
- use a dedicated library like python-docx
- under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)