Python3 - Cannot read docx, odt file - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte


Problem description

I am trying to split a large docx file into smaller files. To do that, I read the file in Python 3.6 with the following code:

with open('h.docx', 'r') as f:
    a = f.read()

It throws this error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

h.docx was created using LibreOffice Calc and contains just 'hello world'. I can read the file in Python 2.7 without any errors.

I tried:

with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()

With this I can read the file without any errors, but when I write the result to another file, the original contents are lost.

I also tried errors='surrogateescape', but the original contents are still lost when written to another file.

Recommended answer

This is not really an answer, but it is too long for a comment. What you are doing does not make sense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly others...) are packed into a single zip file. You should not even contemplate processing such a file by hand unless:

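  • you are making changes such as replacing one word with another
  • you only need read-only operations, such as searching for a specific string
  • you want to write a docx processing package (good luck with it)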

and even those would not be simple operations.
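To see why a text-mode read fails, it can help to look at the container itself. The following is a minimal sketch using only the standard library `zipfile` module. The helper names `list_parts` and `make_stand_in_docx` are made up for this illustration, and the stand-in archive only mimics the layout of a .docx; it is not a valid Word document.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the names of the files stored inside a .docx (ZIP) container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def make_stand_in_docx():
    """Build a tiny ZIP that mimics a .docx layout (hypothetical, NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document>hello world</document>")
    return buffer.getvalue()


data = make_stand_in_docx()
print(data[:2])          # ZIP files start with the signature b'PK'
print(list_parts(data))  # the XML parts packed inside the archive
```

For a real file, the same idea applies by passing its path straight to `zipfile.ZipFile('h.docx')`. The text lives inside compressed XML parts, so the raw bytes of the file are binary data rather than UTF-8 text.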

What is possible:

  • process the file as a binary file when you only treat it as opaque content, for example to send it over a network connection
  • use a dedicated library like python-docx
  • under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)
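The first option can be sketched as follows, alongside the text-mode round trip from the question for comparison. This is an illustration under assumptions: the `payload` bytes are an invented stand-in for the start of a real .docx, and the output file is assumed to be written as UTF-8, which the question does not state.

```python
import os
import tempfile

# Invented bytes that are not valid UTF-8, standing in for a real .docx.
payload = b"PK\x03\x04\x14\x00\x08\x08\x08\x00\xeaP\x00\x00"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "h.docx")
    text_copy = os.path.join(tmp, "text_copy.docx")
    binary_copy = os.path.join(tmp, "binary_copy.docx")

    with open(src, "wb") as f:
        f.write(payload)

    # Text-mode round trip: decode as latin-1, then encode as UTF-8 (assumed).
    with open(src, "r", encoding="latin-1") as f:
        text = f.read()
    with open(text_copy, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary round trip: the bytes are never decoded or re-encoded.
    with open(src, "rb") as f:
        raw = f.read()
    with open(binary_copy, "wb") as f:
        f.write(raw)

    with open(text_copy, "rb") as f:
        text_result = f.read()
    with open(binary_copy, "rb") as f:
        binary_result = f.read()

print(text_result == payload)    # False: byte 0xea was re-encoded as two bytes
print(binary_result == payload)  # True: the copy is byte-for-byte identical
```

Reading with latin-1 never fails because every byte maps to some character, but writing those characters back with a different encoding changes the bytes, which would explain the lost contents. Binary mode only preserves the file as a whole; cutting it at arbitrary offsets would still break the zip structure, so actually splitting the document calls for one of the other two options.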
