Using olefile to extract text from Word .doc

Views: 184 Published: 2020/5/13 1:47:12 Tags: python windows ms-word

Problem description

I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.

My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try

from olefile import OleFileIO as ofio

# Read the raw bytes of the main document stream
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()

# Decoding the whole stream fails:
data.decode('utf-16')
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding

data[9884:9885]
# b'\xfa'

# Decoding only the bytes before the failure position works
data[:9884].decode('utf-16')

Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?

(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)

Recommended answer

I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". So I would guess that your extracted data contains more than plain text, such as control characters of some kind, and that is why you can't decode it as UTF-16.
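To illustrate that guess, here is a rough sketch that scans the stream bytes for runs that look like UTF-16LE text and skips everything else. It is a heuristic and not a parser for the Word format. It assumes the text is stored as 16-bit characters at even offsets, and any text stored in another encoding would be missed. The names utf16_text_runs, _is_texty and doc_text_guess are made up for this example, and the character ranges are an assumption chosen for mixed Chinese and English text.

```python
import struct


def _is_texty(cp):
    """Rough guess at whether a UTF-16 code unit is ordinary document text."""
    return (
        cp in (0x09, 0x0A, 0x0D)      # tab, line feed, carriage return
        or 0x20 <= cp <= 0x7E         # printable ASCII
        or 0x3000 <= cp <= 0x303F     # CJK symbols and punctuation
        or 0x4E00 <= cp <= 0x9FFF     # CJK unified ideographs
        or 0xFF00 <= cp <= 0xFFEF     # half-width and full-width forms
    )


def utf16_text_runs(data, min_chars=8):
    """Return runs of at least min_chars plausible UTF-16LE characters."""
    runs, current = [], []
    usable = len(data) - len(data) % 2   # ignore a trailing odd byte
    for (cp,) in struct.iter_unpack('<H', data[:usable]):
        if _is_texty(cp):
            current.append(chr(cp))
        else:
            if len(current) >= min_chars:
                runs.append(''.join(current))
            current = []
    if len(current) >= min_chars:
        runs.append(''.join(current))
    return runs


def doc_text_guess(path):
    """Hypothetical helper: apply the heuristic to the WordDocument stream."""
    import olefile  # third-party; imported here so the helpers above need only the stdlib
    ole = olefile.OleFileIO(path)
    try:
        data = ole.openstream('WordDocument').read()
    finally:
        ole.close()
    return '\n'.join(utf16_text_runs(data))
```

Binary data can look like text by coincidence, so the result may still contain some garbage, and a larger min_chars trades missed short fragments for less noise.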

There are Python modules to convert from doc files, but they tend to work only on Linux where they make use of the command line utilities antiword or catdoc.

I tried other solutions - if the issue is that you have no license for Word, but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file with Chinese letters from doc format to HTML:

"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc
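Since the question mentions many files, the same command could be driven from Python. The following is a minimal sketch, assuming LibreOffice is installed at the path used in the command above. The function names build_convert_command and convert_folder are made up for this example, and the --headless and --outdir options are additions that do not appear in the original command.

```python
import subprocess
from pathlib import Path

# Assumed install location, copied from the command above; adjust if needed.
SWRITER = r"c:\Program Files\LibreOffice\program\swriter.exe"


def build_convert_command(doc_path, out_dir, fmt="html"):
    """Build the LibreOffice command line that converts one file."""
    return [
        SWRITER,
        "--headless",              # run without opening a window
        "--convert-to", fmt,       # target format, as in the original command
        "--outdir", str(out_dir),  # where the converted file is written
        str(doc_path),
    ]


def convert_folder(in_dir, out_dir, fmt="html"):
    """Convert every .doc file in in_dir, one LibreOffice call per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for doc_path in sorted(Path(in_dir).glob("*.doc")):
        # check=True raises CalledProcessError if LibreOffice exits with an error
        subprocess.run(build_convert_command(doc_path, out_dir, fmt), check=True)
```

Calling convert_folder with an input and an output directory would then write one converted file per .doc into the output directory.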

LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows but I couldn't get it to handle the Chinese letters.
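As an example of such further processing, the text can be pulled out of the converted HTML with the standard library alone. This is a sketch under the assumption that the HTML has already been read into a string with the correct encoding. The names TextExtractor and html_to_text are made up for this example, and the set of tags treated as line breaks is an arbitrary choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping head, script and style content."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")   # start block-level elements on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Tables, footnotes and similar structures would need more care than this, but for plain paragraphs it should be enough.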

Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:

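import win32com.client

# Start Word through COM; requires Word and the pywin32 package
app = win32com.client.Dispatch("Word.Application")

try:
    app.Visible = False
    # Documents.Open returns the opened document
    doc = app.Documents.Open('c:/temp/d.doc')

    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)

    # Close the document without saving changes
    doc.Close(False)

except Exception as e:
    print(e)

finally:
    # Always shut Word down, even if something failed
    app.Quit()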
