Using olefile to extract text from Word .doc

Question

I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at other references from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.

My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try

from olefile import OleFileIO as ofio

ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
# UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
# b'\xfa'
data[:9884].decode('utf-16')

Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?

(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)

Recommended answer

I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". So I would guess that your extracted data contains more than plain text: binary structures and control characters of some kind are mixed in with it. That would be why you can't decode the data you get as UTF-16.
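To make that guess more concrete, here is a rough sketch of how the text could be located instead of decoding the whole stream. It assumes a Word 97-2003 file and relies on my reading of the MS-DOC specification: the header of the WordDocument stream points to a "piece table" in a second stream (0Table or 1Table), and each piece says where a run of text sits and whether it is stored as one byte or two bytes per character. The offsets 0x000A and 0x01A2, the function names, and the use of cp1252 for one-byte pieces are all assumptions of this sketch and should be checked against the specification before relying on it.

```python
import struct


def table_stream_name(word: bytes) -> str:
    """Pick the table stream from the flag bits at offset 0x000A (assumed offset)."""
    (flags,) = struct.unpack_from("<H", word, 0x000A)
    return "1Table" if flags & 0x0200 else "0Table"


def extract_doc_text(word: bytes, table: bytes) -> str:
    """Concatenate every text piece listed in the piece table."""
    # Assumed location of fcClx / lcbClx in the header of a Word 97+ file.
    fc_clx, lcb_clx = struct.unpack_from("<II", word, 0x01A2)
    clx = table[fc_clx:fc_clx + lcb_clx]

    pos = 0
    # Skip optional property blocks: type byte 0x01, then a 2-byte length.
    while pos < len(clx) and clx[pos] == 0x01:
        (cb,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + cb
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("piece table not found")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5:pos + 5 + lcb]

    # Layout: n + 1 character positions (4 bytes each), then n descriptors (8 bytes each).
    n = (len(plc) - 4) // 12
    cps = struct.unpack_from("<%dI" % (n + 1), plc, 0)
    parts = []
    for i in range(n):
        (fc_raw,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        count = cps[i + 1] - cps[i]      # number of characters in this piece
        fc = fc_raw & 0x3FFFFFFF
        if fc_raw & 0x40000000:          # one byte per character, stored at fc / 2
            start = fc // 2
            parts.append(word[start:start + count].decode("cp1252", errors="replace"))
        else:                            # UTF-16LE, two bytes per character
            parts.append(word[fc:fc + 2 * count].decode("utf-16-le", errors="replace"))
    return "".join(parts)


def read_doc(path: str) -> str:
    """Read both streams with olefile and decode the pieces."""
    import olefile  # imported here so the helpers above work without it
    ole = olefile.OleFileIO(path)
    try:
        word = ole.openstream("WordDocument").read()
        table = ole.openstream(table_stream_name(word)).read()
    finally:
        ole.close()
    return extract_doc_text(word, table)
```

Calling read_doc('d.doc') would return all pieces joined together, which may include footnote and header text after the main body, and Word's own control characters (for example '\r' as the paragraph mark) are left in place. I have not tried this on encrypted or pre-Word 97 files.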

There are Python modules to convert from doc files, but they tend to work only on Linux where they make use of the command line utilities antiword or catdoc.

I tried other solutions - if the issue is that you have no license for Word, but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file with Chinese letters from doc format to HTML:

"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc

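Since the question mentions many files, the same command can be driven from Python. This is only a sketch: the install path is the one from the command above and may differ on your machine, the helper names build_convert_command and convert_all are made up for illustration, and I have added LibreOffice's --headless and --outdir options so that no window opens and the output lands in a folder of your choice.

```python
import subprocess
from pathlib import Path

# Path taken from the command above; adjust it to your installation.
SOFFICE = r"c:\Program Files\LibreOffice\program\swriter.exe"


def build_convert_command(doc_path, out_dir, fmt="html", soffice=SOFFICE):
    """Return the argument list for converting one file."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", fmt,
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir):
    """Convert every .doc file in src_dir, one LibreOffice call per file."""
    for doc in sorted(Path(src_dir).glob("*.doc")):
        # check=True raises CalledProcessError if LibreOffice reports a failure
        subprocess.run(build_convert_command(doc, out_dir), check=True)
```

For example, convert_all('c:/temp/docs', 'c:/temp/html') would leave one .html file per input document in the output folder. Running one process per file is slow but keeps a single bad document from stopping the rest if you catch the exception.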
LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows but I couldn't get it to handle the Chinese letters.
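As an illustration of that further processing, the HTML can be reduced to plain text with nothing but the standard library. This is a minimal sketch, not a full HTML-to-text converter: the class name TextExtractor, the helper html_to_text, and the list of tags treated as paragraph breaks are my own choices.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, putting a line break after block-level elements."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Ignore the contents of <script> and <style>
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

When reading the converted file from disk, open it with the encoding named in its meta charset tag; I have not checked which encoding LibreOffice picks for a given document, so do not assume UTF-8 without looking.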


Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:

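import win32com.client

app = win32com.client.Dispatch("Word.Application")

try:
    app.Visible = False
    # Documents.Open returns the opened document, so ActiveDocument is not needed
    doc = app.Documents.Open('c:/temp/d.doc')

    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)

    # Close the document without saving changes
    doc.Close(False)

except Exception as e:
    print(e)

finally:
    # Always shut Word down, even if opening or reading failed
    app.Quit()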
