Using olefile to extract text from Word .doc
Question
I am only concerned with getting the text from .doc files. I am using Python 3.6 on Windows 10, so textract/antiword are off the table. I looked at the other references linked from this question, but they are all old and incompatible with Windows 10 and/or Python 3.6.
My document is a .doc file with a mix of Chinese and English. I am not familiar with how Word stores its files, and I don't have Word on my machine. Using olefile I was able to get the bytes of the document, but I do not know how to traverse the headers and layout correctly to extract the text. If I naively try:
from olefile import OleFileIO as ofio
ole = ofio('d.doc')
stream = ole.openstream('WordDocument')
data = stream.read()
data.decode('utf-16')
>>>UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 9884-9885: illegal encoding
data[9884:9885]
>>>b'\xfa'
data[:9884].decode('utf-16')
Then the last line gives me about half the doc, starting and ending with a lot of garbage characters. I suspect I could keep trying this method to get the text piece-by-piece, but I ultimately need to do this for a lot of files. Even if I did it this way, I can't think of a good way to automate it. How can I reliably get the text from a .doc using olefile?
(Feel free to include alternatives to olefile in your answer as well, if you know of one that would work with my specs)
Recommended answer
I am not sure, but I think the problem is that olefile has no understanding of Word documents, only of OLE "streams". So I would guess that your extracted data contains more than plain text, with control characters of some kind mixed in, and that this is why you can't decode the data you get as UTF-16.
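To illustrate what "understanding the Word format" would involve, here is a rough sketch of reading the text through the document's piece table, as I understand it from the published MS-DOC binary format (Word 97 and later). The function names `extract_text` and `extract_text_from_file` are my own, not part of olefile, and I have not tried this on the asker's file, so treat it as a starting point under those assumptions rather than a finished extractor. It does not handle encrypted documents, and the result still contains Word's internal markers (for example `\r` for paragraph ends) and any footnote or header text stored after the main body.

```python
import struct


def extract_text(word_stream, table_stream):
    """Return the text of a Word 97+ .doc by walking its piece table."""
    # Assumption from the MS-DOC layout: fcClx and lcbClx sit at fixed
    # offsets 0x01A2 and 0x01A6 in the FIB at the start of WordDocument.
    fc_clx, lcb_clx = struct.unpack_from('<II', word_stream, 0x01A2)
    clx = table_stream[fc_clx:fc_clx + lcb_clx]

    # Skip optional Prc entries: type byte 0x01, then a 2-byte length.
    pos = 0
    while pos < len(clx) and clx[pos] == 0x01:
        (cb,) = struct.unpack_from('<H', clx, pos + 1)
        pos += 3 + cb
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError('piece table (Pcdt) not found')

    # Pcdt: type byte 0x02, 4-byte length, then the PlcPcd itself:
    # n + 1 character positions followed by n 8-byte piece descriptors.
    (lcb,) = struct.unpack_from('<I', clx, pos + 1)
    plc = clx[pos + 5:pos + 5 + lcb]
    n = (lcb - 4) // 12
    cps = struct.unpack_from('<%dI' % (n + 1), plc, 0)

    parts = []
    for i in range(n):
        # Piece descriptor: 2 flag bytes, 4-byte FcCompressed, 2-byte Prm.
        (fc_raw,) = struct.unpack_from('<I', plc, 4 * (n + 1) + 8 * i + 2)
        count = cps[i + 1] - cps[i]      # number of characters in this piece
        fc = fc_raw & 0x3FFFFFFF         # low 30 bits: offset in WordDocument
        if fc_raw & 0x40000000:          # compressed piece: 1 byte per char
            start = fc // 2
            raw = word_stream[start:start + count]
            parts.append(raw.decode('cp1252', errors='replace'))
        else:                            # uncompressed piece: UTF-16LE
            raw = word_stream[fc:fc + 2 * count]
            parts.append(raw.decode('utf-16-le', errors='replace'))
    return ''.join(parts)


def extract_text_from_file(path):
    """Open a .doc with olefile and extract its text (hypothetical helper)."""
    import olefile  # imported here so extract_text() works without olefile

    ole = olefile.OleFileIO(path)
    try:
        word = ole.openstream('WordDocument').read()
        # Bit 9 of the flags at offset 0x0A selects the table stream.
        (flags,) = struct.unpack_from('<H', word, 0x0A)
        table_name = '1Table' if flags & 0x0200 else '0Table'
        table = ole.openstream(table_name).read()
    finally:
        ole.close()
    return extract_text(word, table)
```

This would also explain the symptom in the question: the WordDocument stream starts with a binary header, and the text can be stored in several pieces with different encodings, so decoding the whole stream as UTF-16 is bound to hit garbage.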
There are Python modules to convert from doc files, but they tend to work only on Linux, where they make use of the command line utilities antiword or catdoc.
I tried other solutions - if the issue is that you have no license for Word, but can otherwise install software, LibreOffice could be a path forward. With this command, I converted a Word test file with Chinese letters from doc format to HTML:
"c:\Program Files\LibreOffice\program\swriter.exe" --convert-to html d.doc
LibreOffice can also convert to many other formats, but HTML should be simple enough to process further. I also tried a port of catdoc to Windows, but I couldn't get it to handle the Chinese letters.
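For doing this over many files, the conversion and the "process further" step could be scripted along these lines. This is only a sketch: the helper names `html_to_text` and `doc_to_text` are made up for illustration, the install path is the one from the command above, and I am assuming that LibreOffice writes `<name>.html` into the output directory and that the file is UTF-8 encoded, so check both on your machine.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path

# Assumed install location, taken from the command line shown above.
SOFFICE = r'c:\Program Files\LibreOffice\program\swriter.exe'


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script>, <style> and <title>."""

    SKIPPED = ('script', 'style', 'title')

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped elements.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


def doc_to_text(doc_path, soffice=SOFFICE, outdir='.'):
    """Convert a .doc to HTML with LibreOffice, then strip the markup."""
    subprocess.run(
        [soffice, '--headless', '--convert-to', 'html',
         '--outdir', outdir, str(doc_path)],
        check=True,
    )
    # Assumption: the converted file keeps the base name and is UTF-8.
    html_path = Path(outdir) / (Path(doc_path).stem + '.html')
    return html_to_text(html_path.read_text(encoding='utf-8'))
```

Calling `doc_to_text('d.doc')` in a loop over your files would then give you one string per document. Only the standard library is used on the Python side, so this should be fine on Python 3.6.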
Too bad you don't have Word installed, or you could have made it do the work for you. Leaving that solution here in case someone else has use for it:
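import win32com.client

app = win32com.client.Dispatch("Word.Application")
try:
    app.Visible = False
    # Documents.Open returns the opened document, so use it directly.
    doc = app.Documents.Open('c:/temp/d.doc')
    with open('out.txt', 'w', encoding='utf-16') as f:
        f.write(doc.Content.Text)
    # Close the document without saving any changes.
    doc.Close(False)
except Exception as e:
    print(e)
finally:
    # Always shut down the Word process, even if something failed.
    app.Quit()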