Reading a .doc file in Python using antiword on Windows (also .docx)
Question
I tried reading a .doc file like this:
with open('file.doc', errors='ignore') as f:
    text = f.read()
It did read the file, but with a huge amount of junk. I can't remove the junk because I don't know where it starts and where it ends.
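The junk is expected: a legacy .doc file is a binary container rather than plain text, so opening it in text mode decodes binary structures as characters. As a minimal sketch (the function name `sniff_word_format` is hypothetical, not from the original post), the first bytes of a file are enough to tell the two Word formats apart: legacy .doc files are OLE2 compound files that begin with the bytes D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP archives that begin with PK\x03\x04.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file signature
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header signature


def sniff_word_format(path):
    """Guess the container format of a Word file from its first bytes.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:  # binary mode: no text decoding is attempted
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

The signature only identifies the container: other OLE2 files (such as .xls) and other ZIP archives match the same bytes, so this is a sanity check rather than a full validation.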
I also tried installing the textract module, which says it can read any file format, but there were many dependency issues while installing it on Windows.
So I did this with the antiword command-line utility instead; my answer is below.
Recommended answer
You can use the antiword command-line utility to do this. I know most of you will have tried it already, but I still wanted to share.
- Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
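Before calling antiword from Python, it can help to confirm that the PATH change took effect. The following is a minimal sketch using the standard library's `shutil.which`; the helper name `find_tool` is hypothetical. Note that a changed PATH is normally only visible to terminals and processes started after the change.

```python
import shutil


def find_tool(name="antiword"):
    """Return the full path of a command-line tool found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    location = find_tool()
    if location is None:
        print("antiword was not found on PATH; check the environment variable")
    else:
        print("antiword found at", location)
```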
Here is a sample of how to use it, handling both .docx and .doc files:
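import os
import docx2txt

def get_doc_text(filepath, file):
    text = ''  # returned unchanged for files that are neither .docx nor .doc
    doc_file = os.path.join(filepath, file)
    if file.endswith('.docx'):
        # read the file from its folder, not from the current working directory
        text = docx2txt.process(doc_file)
    elif file.endswith('.doc'):
        # antiword prints the document as plain text; redirect that output to a
        # temporary file named like the .doc plus 'x' (plain text, not a real .docx)
        docx_file = doc_file + 'x'
        if not os.path.exists(docx_file):
            # quote both paths so that folders with spaces work
            os.system('antiword "' + doc_file + '" > "' + docx_file + '"')
            with open(docx_file, errors='ignore') as f:
                text = f.read()
            os.remove(docx_file)  # docx_file was only needed for reading, so delete it
        else:
            # a file with the same name but a .docx extension already exists;
            # it is a different file, so do not overwrite or delete it
            print('Info: a .docx file with the same name as the .doc already exists, so it cannot be read')
    return text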
Now call this function:
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
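As an alternative sketch, the temporary file can be avoided by capturing antiword's standard output directly with `subprocess.run`. This is not the method used in the answer's code; the helper names `run_and_capture` and `read_doc_with_antiword` are hypothetical, and decoding the output as UTF-8 with replacement characters is an assumption, since the encoding antiword emits depends on how it is configured.

```python
import subprocess


def run_and_capture(args):
    """Run a command given as a list of arguments and return its stdout as text."""
    # Passing a list avoids shell quoting problems with paths that contain spaces.
    # check=True raises CalledProcessError if the command exits with an error.
    result = subprocess.run(args, capture_output=True, check=True)
    # Assumption: treat the output as UTF-8 and replace any undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")


def read_doc_with_antiword(doc_path, antiword="antiword"):
    """Extract plain text from a .doc file; requires antiword on PATH."""
    return run_and_capture([antiword, doc_path])
```

With this variant nothing is written next to the input file, so a .docx that happens to share the .doc file's base name no longer blocks reading it.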
This could be a good alternative way to read .doc files in Python on Windows.
Hope it helps, thanks.