Reading a .doc file in Python using antiword on Windows (also .docx)


Question

I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

It did read the file, but the result was full of junk, and I can't strip the junk out because I don't know where it starts or ends.
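
The junk appears because a .doc file is not plain text: legacy Word documents are stored in a binary OLE compound file, while .docx files are ZIP archives of XML parts. As a rough sketch (the helper name `sniff_word_format` is made up for illustration, and the check only identifies the container, not that the content really is a Word document), the leading bytes can tell the two apart:

```python
OLE_MAGIC = bytes.fromhex('D0CF11E0A1B11AE1')  # OLE2 compound file header (legacy .doc)
ZIP_MAGIC = b'PK\x03\x04'                      # ZIP local file header (.docx)


def sniff_word_format(path):
    """Guess the container format from the first bytes of the file."""
    with open(path, 'rb') as f:
        head = f.read(8)
    if head.startswith(OLE_MAGIC):
        return 'doc'
    if head.startswith(ZIP_MAGIC):
        return 'docx'
    return 'unknown'
```

Opening such a file in text mode just decodes those binary structures as characters, which is where the unreadable output comes from.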

I also tried installing the textract module, which claims to read any file format, but I ran into many dependency issues while installing it on Windows.

So I did this with the antiword command-line utility instead; my answer is below.

Recommended answer

You can use the antiword command-line utility to do this. I know most of you will have tried it already, but I still wanted to share.

  • Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
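
Before calling antiword from Python, it may help to confirm that the PATH change took effect. A minimal sketch, assuming a new console or interpreter was started after editing PATH (the helper name `find_tool` is illustrative only):

```python
import shutil


def find_tool(name='antiword'):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == '__main__':
    path = find_tool()
    if path is None:
        print('antiword was not found on PATH')
    else:
        print('antiword found at', path)
```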

Here is a sample of how to use it, handling both .docx and .doc files:

import os
import subprocess

import docx2txt


def get_doc_text(filepath, file):
    """Return the text of a .docx or .doc file, or '' for any other file."""
    full_path = os.path.join(filepath, file)
    if file.lower().endswith('.docx'):
        return docx2txt.process(full_path)
    if file.lower().endswith('.doc'):
        # antiword prints the document as plain text on stdout, so capture it
        # directly instead of redirecting into a temporary file. Passing the
        # arguments as a list also keeps paths with spaces intact.
        result = subprocess.run(['antiword', full_path],
                                capture_output=True, text=True, errors='ignore')
        return result.stdout
    return ''

Now call this function:

filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
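
Note that os.listdir returns every entry in the folder, including subfolders and files that are not Word documents. A small sketch, assuming only .doc and .docx files directly inside the folder are of interest (the helper name `list_word_files` is hypothetical), filters the listing first:

```python
import os

WORD_EXTENSIONS = ('.doc', '.docx')


def list_word_files(folder):
    """Return the sorted names of the .doc and .docx files directly inside folder."""
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(WORD_EXTENSIONS)
        and os.path.isfile(os.path.join(folder, name))
    )
```

The loop could then iterate over list_word_files(filepath) instead of os.listdir(filepath).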

This could be a good alternative way to read .doc files in Python on Windows.

Hope it helps, thanks.
