Read .doc file with Python

Question

I got a test for a job application; my task is to read some .doc files. Does anyone know a library that can do this? I started with plain Python code:

f = open('test.doc', 'r')
f.read()

But this does not return a readable string; I need to convert it to UTF-8.

I just want to get the text from this file.

Recommended answer

You can use the textract library. It handles both "doc" and "docx" files.

import textract
text = textract.process("path/to/file.extension")
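Since the question asks for a UTF-8 string, note that under Python 3 `textract.process` returns bytes rather than `str`. The following is a minimal sketch of turning that result into text; the helper names `decode_extracted` and `read_doc_text` are made up for illustration, and it assumes textract is installed and that its output is UTF-8 encoded.

```python
def decode_extracted(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode extracted bytes into str (hypothetical helper, not part of textract).

    Undecodable bytes are replaced with U+FFFD instead of raising an error.
    """
    return raw.decode(encoding, errors="replace")


def read_doc_text(path: str) -> str:
    """Extract the text of a .doc/.docx file as a str (hypothetical helper)."""
    # Imported here so decode_extracted stays usable without textract installed.
    import textract  # third-party: pip install textract

    raw = textract.process(path)  # assumed to return UTF-8 encoded bytes
    return decode_extracted(raw)
```

Calling `read_doc_text("test.doc")` would then give a regular string that can be printed or searched directly.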

You can also use 'antiword' directly (sudo apt-get install antiword). It writes the document's plain text to standard output, so there is no need to go through docx first; just redirect the output to a text file:

antiword filename.doc > filename.txt

Ultimately, textract itself uses antiword in the backend for .doc files.
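Because antiword does the actual extraction, it can also be called from Python without textract. This is a rough sketch only: the function name `doc_to_text` is hypothetical, it assumes antiword is on the PATH, and it assumes the output is UTF-8, which may depend on your locale and antiword configuration.

```python
import shutil
import subprocess


def doc_to_text(path: str) -> str:
    """Run the antiword CLI on a .doc file and return its text output."""
    if shutil.which("antiword") is None:
        raise RuntimeError("antiword is not installed or not on PATH")
    # antiword prints the extracted text to standard output.
    result = subprocess.run(["antiword", path], capture_output=True, check=True)
    # Assumption: output is UTF-8; bad bytes are replaced rather than raising.
    return result.stdout.decode("utf-8", errors="replace")
```

This avoids the intermediate file from the shell redirect, at the cost of handling the missing-binary and decoding cases yourself.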
