Read .doc file with Python
Question
I was given a test for a job application; my task is to read some .doc files. Does anyone know a library that can do this? I started with raw Python code:
f = open('test.doc', 'r')
f.read()
but this does not return a readable string, and I need to convert it to UTF-8.
I just want to get the text from this file.
Recommended answer
One can use the textract library. It takes care of both "doc" and "docx" files.
import textract
text = textract.process("path/to/file.extension")
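Since the question asks for a UTF-8 string, note that under Python 3 textract.process is understood to return a byte string rather than str, so a decode step is usually needed. A minimal sketch, assuming the output is UTF-8 encoded (the default, as far as I know); the helper names decode_output and read_doc are hypothetical and not part of textract:

```python
def decode_output(raw, encoding="utf-8"):
    """Turn the bytes returned by an extractor into a str.

    Undecodable bytes are replaced instead of raising, which is an
    assumption made here for robustness, not textract behaviour.
    """
    return raw.decode(encoding, errors="replace")


def read_doc(path):
    """Extract the text of a .doc/.docx file as a str (hypothetical helper)."""
    import textract  # third-party; imported lazily so the helper above works without it

    return decode_output(textract.process(path))


# Example call (needs textract and an existing file):
# print(read_doc("test.doc"))
```

The lazy import keeps decode_output usable even where textract is not installed.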
You can also use 'antiword' directly (sudo apt-get install antiword). It converts a .doc file to plain text on standard output, so the result can be redirected into a text file; no intermediate docx or docx2txt step is needed.
antiword filename.doc > filename.txt
Ultimately, textract uses antiword in the backend for .doc files.
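Because antiword does the actual work for .doc files, it can also be called from Python without textract. A rough sketch, assuming antiword is installed and on PATH; the function name doc_to_text and its antiword_cmd parameter are hypothetical, and the output encoding depends on antiword's configuration, so the UTF-8 decode here is an assumption:

```python
import shutil
import subprocess


def doc_to_text(path, antiword_cmd="antiword"):
    """Run antiword on a .doc file and return its plain-text output as str."""
    # Fail early with a clear message if the executable is missing.
    if shutil.which(antiword_cmd) is None:
        raise RuntimeError(f"{antiword_cmd!r} was not found on PATH")

    # antiword prints the extracted text to standard output.
    result = subprocess.run(
        [antiword_cmd, path],
        capture_output=True,
        check=True,  # raise CalledProcessError if antiword reports an error
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


# Example call (needs antiword and an existing file):
# print(doc_to_text("test.doc"))
```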