如何编写可读取doc/docx文件并将其转换为txt的python脚本? [英] How do I write a python script that can read doc/docx files and convert them to txt?

查看:166
本文介绍了如何编写可读取doc/docx文件并将其转换为txt的python脚本?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

基本上,我有一个包含大量.doc/.docx文件的文件夹.我需要.txt格式的文件.该脚本应遍历目录中的所有文件,将它们转换为.txt文件,并将其存储在另一个文件夹中.

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files and store them in another folder.

我该怎么办?

是否存在可以执行此操作的模块?

Does there exist a module that can do this?

推荐答案

我认为这将成为一个有趣的快速编程项目.仅在包含"Hello,world!"的简单.docx文件中对此进行了测试,但是一系列的逻辑应该为您提供一个解析更复杂文档的场所.

I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.

from shutil import copyfile, rmtree
import sys
import os
import zipfile
from lxml import etree

# command format: python3 docx_to_txt.py Hello.docx

# let's get the file name
zip_dir = sys.argv[1]
# cut off the .docx, make it a .zip
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
# make a copy of the .docx and put it in .zip
copyfile(zip_dir, zip_dir_zip_ext)
# unzip the .zip
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
# get the xml out of /word/document.xml
data = etree.parse('./temp/word/document.xml')
# we'll want to go over all 't' elements in the xml node tree.
# note that MS office uses namespaces and that the w must be defined in the namespaces dictionary args
# each :t element is the "text" of the file. that's what we're looking for
# result is a list filled with the text of each t node in the xml document model
result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'})]
# dump result into a new .txt file
with open(os.path.splitext(zip_dir)[0]+'.txt', 'w') as txt:
    # join the elements of result together since txt.write can't take lists
    joined_result = '\n'.join(result)
    # write it into the new file
    txt.write(joined_result)
# close the zip_ref file
zip_ref.close()
# get rid of our mess of working directories
rmtree('./temp')
os.remove(zip_dir_zip_ext)

我敢肯定,有一种更优雅或更pythonic的方式可以做到这一点.您需要将要转换的文件与python文件放在同一目录中.命令格式为python3 docx_to_txt.py file_name.docx

I'm sure there's a more elegant or pythonic way to accomplish this. You'll need to have the file you want to convert in the same directory as the python file. Command format is python3 docx_to_txt.py file_name.docx

这篇关于如何编写可读取doc/docx文件并将其转换为txt的python脚本?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆