Pulling data out of MS Word with pywin32

Question

I am running Python 3.3 on Windows and I need to pull strings out of Word documents. I have been searching for about a week for the best way to do this. Originally I tried saving the .docx files as .txt and parsing them with regular expressions, but I had formatting problems with hidden characters. I was using a script to open each .docx and save it as .txt. I wonder whether doing a proper File > Save As > .txt would strip out the odd formatting so that I could parse the result properly. I don't know, and I gave up on this method.

I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with pywin32 and COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is far less documentation, and reading through the object model on Microsoft's website is over my head.

Here is what I have so far to open the document(s):

import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)

So at this point I can do something like

print(doc.Content.Text) 

and see the contents of the files, but there still seems to be some odd formatting in there, and I have no idea how to actually parse the text to grab the data I need. I can write regular expressions that successfully find the strings I'm looking for; I just don't know how to apply them in the program through COM.

The code I have so far was mostly found through Google. I don't even think this is that hard; it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is much appreciated. Thank you.

The code I was using to save the files from .docx to .txt:

for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()

Recommended answer

If you don't want to learn the complicated way Word models documents, and then how that model is exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.

There are a lot of options here. Do you use tempfile to create temporary text files and then delete them, or store permanent ones alongside the .docx files for later reuse? Do you use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So I'll just pick something reasonable, and you can look at the Document.SaveAs and WdSaveFormat docs to modify it.

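# Value of the wdFormatUnicodeText constant in Word's WdSaveFormat enumeration.
wdFormatUnicodeText = 7

# `word` is the Word.Application object created in the question's code.
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Same name as the .docx, with a .txt extension.
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    # The 'utf-16' codec reads the BOM to pick the byte order.
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)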
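
The process_the_file call is only a placeholder. Here is a minimal sketch of what such a function might look like, assuming the goal is to collect every string that matches a regular expression. The pattern INVOICE_RE, the helper name find_matches, and the sample text are all made up for illustration, and the temporary file merely stands in for a file that Word saved as Unicode text.

```python
import os
import re
import tempfile

# Hypothetical pattern: replace it with whatever you are actually looking for.
INVOICE_RE = re.compile(r'Invoice\s+#(\d+)')


def find_matches(f, pattern=INVOICE_RE):
    """Return every captured group found in an open text file."""
    matches = []
    for line in f:
        matches.extend(pattern.findall(line))
    return matches


def demo():
    # Stand-in for a file saved by Word as Unicode text (UTF-16 with a BOM).
    with tempfile.TemporaryDirectory() as tmpdir:
        txtpath = os.path.join(tmpdir, 'sample.txt')
        with open(txtpath, 'w', encoding='utf-16') as f:
            f.write('Invoice #1001\nSome other text\nInvoice #1002\n')
        # Opening with encoding='utf-16' consumes the BOM, so it never
        # shows up in the text handed to the regular expression.
        with open(txtpath, encoding='utf-16') as f:
            return find_matches(f)


if __name__ == '__main__':
    print(demo())
```

Scanning line by line keeps memory use flat, but it cannot match a pattern that spans a line break. If you need that, read the whole file with f.read() and run the pattern over the single string instead.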

As noted in your comments, what this does to complex things like tables and multi-column text may not be exactly what you want. In that case, you might consider saving as, e.g., wdFormatFilteredHTML, a format Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
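
To give a rough idea of the HTML route, here is a sketch that pulls table cells out of an HTML string. It uses the standard library's html.parser rather than BeautifulSoup, purely to avoid a third-party dependency. The class name TableExtractor, the helper extract_rows, and SAMPLE_HTML are invented for illustration; the real output of a wdFormatFilteredHTML save (value 10 in WdSaveFormat) is considerably messier than this sample, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return the table rows in an HTML string as lists of cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical, heavily simplified stand-in for Word's HTML output.
SAMPLE_HTML = """
<table>
  <tr><td><p>Name</p></td><td><p>Qty</p></td></tr>
  <tr><td><p>Widget</p></td><td><p>3</p></td></tr>
</table>
"""

if __name__ == '__main__':
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

With BeautifulSoup installed the same job is shorter, but the idea is identical: let Word flatten the document to HTML, then walk the rows and cells in Python instead of through COM.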
