Pulling data out of MS Word with pywin32
Question
I am running Python 3.3 on Windows and I need to pull strings out of Word documents. I have been searching for about a week for the best way to do this. Originally I tried saving the .docx files as .txt and parsing them with regular expressions, but I had formatting problems with hidden characters. I was using a script to open each .docx and save it as .txt. I wonder whether a proper File > Save As > .txt would strip out the odd formatting so that I could parse the result properly. I don't know, but I gave up on this method.
I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with pywin32 and COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is far less documentation, and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
and see the contents of the files, but it still looks like there is some odd formatting in there, and I have no idea how to actually parse through it to grab the data I need. I can write regular expressions that successfully find the strings I'm looking for; I just don't know how to use them in the program through COM.
The code I have so far was mostly found through Google. I don't even think this is that hard, it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
The code I was using to save the files from .docx to .txt:
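import fnmatch
import os
import win32com.client

# Assumed setup: the original snippet used wordapp without showing how it was created.
wordapp = win32com.client.gencache.EnsureDispatch('Word.Application')
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename)) for filename in files if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt, FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()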
Recommended answer
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7  # WdSaveFormat value for Unicode (UTF-16) text
for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    # Pass the variable infile, not the string 'infile'
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
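The process_the_file function is left undefined above. As a rough sketch of what it could look like, the example below applies a regular expression to each line of a UTF-16 text file. Everything specific here is an assumption for illustration: the pattern, the sample text, and the temporary file that stands in for a file saved by Word. No Word or COM calls are involved, so it runs anywhere.

```python
import os
import re
import tempfile

# Hypothetical pattern; substitute the regular expression you already have.
PATTERN = re.compile(r'Order\s+No\.?\s*:\s*(\d+)')

def process_the_file(f):
    """Return every captured match, scanning the open text file line by line."""
    matches = []
    for line in f:
        matches.extend(PATTERN.findall(line))
    return matches

# Stand-in for a file saved by Word: Python's 'utf-16' codec writes a BOM,
# and opening with encoding='utf-16' consumes it again when reading.
with tempfile.TemporaryDirectory() as tmpdir:
    txtpath = os.path.join(tmpdir, 'sample.txt')
    with open(txtpath, 'w', encoding='utf-16') as f:
        f.write('Customer: ACME\nOrder No: 1042\nOrder No.: 77\n')
    with open(txtpath, encoding='utf-16') as f:
        found = process_the_file(f)

print(found)  # ['1042', '77']
```

Scanning line by line keeps memory use low, but it will miss a match that spans a line break. If your strings can wrap across lines, calling PATTERN.findall(f.read()) on the whole file is the simpler choice.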
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
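To give a feel for the HTML route, here is a minimal sketch that pulls table cells out of an HTML string. Note the swap: it uses the standard library's html.parser rather than BeautifulSoup, only so that it runs without installing anything. The TableRows class and the sample markup are invented for illustration; the HTML that Word actually writes carries many more attributes and styles, and this sketch does not handle nested tables.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text pieces while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            # Collapse runs of whitespace left over from the markup.
            text = ' '.join(''.join(self._cell).split())
            if self.rows:
                self.rows[-1].append(text)
            self._cell = None

# Invented sample; cells wrapped in <p> the way word processors often do.
sample_html = ('<table><tr><td><p>Name</p></td><td><p>Qty</p></td></tr>'
               '<tr><td><p>Widget</p></td><td><p> 3 </p></td></tr></table>')
parser = TableRows()
parser.feed(sample_html)
parser.close()
rows = parser.rows
print(rows)
```

Each row comes back as a list of cell strings, so the regular expressions you already have can be applied per cell instead of against one long run of text.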