在.docx文件中查找和替换文本-Python [英] Find and replace text in .docx file - Python

查看:158
本文介绍了在.docx文件中查找和替换文本-Python的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直在努力寻找一种方法来查找和替换docx文件中的文本,但运气不佳.我尝试了docx模块,但无法正常工作.最终,我使用zipfile模块制定了下面描述的方法,并替换了docx存档中的document.xml文件.为此,您需要一个模板文档(docx),其文本要替换为唯一的字符串,该字符串不能与文档中任何其他现有或将来的文本匹配(例如,与XXXMEETDATEXXX上的XXXCLIENTNAMEXXX的会议进行得非常顺利.).

I've been doing a lot of searching for a method to find and replace text in a docx file with little luck. I've tried the docx module and could not get that to work. Eventually I worked out the method described below using the zipfile module and replacing the document.xml file in the docx archive. For this to work you need a template document (docx) with the text you want to replace as unique strings that could not possibly match any other existing or future text in the document (eg. "The meeting with XXXCLIENTNAMEXXX on XXXMEETDATEXXX went very well.").

import zipfile

replaceText = {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}
templateDocx = zipfile.ZipFile("C:/Template.docx")
newDocx = zipfile.ZipFile("C:/NewDocument.docx", "a")

with open(templateDocx.extract("word/document.xml", "C:/")) as tempXmlFile:
    tempXmlStr = tempXmlFile.read()

for key in replaceText.keys():
    tempXmlStr = tempXmlStr.replace(str(key), str(replaceText.get(key)))

with open("C:/temp.xml", "w+") as tempXmlFile:
    tempXmlFile.write(tempXmlStr)

for file in templateDocx.filelist:
    if not file.filename == "word/document.xml":
        newDocx.writestr(file.filename, templateDocx.read(file))

newDocx.write("C:/temp.xml", "word/document.xml")

templateDocx.close()
newDocx.close()

我的问题是这种方法有什么问题?我对这些东西还很陌生,所以我觉得应该已经有人解决了.这使我相信这种方法存在很大的问题.但这有效!我在这里想念什么?

My question is what's wrong with this method? I'm pretty new to this stuff, so I feel someone else should have figured this out already. Which leads me to believe there is something very wrong with this approach. But it works! What am I missing here?

.

以下是我所有其他尝试学习此知识的人的思考过程:

Here is a walkthrough of my thought process for everyone else trying to learn this stuff:

步骤1)准备一个Python字典,将要替换的文本字符串作为键,将新文本作为项(例如{"XXXCLIENTNAMEXXX":"Joe Bob","XXXMEETDATEXXX":"2013年5月31日")}).

Step 1) Prepare a Python dictionary of the text strings you want to replace as keys and the new text as items (eg. {"XXXCLIENTNAMEXXX" : "Joe Bob", "XXXMEETDATEXXX" : "May 31, 2013"}).

第2步)使用zipfile模块打开模板docx文件.

Step 2) Open the template docx file using the zipfile module.

第3步)使用附加访问模式打开一个新的新docx文件.

Step 3) Open a new new docx file with the append access mode.

第4步)从模板docx文件中提取document.xml(所有文本都保存在其中),并将xml读取为文本字符串变量.

Step 4) Extract the document.xml (where all the text lives) from the template docx file and read the xml to a text string variable.

第5步)使用for循环将xml文本字符串中字典中定义的所有文本替换为新文本.

Step 5) Use a for loop to replace all of the text defined in your dictionary in the xml text string with your new text.

步骤6)将xml文本字符串写入新的临时xml文件.

Step 6) Write the xml text string to a new temporary xml file.

第7步)使用for循环和zipfile模块将模板docx归档文件中的所有文件复制到新的docx归档文件中(不包括word/document.xml文件).

Step 7) Use a for loop and the zipfile module to copy all of the files in the template docx archive to a new docx archive EXCEPT the word/document.xml file.

第8步),将带有替换文本的临时xml文件作为新的word/document.xml文件写入新的docx存档中.

Step 8) Write the temporary xml file with the replaced text to the new docx archive as a new word/document.xml file.

第9步)关闭模板和新的docx存档.

Step 9) Close your template and new docx archives.

第10步)打开新的docx文档,并享受替换后的文本!

Step 10) Open your new docx document and enjoy your replaced text!

-编辑-在第7和11行上缺少右括号')'

--Edit-- Missing closing parentheses ')' on lines 7 and 11

推荐答案

有时候,Word做一些奇怪的事情.您应该尝试删除文本并一键重写,例如不要在中间编辑文本

Sometimes, Word does strange things. You should try to remove the text and rewrite it in one stroke, eg without editing the text in the middle

您的文档保存在xml文件中(通常在解压缩后保存在docx的word/document.xml中).有时,您的文字可能不会一stroke而就:文档中的某个位置可能是XXXCLIENT,而其他位置可能是NAMEXXX.

Your document is saved in a xml file (usually in word/document.xml for docx, afer unzipping). Sometimes it is possible that your text won't be in one stroke: it is possible that somewhere in the document, they is XXXCLIENT and somewhere else they is NAMEXXX.

类似这样的东西:

< w:t>XXXCLIENT</w:t>...< w:t>NAMEXXX</w:t>

这种情况经常由于语言支持而发生:当单词认为一个单词属于一种特定语言时,单词会拆分单词,并且可能在单词之间进行拆分,这会将单词拆分为多个标签.

This happens quite often because of language support : word splits words when he thinks that one word is of one specific language, and may do so between words, that will split the words into multiple tags.

解决方案的唯一问题是,您必须一口气编写所有内容,这并不是最友好的操作方式.

Only problem with your solution is that you have to write everything in one stroke, which isn't the most user-friendly.

我创建了一个使用像胡子这样的标记的JS库:{clientName} https://github.com/edi9999/docxgenjs

I have created a JS Library that uses mustache like tags: {clientName} https://github.com/edi9999/docxgenjs

它在全局范围内与算法相同,但是如果内容不是一键便不会崩溃(当您在Word中写{clientName}时,文本通常会在文档中拆分成{,clientName,}.

It works globally the same as your algorithm but won't crash if the content is not in one stroke (when you write {clientName} in Word, the text will usually be splitted: {, clientName, } in the document.

这篇关于在.docx文件中查找和替换文本-Python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆