How can I save an edited Word document with Python?

Problem description

I am attempting to create a script which can extract the XML from a Word document, modify it, and finally save the new Word document, all using Python. Here's the code I used, which was effectively stolen from here:

import zipfile
import os
import tempfile
import shutil


def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlContent)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

One important difference between my code and Virantha's is that he expressed createNewDocx as a class. Unfortunately I don't know what classes are or how they work, so I figured it would be easier to write a function instead.

getXml extracts the XML from a Word document. I tried it out on a test document (named test.docx) and it worked well. In theory, createNewDocx is supposed to take the original docx file (in this case, test.docx) and the modified XML as a string, and create a new Word document named newFilename.

As a test, I ran createNewDocx with the original XML to see if I would get a copy of test.docx. That is, I ran

originalXml = getXml("test.docx")
createNewDocx("test.docx",originalXml,"test2.docx")

This did indeed create a Word document entitled "test2.docx", but when I tried to open the file it just wouldn't open; Word would just crash.

Does anyone know how I can modify my code to make it work?

I decided to include originalXml in case there's some problem with how it's formatted.

b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00000000" w:rsidRDefault="00971B91"><w:r><w:t>You owe me ${debt}. Pay back soon.</w:t></w:r></w:p><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00971B91"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">You owe me </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:b/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>${debt}</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">. 
Pay back </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:i/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>soon.</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

I looked more closely at the XML above and realized that there was an unusual "b'" at the beginning and a stray closing quote at the end. I removed these anomalies and ran the code again. Now Word is giving me a more sensible error, namely that there's a problem with "line 1, column 56." That corresponds to the "\r\n" in the XML above.

So obviously my code isn't extracting the XML properly. Anyone know how to fix this?

Recommended answer

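zip.read("word/document.xml") returns a bytes object. Wrapping it in str() does not decode it: in Python 3 you get the textual representation of the bytes, so the leading b', the closing quote, and escape sequences such as "\r\n" all end up in the string as literal characters.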

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString
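To see the difference concretely, here is a minimal sketch using a short made-up bytes value (the names raw, as_repr and as_text are only for illustration and are not part of the original code). It compares str() on bytes with an explicit decode:

```python
# A short stand-in for the bytes that zip.read() would return.
raw = b'<?xml version="1.0"?>\r\n<w:document/>'

# str() on bytes gives the repr: it starts with b' and the line break
# is spelled out as a backslash followed by letters.
as_repr = str(raw)

# decode() gives the actual text, with a real carriage return and newline.
as_text = raw.decode("utf-8")

print(as_repr.startswith("b'"))   # the stray prefix is part of the string
print("\\r\\n" in as_repr)        # literal backslash sequences
print("\r\n" in as_text)          # real line-break characters
```

Writing as_repr back into word/document.xml would give Word a file that is not valid XML, which would be consistent with the error reported at line 1, column 56.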

That is why xmlString contains those stray characters: it is the string representation of a bytes object rather than the decoded XML. Remove the str() call and decode the bytes before returning:

import zipfile

def getXml(docxFilename):
    # Read the raw bytes, then decode them instead of wrapping them in str()
    with zipfile.ZipFile(docxFilename) as docx:
        xmlBytes = docx.read("word/document.xml")
    return xmlBytes.decode("utf-8")
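The writing side may need similar care. open(path, "w") without an encoding argument uses the platform's default encoding, and on Windows text mode also rewrites line endings, so the bytes written may not match the UTF-8 declared in the XML header. Below is a rough sketch of one alternative, assuming the only goal is to swap word/document.xml and copy every other archive member unchanged. The function name replace_document_xml is hypothetical and this is not the code from the original question or answer:

```python
import zipfile


def replace_document_xml(original_docx, xml_text, new_filename):
    """Copy original_docx to new_filename, using xml_text as word/document.xml."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # Encode explicitly so the stored bytes are UTF-8 on every platform.
                data = xml_text.encode("utf-8")
            else:
                # Every other member is copied byte for byte.
                data = src.read(item.filename)
            dst.writestr(item, data)
```

Called as replace_document_xml("test.docx", getXml("test.docx"), "test2.docx"), this should produce a copy of the original document without going through a temporary directory.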

Hope this helps others!
