Problems extracting the XML from a Word document in French with Python: illegal characters generated


Problem description

Over the past few days I have been attempting to create a script which would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create and save a new Word document. With the help of many Stack Overflow users I was eventually able to find code that looks very promising. Here it is:

import zipfile
import os
import tempfile
import shutil

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString= zip.read("word/document.xml").decode("utf-8")
    return xmlString

def createNewDocx(originalDocx,xmlString,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlString)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

getXml extracts the XML from docxFilename as a string. createNewDocx takes the original Word document and replaces its XML with xmlString, which is a modified version of the original XML, and saves the resulting Word document as newFilename.

To check that the script works as intended, I first created a test document ("test.docx") and ran createNewDocx("test.docx",getXml("test.docx"),"test2.docx"). If everything worked as intended, this was supposed to create an identical copy of test.docx saved as test2.docx. Indeed, that was the case.

I then made the test document more elaborate and experimented with modifying it. And the script still worked!

I then confidently applied my script to the Word document I was actually interested in modifying: template.docx. I ran createNewDocx("template.docx",getXml("template.docx"),"template2.docx"), expecting that the script would generate an identical copy of template.docx named template2.docx. Unfortunately, the new Word document could not be opened; apparently there was an illegal character in the XML.

I really don't understand why my code would work for my test document but not for my actual document. I would post template.docx's XML but it contains personal information. One important difference between test.docx and template.docx is that template.docx is written in French, and therefore contains special characters like accents, and also the apostrophes look different. I have no idea if this is what's causing my trouble but I have no other ideas.

Solution

The problem is that you are accidentally changing the encoding on word/document.xml in template2.docx. word/document.xml (from template.docx) is initially encoded as UTF-8 (as is the default encoding for XML documents).

xmlString = zip.read("word/document.xml").decode("utf-8")

However, when you copy it for template2.docx you are changing the encoding to CP-1252. According to the documentation for open(file, "w"),

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

You indicated that calling locale.getpreferredencoding(False) gives you cp1252, which is therefore the encoding word/document.xml is being written in.
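To see which encoding a bare open(file, "w") would fall back to on a given machine, you can ask the locale module directly. This is only a minimal sketch: the value it prints depends on the platform and on how Python is configured, so it may well differ from the cp1252 reported in the question.

```python
import locale

# Encoding that open() uses in text mode when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```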

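Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252, and the CP-1252 bytes for the accented letters and curly apostrophes are not valid UTF-8. That is what gives you the illegal XML character error. Your test document presumably contained only ASCII characters, which are encoded identically in both encodings, so it was unaffected.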
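The mismatch can be reproduced without Word at all. The sketch below uses a made-up French sample string (an assumption, since the real template.docx was not posted) and a small hypothetical helper, is_valid_utf8, to show that the CP-1252 bytes cannot be decoded as UTF-8, while plain ASCII text comes out the same in both encodings.

```python
# Hypothetical sample text: accented letters plus a typographic apostrophe.
text = "L’été à Québec"

cp1252_bytes = text.encode("cp1252")  # what a cp1252 text-mode write produces
utf8_bytes = text.encode("utf-8")     # what an XML reader assumes by default


def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(utf8_bytes))    # the UTF-8 bytes decode cleanly
print(is_valid_utf8(cp1252_bytes))  # the cp1252 bytes do not

# ASCII-only text has the same bytes in both encodings.
print("plain text".encode("cp1252") == "plain text".encode("utf-8"))
```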

So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():

with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
    f.write(xmlString)
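If you would rather avoid text-mode files altogether, one possible alternative (not the code from the question, and only a sketch) is to copy the archive entry by entry and write the modified XML as explicitly encoded bytes. The function name create_new_docx is hypothetical, and the sketch assumes the output path differs from the input path.

```python
import zipfile


def create_new_docx(original_docx, xml_string, new_filename):
    """Copy original_docx to new_filename, replacing word/document.xml."""
    with zipfile.ZipFile(original_docx, "r") as src, \
            zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # Encode explicitly so the bytes are the UTF-8 Word expects.
                data = xml_string.encode("utf-8")
            else:
                # Every other entry is copied through unchanged as raw bytes.
                data = src.read(item.filename)
            dst.writestr(item, data)
```

With this variant no temporary directory is needed, and the platform's default encoding never comes into play because nothing is opened in text mode.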
