lxml.etree.XML ValueError for Unicode string
Problem description
I'm transforming an XML document with XSLT (https://gist.github.com/guinslym/5ce47460a31fe4c4046b#file-test_xslt-xslt). Running the script with python3 produces the following error, but I don't get any error with python2:
-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-
import os

from lxml import etree


def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.

    It removes all the attributes and assigns them as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    # Text mode: read() returns an already-decoded str
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    # This is the call that raises the ValueError under python3
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()
Recommended answer
data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()
Opening the file in text mode implicitly decodes its bytes to Unicode text, using the default encoding. (If the XML file isn't in that encoding, this might give wrong results.)
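To make the implicit decoding concrete, here is a small sketch using only the standard library. The file name `sample.txt` and the sample word are made up for illustration; the point is only that binary mode returns the raw bytes, while text mode returns a `str` whose content depends on the encoding used to decode it.

```python
import locale
import os
import tempfile

# Hypothetical sample data: a word with a non-ASCII character, stored as UTF-8.
raw = "p\u00eache".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Binary mode: no decoding happens, read() returns bytes.
    with open(path, "rb") as f:
        as_bytes = f.read()

    # Text mode with the wrong encoding: decoding succeeds but the text is garbled.
    with open(path, encoding="latin-1") as f:
        wrong_text = f.read()

    # Text mode with the right encoding: the text round-trips.
    with open(path, encoding="utf-8") as f:
        right_text = f.read()

# What open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(type(as_bytes).__name__, repr(wrong_text), repr(right_text), default_encoding)
```

Which default encoding applies depends on the platform and locale, so the value printed last will differ between machines.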
xslt_root = etree.XML(xslt_content)
XML has its own handling and signalling for encodings: the <?xml encoding="..."?> prolog. If you pass a Unicode string starting with <?xml encoding="..."?> to a parser, the parser would like to reinterpret the rest of the byte string using that encoding, but it can't, because you've already decoded the byte input to a Unicode string. Rather than guess, lxml raises the ValueError shown in the question.
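The behaviour described here can be reproduced in a few lines. This is a minimal sketch, assuming lxml is installed; the helper `parse_outcome` and the sample document are invented for the demonstration and are not part of the question's code.

```python
from lxml import etree

# Hypothetical sample document with an encoding declaration in the prolog.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'


def parse_outcome(source):
    """Return the root tag on success, or the string 'ValueError' on failure."""
    try:
        return etree.XML(source).tag
    except ValueError:
        return "ValueError"


# str input that still carries an encoding declaration: rejected.
text_result = parse_outcome(xml_text)

# The same document as undecoded bytes: the parser applies the declaration itself.
bytes_result = parse_outcome(xml_text.encode("utf-8"))

# str input without any declaration (an "XML fragment"): accepted.
fragment_result = parse_outcome("<root>caf\u00e9</root>")

print(text_result, bytes_result, fragment_result)
```

The three cases correspond to the wording of the error message: bytes input and declaration-free fragments are accepted, while a decoded string with a declaration is not.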
Instead, you should either pass the undecoded byte string to the parser:
data = open(module_path+'/data/ex-fire.xslt', 'rb')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)
or, better, just have the parser read straight from the file:
xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')
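Putting the second suggestion together, the whole transformation might look like the sketch below. It assumes lxml is installed. The function name `transform_file`, the tiny stylesheet, and the sample input are all hypothetical stand-ins, since the real `ex-fire.xslt` and dataset are not shown here. The stylesheet only imitates the stated goal of turning attributes into elements.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stylesheet: copies elements and turns each attribute into a child element.
XSLT_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:element name="{name()}"><xsl:value-of select="."/></xsl:element>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical input document (UTF-8 bytes with a non-ASCII character).
XML_BYTES = b'<?xml version="1.0" encoding="UTF-8"?><data><item id="7">p\xc3\xaache</item></data>'


def transform_file(xslt_path, xml_path, out_path):
    """Apply the stylesheet at xslt_path to xml_path and write the result to out_path."""
    # etree.parse reads the bytes itself, so the encoding declaration is honoured.
    transform = etree.XSLT(etree.parse(xslt_path))
    result = transform(etree.parse(xml_path))
    # str(result) is text, so state the output encoding explicitly when writing.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(result))


with tempfile.TemporaryDirectory() as tmp:
    xslt_path = os.path.join(tmp, "demo.xslt")
    xml_path = os.path.join(tmp, "demo.xml")
    out_path = os.path.join(tmp, "out.xml")
    with open(xslt_path, "wb") as f:
        f.write(XSLT_BYTES)
    with open(xml_path, "wb") as f:
        f.write(XML_BYTES)
    transform_file(xslt_path, xml_path, out_path)
    # Read the output back to inspect it.
    out_root = etree.parse(out_path).getroot()

print(etree.tostring(out_root, encoding="unicode"))
```

Note that `etree.parse` returns an ElementTree rather than a root element; `etree.XSLT` accepts either. The `with` blocks also close the files, which the original snippets leave open.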