lxml.etree.XML ValueError for Unicode string

Question

I'm transforming an XML document with an XSLT stylesheet (https://gist.github.com/guinslym/5ce47460a31fe4c4046b#file-test_xslt-xslt). Running the script with python3 produces the error below, but the same script runs without errors under python2.

-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-

import os

from lxml import etree

def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.
    All attributes are removed and re-created as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()

Recommended answer

data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()

This implicitly decodes the bytes in the file to Unicode text, using the platform's default encoding. (If the XML file isn't actually in that encoding, the result may be wrong.)
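As a small illustration of that implicit decoding, the sketch below reads the same file once in text mode and once in binary mode. The file name and sample content are made up for the demo, and the encoding is passed explicitly only so the result does not depend on the machine it runs on.

```python
import os
import tempfile

# Hypothetical stylesheet content, stored as Latin-1 bytes for the demo.
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><note>pêches</note>'.encode("latin-1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xslt")
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode: Python decodes the bytes before any XML parser sees them.
    # Without the encoding argument, open() uses a platform-dependent default.
    with open(path, encoding="latin-1") as f:
        as_text = f.read()

    # Binary mode: the bytes come back untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)

# Decoding these particular bytes with the wrong codec raises an error.
try:
    raw.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

In text mode the decision about the encoding has already been made by the time the content reaches the parser, whatever the XML declaration inside the file says.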

xslt_root = etree.XML(xslt_content)

XML has its own handling and signalling for encodings: the <?xml encoding="..."?> prolog. If you pass a Unicode string starting with <?xml encoding="..."?> to a parser, the parser would like to reinterpret the rest of the byte string using that encoding... but it can't, because you've already decoded the byte input to a Unicode string.
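To see this behaviour in isolation, here is a minimal sketch with a made-up one-element document. It assumes lxml is installed and only uses etree.XML, the same call as in the question.

```python
from lxml import etree

# Made-up document for the demo; any XML with an encoding declaration will do.
declared = '<?xml version="1.0" encoding="UTF-8"?><root>pêches</root>'

# A str that still carries an encoding declaration is rejected.
try:
    etree.XML(declared)
    rejected = False
except ValueError:
    rejected = True

# The same document as bytes parses; lxml reads the declaration itself.
root_from_bytes = etree.XML(declared.encode("utf-8"))

# A str without any declaration is accepted as well.
root_from_fragment = etree.XML("<root>pêches</root>")
```

The second and third calls correspond to the two options named in the error message: bytes input, or an XML fragment without a declaration.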

Instead, you should either pass the undecoded byte string to the parser:

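# Binary mode ('rb') hands the raw bytes to lxml, which then applies
# the encoding declared in the XML prolog itself.
with open(module_path+'/data/ex-fire.xslt', 'rb') as data:
    xslt_content = data.read()
xslt_root = etree.XML(xslt_content)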

or, better, just have the parser read straight from the file:

xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')
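Putting the second option together, the function from the question could look roughly like the sketch below. The helper name apply_stylesheet, the stylesheet, and the input document are all hypothetical and exist only to make the example runnable; the lxml calls (etree.parse, etree.XSLT, str(result)) are the ones already used in the question.

```python
import os
import tempfile

from lxml import etree

def apply_stylesheet(xslt_path, xml_path, out_path):
    """Hypothetical helper: transform xml_path with xslt_path, write out_path."""
    # etree.parse opens the files itself and honours each encoding declaration.
    transform = etree.XSLT(etree.parse(xslt_path))
    result = transform(etree.parse(xml_path))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(result))

# Made-up stylesheet and input, only to exercise the helper.
STYLESHEET = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/item">
    <entry><name><xsl:value-of select="@name"/></name></entry>
  </xsl:template>
</xsl:stylesheet>
"""
DOCUMENT = b'<?xml version="1.0" encoding="UTF-8"?><item name="cod"/>'

with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, n) for n in ("demo.xslt", "in.xml", "out.xml")]
    # Only the first two paths receive content; the third is the output file.
    for path, payload in zip(paths, (STYLESHEET, DOCUMENT)):
        with open(path, "wb") as f:
            f.write(payload)
    apply_stylesheet(*paths)
    with open(paths[2], encoding="utf-8") as f:
        output = f.read()
```

Because no file is ever read into a str before parsing, the ValueError from the question cannot occur on this path.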
