法语和lxml文本 [英] French and lxml text

查看:72
本文介绍了法语和lxml文本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用lxml将有效的法语文本字符串分配给文本字符串:

I'm trying to assign a valid French text string to a text string using lxml:

el = etree.Element("someelement")
el.text = 'Disponible à partir du 1er Octobre'

我得到了错误:

ValueError:所有字符串必须与XML兼容:Unicode或ASCII,否 空字节或控制字符

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

我也尝试过:

el.ext = etree.CDATA('Disponible à partir du 1er Octobre')

但是我遇到同样的错误.

However I get the same error.

如何处理XML中的法语,尤其是ISO-8859-1?有一些方法可以在lxml的tostring()函数中指定编码,但不能在元素中分配文本值.

How do I handle French in XML, in particular, ISO-8859-1? There are ways to specify encoding within the tostring() function in lxml, but not for assigning text values within elements.

推荐答案

如果文本包含非ASCII数据,则应将其作为el.text的Unicode字符串提供.

If text contains non-ascii data then you should provide it as a Unicode string for el.text.

@Abbasov Alexander的答案显示,您可以使用Unicode文字u''来做到这一点. Python没有引发异常,因此我假设您已经声明了Python源文件的字符编码(例如,在顶部使用# coding: utf-8注释).这种编码定义了Python如何解释源代码中的非ASCII字符,它与用于将xml保存到文件中的编码无关.

As @Abbasov Alexander's answer shows you could do it using a Unicode literal u''. Python hasn't raise an exception so I assume that you've declared a character encoding of your Python source file (e.g., using # coding: utf-8 comment at the top). This encoding defines how Python interprets non-ascii characters in the source, it is unrelated to the encoding you use to save xml to a file.

如果文本已经存在于变量中,并且尚未将其转换为Unicode,则可以使用text.decode(text_encoding)进行操作(text_encoding可能与Python源编码无关).

If the text is already in a variable and you haven't converted it to Unicode yet, you could do it using text.decode(text_encoding) (text_encoding may be unrelated to the Python source encoding).

令人困惑的是,el.text(作为优化)在Python 2上为纯ascii数据返回了一个字节串.它违反了不得混用字节和Unicode字符串的规则.尽管sys.getdefaultencoding()像大多数情况一样返回基于ascii的编码应该可以工作.

The confusing bit might be that el.text (as an optimization) returns a bytestring on Python 2 for pure ascii data. It breaks the rule that you should not mix bytes and Unicode strings. Though It should work if sys.getdefaultencoding() returns an ascii-based encoding as it does in most cases.

要保存xml,请将所需的任何字符编码传递给tostring()ElementTree.write()函数.同样,此编码与其他已经提到的编码无关.

To save xml, pass any character encoding you need totostring() or ElementTree.write() functions. Again, this encoding is unrelated to others already mentioned encodings.

通常,使用 Unicode三明治:尽快将字节解码为Unicode收到它们后,请在程序中使用Unicode文本,当需要使用不支持Unicode的API(文件,网络)发送文本时,请尽可能晚地将其编码为字节.

In general, use Unicode sandwich: decode bytes to Unicode as soon as you receive them, work with Unicode text inside your program, encode to bytes as late as possible when you need to send the text using API that doesn't support Unicode (files, network).

这篇关于法语和lxml文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆