UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when printing French text


Problem description

I am cleaning the French monolingual corpus of Europarl (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The raw data comes as a .gz file, which I downloaded with wget. I want to extract the text and see what it looks like before processing the corpus further.

Using the following code to extract the text from the gzip file, I get data of type bytes.

import gzip

file_path = 'europarl-v7.fr.gz'  # path to the downloaded corpus

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read()
    print('type(text)=', type(text))

The output, including the first few lines of the text, is as follows:

type(f_in)= <class 'gzip.GzipFile'>

type(text)= <class 'bytes'>

b'Reprise de la session\nJe d\xc3\xa9clare reprise la session du Parlement europ\xc3\xa9en qui avait \xc3\xa9t\xc3\xa9 interrompue le vendredi 17 d\xc3\xa9cembre dernier et je vous renouvelle tous mes vux en esp\xc3\xa9rant que vous avez pass\xc3\xa9 de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l\'an 2000" ne s\'est pas produit.\n

I tried to decode the binary data as utf8 and as ascii with the following code:

with gzip.open(file_path, 'rb') as f_in:
    print('type(f_in)=', type(f_in))
    text = f_in.read().decode('utf8')
    print('type(text)=', type(text))

It returned an error like this:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 26: ordinal not in range(128)

I also tried opening the file with the codecs and unicodedata packages, but that returned an encoding error as well.

Could you please explain what I should do to get the French text in the correct format, like this for example?

Reprise de la session\nJe déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances.\nComme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit.\n

Thank you very much for your help!

Recommended answer

The UnicodeEncodeError occurs because, when printing, Python encodes the string to bytes using the encoding of the output stream. In this case that encoding is ASCII, which has no character matching '\xe9' (é), so the error is raised. The decoding step itself succeeds; it is the print call that fails.
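The distinction between decoding and encoding can be reproduced without touching the real terminal. The following is a minimal sketch, not part of the original answer: the helper `write_with_encoding` is a hypothetical name, and the in-memory stream merely stands in for a stdout whose encoding is ASCII or UTF-8.

```python
import io

# UTF-8 bytes, similar to what f_in.read() returns for the corpus.
raw = b'Je d\xc3\xa9clare reprise la session'

# Decoding works: the two bytes \xc3\xa9 become the single character '\xe9'.
text = raw.decode('utf-8')


def write_with_encoding(value, encoding):
    """Print `value` to an in-memory text stream that uses `encoding`
    and return the bytes that were written (hypothetical helper)."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding=encoding, newline='\n')
    print(value, file=stream)
    stream.flush()
    return buffer.getvalue()


# An ASCII stream cannot represent '\xe9', so printing to it fails.
try:
    write_with_encoding(text, 'ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# A UTF-8 stream encodes the same string without trouble.
utf8_bytes = write_with_encoding(text, 'utf-8')

print(str(ascii_error))
print(utf8_bytes)
```

The same string is fine in one stream and fails in the other, which suggests the fix lies in the output encoding rather than in how the file is read.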

Setting the PYTHONIOENCODING environment variable forces Python to use a different encoding for its standard streams, namely the value of the environment variable. UTF-8 can encode any Unicode character, so calling the program like this solves the issue:

PYTHONIOENCODING=UTF-8 python3 europarl_extractor.py

assuming the code is something like this:

import gzip

if __name__ == '__main__':
    with gzip.open('europarl-v7.fr.gz', 'rb') as f_in:
        bs = f_in.read()
        txt = bs.decode('utf-8')
        print(txt[:100])
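As a possible variation on the code above, `gzip.open` also accepts a text mode, which decodes while reading and avoids loading the whole corpus into memory just to look at its first lines. This is a sketch under assumptions: `read_first_lines` is a hypothetical helper, and the small sample file is a stand-in created only so the example runs without the real corpus.

```python
import gzip
import itertools
import os
import tempfile


def read_first_lines(path, count, encoding='utf-8'):
    """Return the first `count` lines of a gzip-compressed text file."""
    # Mode 'rt' makes gzip.open decode the bytes to str while reading.
    with gzip.open(path, 'rt', encoding=encoding) as f_in:
        return [line.rstrip('\n') for line in itertools.islice(f_in, count)]


# Stand-in data (not the real corpus), written to a temporary .gz file.
sample = 'Reprise de la session\nJe d\xe9clare reprise la session\nDerni\xe8re ligne\n'
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.fr.gz')
    with gzip.open(sample_path, 'wt', encoding='utf-8', newline='\n') as f_out:
        f_out.write(sample)
    first_lines = read_first_lines(sample_path, 2)

print(len(first_lines), 'lines read')
```

Printing the lines themselves still depends on the encoding of standard output, so this changes how the file is read, not how the text is displayed.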

The environment variable may also be set in other ways: via an export statement, or in .bashrc, .profile, etc.
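If changing the environment is not convenient, the output encoding can in principle also be switched from inside the program. This is an alternative that the original answer does not mention: on Python 3.7 and later, text streams such as `sys.stdout` offer a `reconfigure` method. The function name `ensure_utf8_stdout` below is hypothetical, and the sketch deliberately does nothing when standard output has been replaced by an object without that method.

```python
import sys


def ensure_utf8_stdout():
    """Switch sys.stdout to UTF-8 when it uses another encoding.

    Returns the encoding reported by sys.stdout afterwards (may be None
    if stdout was replaced by an object without an encoding).
    """
    stream = sys.stdout
    current = (getattr(stream, 'encoding', None) or '').lower().replace('-', '')
    # reconfigure() exists on io.TextIOWrapper since Python 3.7.
    if current != 'utf8' and hasattr(stream, 'reconfigure'):
        stream.reconfigure(encoding='utf-8')
    return getattr(sys.stdout, 'encoding', None)


encoding_in_use = ensure_utf8_stdout()
print('stdout encoding:', encoding_in_use)
```

Setting the environment variable keeps the script unchanged, whereas this approach ties the choice of encoding to the program itself; which one fits better depends on how the script is deployed.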

An interesting question is why Python is trying to encode the output as ASCII. I had assumed that on *nix systems, Python essentially looks at the $LANG environment variable to determine the encoding to use. But in this case the value of $LANG is fr_FR.UTF-8, and yet Python is using ASCII as the output encoding.

Judging from the source of the locale module, and from an FAQ on the subject, these environment variables are checked, in order:

'LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE'

So it may be that LC_ALL or LC_CTYPE has been set to a value that mandates ASCII encoding in your environment. You can check by running the locale command in your terminal; running locale charmap will tell you the encoding itself.
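The same check can be made from within Python. The sketch below is a simplified illustration of the lookup order listed above, not the locale module's actual code; `first_locale_setting` is a hypothetical helper, and Python's real startup logic may take further factors into account.

```python
import locale
import os
import sys

# Lookup order as listed in the answer.
LOCALE_VARIABLES = ('LC_ALL', 'LC_CTYPE', 'LANG', 'LANGUAGE')


def first_locale_setting(environ):
    """Return (name, value) of the first non-empty variable in lookup
    order, or None if none of them is set."""
    for name in LOCALE_VARIABLES:
        value = environ.get(name)
        if value:
            return name, value
    return None


print('deciding variable:', first_locale_setting(os.environ))
print('preferred encoding:', locale.getpreferredencoding(False))
print('stdout encoding:', getattr(sys.stdout, 'encoding', None))
```

For example, with LANG set to fr_FR.UTF-8 and LC_ALL set to C, the helper reports LC_ALL as the deciding variable, which would be consistent with the ASCII behaviour described in the question.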
