Fixing mojibake in UTF-8 text

Problem description

I have a file containing Portuguese text in UTF-8. Whoever produced the file somehow selected the wrong encoding, and the text is full of mojibake:

IDENTIFICAÌàÌÄO instead of identificação
AndrÃ© instead of André

Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all the incorrect characters manually?

Recommended answer

"AndrÃ©" instead of "André" is what you get when UTF-8 encoded bytes are interpreted as Latin-1. You can fix it by reversing the faulty decoding: encode the text back to bytes as Latin-1, then decode those bytes as UTF-8:

>>> 'AndrÃ©'.encode('latin-1').decode('utf-8')
'André'

All cases following this pattern can be fixed like that.
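As a minimal sketch of applying this to arbitrary strings, the round trip can be wrapped in a helper that leaves the text untouched when it does not fit the pattern. The function name `fix_latin1_mojibake` is hypothetical, and the fallback behaviour is an assumption rather than part of the original answer:

```python
def fix_latin1_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes read as Latin-1' mojibake; return text unchanged on failure."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really were.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8: it does not follow this pattern.
        return text


print(fix_latin1_mojibake("AndrÃ©"))  # André
print(fix_latin1_mojibake("André"))   # already correct, returned unchanged
```

Because the helper works on the whole string at once, a single character that does not fit the pattern makes it return the input as-is. For a file with mixed damage, applying it line by line or word by word may recover more text.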

However, I can't explain the other case (with "Ìà" for "ç" and "ÌÄ" for "ã"), and therefore can't provide a solution. If you can find a codec in which "Ì", "à", and "Ä" are encoded as the bytes C3, A7, and A3, respectively, then you can use it instead of Latin-1 to fix the text.
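One way to look for such a codec is to test a handful of single-byte encodings against the required character-to-byte mapping. The sketch below is only illustrative: `codecs_matching` and `CANDIDATE_CODECS` are hypothetical names, the candidate list is an arbitrary selection of codecs shipped with Python, and there is no guarantee that any codec satisfies the mapping.

```python
# Arbitrary selection of single-byte codecs from the Python standard library.
CANDIDATE_CODECS = ["latin-1", "cp1252", "iso8859_15", "cp437", "cp850", "mac_roman"]


def codecs_matching(expected, candidates=CANDIDATE_CODECS):
    """Return the candidate codecs that encode every character to its expected byte."""
    matches = []
    for name in candidates:
        try:
            if all(ch.encode(name) == bytes([byte]) for ch, byte in expected.items()):
                matches.append(name)
        except (UnicodeEncodeError, LookupError):
            # The codec cannot represent one of the characters, or is unavailable.
            continue
    return matches


# The mapping the answer asks for: "Ì" -> C3, "à" -> A7, "Ä" -> A3.
print(codecs_matching({"Ì": 0xC3, "à": 0xA7, "Ä": 0xA3}))

# Sanity check with the known case: "Ã" -> C3 and "©" -> A9 holds for Latin-1.
print(codecs_matching({"Ã": 0xC3, "©": 0xA9}))
```

If a codec turns up for the first mapping, it can be substituted for `'latin-1'` in the encode/decode round trip shown earlier.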
