Fixing mojibake in UTF-8 text

Views: 58 Published: 2020/7/13 3:39:29 Tags: python utf-8 character-encoding mojibake


Problem description

I have a file containing Portuguese text in UTF-8. Whoever produced the file somehow selected the wrong encoding, and the text is full of mojibake:

IDENTIFICAÌàÌÄO instead of identificação
AndrÃ© instead of André

Automated tools do not see anything wrong with the file. I tried to fix it with the Python package ftfy, to no avail. How can I fix this file, apart from replacing all the incorrect characters manually?

Recommended answer

"AndrÃ©" instead of "André" is what you get when UTF-8 encoded bytes are interpreted as Latin-1. You can fix it by reversing the steps: encode the text back to bytes as Latin-1, then decode those bytes as UTF-8:

>>> 'AndrÃ©'.encode('latin-1').decode('utf-8')
'André'

All cases that follow this pattern can be fixed the same way.
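To apply the same round trip to longer text, a minimal sketch might look like the following. The function name `fix_mojibake` and the choice to return the input unchanged when the round trip fails are assumptions made for illustration; they are not part of the original answer.

```python
def fix_mojibake(text, wrong_codec="latin-1"):
    """Undo a 'UTF-8 bytes decoded with the wrong codec' mistake.

    Hypothetical helper: re-encodes the text with the codec that was
    wrongly used for decoding, then decodes the bytes as UTF-8.
    If either step fails, the text is assumed to be fine already
    and is returned unchanged.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("AndrÃ©"))  # the mojibake case from the question
print(fix_mojibake("André"))   # already correct, left as-is
```

Calling the helper line by line, rather than once on the whole file, lets correct lines pass through untouched even when only part of the file is damaged.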

However, I can't explain the other case (with "Ìà" for "ç" and "ÌÄ" for "ã"), and therefore can't provide a solution for it. In UTF-8, "ç" is the byte pair C3 A7 and "ã" is C3 A3, so if you can find a codec in which "Ì", "à", and "Ä" are encoded as the bytes C3, A7, and A3, respectively, you can use that codec instead of Latin-1 to fix the text.
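One way to look for such a codec is to test a list of candidates and keep those where each character encodes to the expected single byte. The sketch below is only an illustration: the function name `find_codecs` and the candidate list are assumptions, and nothing here guarantees that a matching codec exists for the "Ìà" case.

```python
# Hypothetical candidate list; extend it with any codec names
# from Python's standard encodings that seem plausible.
CANDIDATES = ["latin-1", "cp1252", "iso8859-15", "mac_roman", "cp437", "cp850"]


def find_codecs(expected, candidates=CANDIDATES):
    """Return the codecs in which every character maps to its expected byte.

    `expected` maps a single character to the byte value (0-255)
    it should encode to.
    """
    matches = []
    for name in candidates:
        try:
            ok = all(
                ch.encode(name) == bytes([byte])
                for ch, byte in expected.items()
            )
        except (UnicodeEncodeError, LookupError):
            # Character not representable in this codec, or unknown codec.
            ok = False
        if ok:
            matches.append(name)
    return matches


# Sanity check with the known case: "Ã©" is the bytes C3 A9 read as Latin-1.
print(find_codecs({"Ã": 0xC3, "©": 0xA9}))

# The unexplained case from the question; the result may well be empty.
print(find_codecs({"Ì": 0xC3, "à": 0xA7, "Ä": 0xA3}))
```

If the second call returns a codec name, that name could be tried in place of `'latin-1'` in the encode/decode round trip.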
