Unbaking mojibake

Problem description

When you have incorrectly decoded characters, how can you identify likely candidates for the original string?

Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png

I know for a fact that this image filename should contain Japanese characters. But despite various guesses with urllib quoting/unquoting and with encoding and decoding as iso8859-1 and utf8, I haven't been able to unmunge it and recover the original filename.

Is the corruption reversible?

Solution

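You could use chardet (install with pip). The snippet below is for Python 2, where a plain string literal is already a byte string, which is what chardet.detect expects:

# -*- coding: utf-8 -*-
import chardet

your_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"
# chardet returns a dict; "encoding" is its best guess and may be None
detected_encoding = chardet.detect(your_str)["encoding"]

try:
    correct_str = your_str.decode(detected_encoding)
    print(correct_str)
except (UnicodeDecodeError, TypeError):
    # TypeError covers the case where no encoding was detected (None)
    print("Could not estimate encoding")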

Result: 時間試験観点(アニメパス)_10秒 (no idea whether this is correct or not)

For Python 3 (source file encoded as utf8):

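import chardet

falsely_decoded_str = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb"

# Step 1: undo the wrong decode by re-encoding with the codec that was wrongly applied
try:
    encoded_str = falsely_decoded_str.encode("cp850")
except UnicodeEncodeError:
    print("could not encode falsely decoded string")
    encoded_str = None

if encoded_str:
    # Step 2: let chardet guess the real encoding of the recovered bytes
    detected_encoding = chardet.detect(encoded_str)["encoding"]

    try:
        correct_str = encoded_str.decode(detected_encoding)
    except (UnicodeDecodeError, LookupError, TypeError):
        # TypeError: chardet returned None; LookupError: unknown codec name
        print("could not decode encoded_str as %s" % detected_encoding)
    else:
        # Only write the file when decoding succeeded
        with open("output.txt", "w", encoding="utf-8-sig") as out:
            out.write(correct_str)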
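
Whether the corruption is reversible depends on the codec that was wrongly applied: re-encoding with cp850 recovers the original bytes only if every byte value maps to its own character in that codec. A minimal sketch of that check, assuming Python 3; the helper name `is_lossless` is hypothetical and not part of any library:

```python
def is_lossless(codec):
    """Return True if every byte 0..255 decodes under `codec` and encodes back unchanged."""
    for value in range(256):
        raw = bytes([value])
        try:
            if raw.decode(codec).encode(codec) != raw:
                return False
        except (UnicodeDecodeError, UnicodeEncodeError):
            # At least one byte has no character in this codec,
            # so text wrongly decoded with it may have lost information.
            return False
    return True


if __name__ == "__main__":
    for codec in ("cp850", "latin-1", "cp1252"):
        print(codec, is_lossless(codec))
```

A codec that passes this check can always be undone with `.encode(codec)`; one that fails may have dropped or replaced bytes at the time the text was garbled.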

In summary:

>>> s = 'Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png'
>>> s.encode('cp850').decode('shift-jis')
'時間試験観点(アニメパス)_10秒.png'
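
When the codec pair is not known in advance, one way to find likely candidates for the original string is to try every combination of "wrongly applied" and "real" codec and rank the results that survive. The following is only a sketch under stated assumptions: the two codec lists, the function names `unbake_candidates` and `japanese_ratio`, and the "looks Japanese" scoring rule are all illustrative choices, not part of the original answer or of any library.

```python
from itertools import product

# Hypothetical candidate lists; extend them for your own data.
WRONG_CODECS = ["latin-1", "cp1252", "cp437", "cp850"]
REAL_CODECS = ["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]


def japanese_ratio(text):
    """Fraction of characters that are kana or common CJK ideographs."""
    if not text:
        return 0.0
    hits = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff" or "\u4e00" <= ch <= "\u9fff"
    )
    return hits / len(text)


def unbake_candidates(garbled):
    """Return (wrong_codec, real_codec, text) for every pair that round-trips."""
    results = []
    for wrong, real in product(WRONG_CODECS, REAL_CODECS):
        try:
            text = garbled.encode(wrong).decode(real)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the garbled string
        results.append((wrong, real, text))
    # Most Japanese-looking candidates first
    results.sort(key=lambda r: japanese_ratio(r[2]), reverse=True)
    return results


if __name__ == "__main__":
    s = "Ä×èÈÄÄî▒è¤ô_üiâAâjâüâpâXüj_10òb.png"
    for wrong, real, text in unbake_candidates(s):
        print(wrong, real, text)
```

Pairs that raise an encode or decode error are discarded outright, which already eliminates most wrong guesses; the ranking is only a tie-breaker, and a human still has to judge whether the top candidate is meaningful.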
