How do I write a custom encoding in Python to clean up my data?

Question

I know I've done this before at another job, but I can't remember what I did.

I have a database that is full of varchar and memo fields that were cut and pasted from Office, webpages, and who knows where else. This is starting to cause encoding errors for me. Since Python has a very nice "decode" function to take a byte stream and translate it into Unicode, I thought I would just write my own encoding to fix this up. (For example, to take "smart quotes" and turn them into "standard quotes".)

But I can't remember how to get started. I think I copied one of the encodings that was close (cp1252.py) and then updated it.

Can anyone put me on the right path? Or suggest a better path?

Recommended answer

I've expanded this with a bit more detail.

If you are reasonably sure of the encoding of the text in the database, you can call text.decode('cp1252') to get a Unicode string. If the guess is wrong, this will likely blow up with an exception, or the decoder will silently lose some characters.
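As a minimal sketch of what that looks like in Python 3 (where the raw database value is a bytes object), the snippet below decodes a cp1252 byte string and shows both failure modes. The sample byte strings are made up for illustration; 0x81 is one of the few bytes that Python's cp1252 codec leaves undefined.

```python
# Bytes 0x93 and 0x94 are the cp1252 "smart" double quotes.
raw = b'\x93smart quotes\x94'
text = raw.decode('cp1252')
print(repr(text))

# 0x81 has no mapping in Python's cp1252 codec, so strict decoding raises.
bad = b'caf\x81'
failed = False
try:
    bad.decode('cp1252')
except UnicodeDecodeError as exc:
    failed = True
    print('decode failed:', exc)

# Non-strict error handlers avoid the exception but lose information.
lossy = bad.decode('cp1252', errors='replace')   # undecodable byte becomes U+FFFD
dropped = bad.decode('cp1252', errors='ignore')  # undecodable byte vanishes
print(repr(lossy), repr(dropped))
```

The errors='ignore' case is the one where characters quietly "disappear", so it is worth choosing the error handler deliberately.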

Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.
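One way this could look in Python 3, as a sketch rather than a definitive implementation: instead of copying cp1252.py by hand, it copies the decoding table from the standard encodings.cp1252 module, overrides the smart-quote bytes, and registers the result. The codec name "cp1252clean" and the helper names are hypothetical, and encoding is simply delegated to plain cp1252.

```python
import codecs
from encodings import cp1252

# Copy cp1252's 256-entry byte -> character table and override the smart quotes.
_table = list(cp1252.decoding_table)
_table[0x91] = _table[0x92] = "'"   # left/right single curly quotes
_table[0x93] = _table[0x94] = '"'   # left/right double curly quotes
DECODING_TABLE = ''.join(_table)


def _decode(data, errors='strict'):
    # Returns a (decoded_text, bytes_consumed) tuple, as the codec API expects.
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _search(name):
    # "cp1252clean" is a made-up codec name used only in this sketch.
    if name == 'cp1252clean':
        return codecs.CodecInfo(
            name='cp1252clean',
            encode=codecs.getencoder('cp1252'),  # encoding stays plain cp1252
            decode=_decode,
        )
    return None


codecs.register(_search)

cleaned = b'\x93it\x92s fine\x94'.decode('cp1252clean')
print(cleaned)
```

A byte table like this can only map each byte to a single character, so replacements such as an ellipsis to three dots need a separate pass over the decoded string.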

However, if not all of the text in the database has the same encoding, your decoder will need some rules to decide which mapping is correct. In this case you may want to punt and use the chardet module, which can scan the text and make a guess at the encoding.

Maybe the best approach would be to try decoding with the most likely encoding (cp1252) and, if that fails, fall back to using chardet to guess the correct encoding.
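That fallback strategy might be sketched as follows for Python 3. The function name to_unicode and the final errors='replace' step are assumptions added for illustration; chardet is a third-party package, so the import is guarded and the function still works (without guessing) when it is not installed.

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None


def to_unicode(raw, preferred='cp1252'):
    """Decode bytes with the preferred encoding, falling back to a chardet guess."""
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError:
        pass

    # chardet.detect returns a dict; its 'encoding' entry can be None.
    guess = chardet.detect(raw)['encoding'] if chardet else None
    if guess:
        try:
            return raw.decode(guess)
        except (UnicodeDecodeError, LookupError):
            pass

    # Last resort (an assumption of this sketch): keep what decodes
    # and mark the rest with U+FFFD.
    return raw.decode(preferred, errors='replace')


print(to_unicode(b'\x93hi\x94'))
```

Note that cp1252 rejects only a handful of byte values, so the strict attempt succeeds for most input even when the real encoding is something else; the fallback will only trigger on those few bytes.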

If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine that translates characters in a Unicode string, e.g. converting curly quotes to ASCII:

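# Each pair maps a run of Unicode characters to its ASCII replacement.
CHARMAP = [
    (u'\u201c\u201d', '"'),   # curly double quotes
    (u'\u2018\u2019', "'"),   # curly single quotes
    ]

# Stand-in for real data; replace with text.decode('cp1252') or chardet
text = u'\u201cit\u2019s probably going to work\u201d, he said'

# Flatten CHARMAP into a per-character lookup table.
_map = dict((c, r) for chars, r in CHARMAP for c in chars)
# Characters without an entry pass through unchanged.
fixed = ''.join(_map.get(c, c) for c in text)
print(fixed)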

Output:

"it's probably going to work", he said

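In Python 3 the same per-character replacement can also be written with the built-in str.translate, which is an alternative to the routine above rather than part of the original answer. The function name asciify_punctuation and the ellipsis entry are illustrative additions.

```python
# Map each character to its replacement; values may be longer than one character.
REPLACEMENTS = {
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u2026': '...',                # horizontal ellipsis (extra example)
}
TABLE = str.maketrans(REPLACEMENTS)


def asciify_punctuation(text):
    """Return text with the characters in REPLACEMENTS swapped for ASCII."""
    return text.translate(TABLE)


sample = '\u201cit\u2019s probably going to work\u201d, he said\u2026'
result = asciify_punctuation(sample)
print(result)
```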