Recycling internationalized garbage


Question


Hi folks,

Please help me with international string issues:
I put together an AJAX discography search engine

http://www.xfeedme.com/discs/discography.html

using data from the FreeDB music database

http://www.freedb.org/

Unfortunately FreeDB has a lot of junk in it, including
randomly mixed character encodings for international
strings. As an expediency I decided to just delete all
characters that weren't ascii, so I could get the thing
running. Now I look through the log files and notice that
a certain category of user immediately homes in on this
and finds it amusing to see how badly I've mangled
the strings :(. I presume they chuckle and make
disparaging remarks about "united states of ascii"
and then leave never to return.
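The stopgap described above amounts to something like this (a Python 3 sketch; the function name is mine, not from the post):

```python
def ascii_only(raw: bytes) -> str:
    """Keep only the ASCII bytes of raw -- lossy, as the post admits."""
    return bytes(b for b in raw if b < 128).decode("ascii")
```

Applied to UTF-8 input such as `"naïve café"` it silently drops the accented characters, which is exactly the mangling the users were laughing at.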

Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.

Thanks, -- Aaron Watters

===

As someone once remarked to Schubert
"take me to your leider" (sorry about that).
-- Tom Lehrer

Answers

" aa *************** @ yahoo。 COM"写道:
"aa***************@yahoo.com" wrote:
Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.





some alternatives:

braindead bruteforce:

try to do strict decoding as utf-8. if you succeed, you have an utf-8
string. if not, assume iso-8859-1.
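In Python 3 terms, the braindead bruteforce is just (function name mine):

```python
def recover_text(raw: bytes) -> str:
    """Strict UTF-8 first; fall back to ISO-8859-1.

    ISO-8859-1 maps every byte value to a code point, so the
    fallback never raises -- it just may guess wrong.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

The catch, as the next reply explains, is that the ISO-8859-1 fallback accepts literally any byte sequence, so it never signals a bad guess.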

slightly smarter bruteforce:

http://aspn.activestate.com/ASPN/Coo.../Recipe/163743

more advanced (but possibly not good enough for very short texts):

http://chardet.feedparser.org/

</F>


Fredrik Lundh <fr*****@pythonware.com> wrote:
" aa *************** @ yahoo.com"写道:
"aa***************@yahoo.com" wrote:
Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)? The string might be in any of the
myriad encodings that predate unicode. Has anyone
done this in Python already? The output must be clean
utf8 suitable for arbitrary xml parsers.






some alternatives:

braindead bruteforce:

try to do strict decoding as utf-8. if you succeed, you have an utf-8
string. if not, assume iso-8859-1.





that was a mistake I made once.
Do not use iso8859-1 as the Python codec; instead create your own codec,
called e.g. iso8859-1-ncc, like this (just a sketch):

import codecs

# identity mapping for bytes 32-127 and 160-255; the C1 control range
# (128-159) is deliberately left out so those bytes fail to decode
decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
decoding_map.update({})  # add custom overrides here if needed
encoding_map = codecs.make_encoding_map(decoding_map)

and then use :

def try_encodings(s, encodings):
    "try to guess the encoding of string s, testing encodings given in second parameter"
    for enc in encodings:
        try:
            test = unicode(s, enc)
            return enc
        except UnicodeDecodeError, r:
            pass
    return None

guessed_encoding = try_encodings(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])
it seems to work surprisingly well, if you know approximately the
language(s) the text is expected to be in (e.g. replace cp1252 with
cp1250, iso8859-1-ncc with iso8859-2-ncc for central european languages)
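Under Python 3 the same idea can be sketched without registering a codec at all (function names here are mine, not from the post). The point of the -ncc variant is that bytes 0x80-0x9F are C1 control characters that essentially never occur in real Latin-1 text, so rejecting them makes the Latin-1 attempt fail on likely cp1252/mojibake data instead of silently accepting every byte:

```python
def decode_latin1_ncc(raw: bytes) -> str:
    """Decode as ISO-8859-1, but reject C1 control bytes (0x80-0x9F)."""
    for i, b in enumerate(raw):
        if 0x80 <= b <= 0x9F:
            raise UnicodeDecodeError("iso8859-1-ncc", raw, i, i + 1,
                                     "byte in C1 control range")
    return raw.decode("iso-8859-1")

def try_encodings(raw: bytes, encodings):
    """Return (encoding, text) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            if enc == "iso8859-1-ncc":
                return enc, decode_latin1_ncc(raw)
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return None, None
```

For example, cp1252 "smart quotes" (0x93/0x94) fall through UTF-8 and the -ncc check and are only claimed by cp1252, whereas plain accented Latin-1 text is still accepted by the -ncc codec.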

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik@kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!



aa***************@yahoo.com wrote:
Question: what is a good strategy for taking an 8bit
string of unknown encoding and recovering the largest
amount of reasonable information from it (translated to
utf8 if needed)?
Copy the string unmodified to the WWW page and ensure your page doesn't
identify the encoding used. That way it becomes the browser's problem,
and if the user reading the page can understand the language the string
is written in there's a very good chance the browser will display it
correctly. Unfortunately, that's how text like this is supposed to be
displayed.
The output must be clean utf8 suitable for arbitrary xml parsers.







Oh, you're screwed then.

Ross Ridge

