Any way to detect and remove (or fix) bad characters resulting from bad encoding conversions


Question

I am writing a parser. I have taken care of all the encoding conversion so that it outputs UTF-8 correctly, but sometimes the source material itself is incorrect, such as â€tm, the result of a bad encoding conversion.

I know this is a long shot, but does anyone know of a list of common strings resulting from bad character conversions, or anything similar, so I don't have to build my own list?

Yes, I know I am being lazy, but I read somewhere that that makes me a good programmer?

Answer

tl;dr: see the last two paragraphs.

I hate/love encoding issues.

We're looking at a mutated copy of the Unicode character RIGHT SINGLE QUOTATION MARK (U+2019). The UTF-8 byte sequence for that character is 0xE2 0x80 0x99. In Windows-1252, those three bytes correspond to a+hat (â), the Euro sign (€), and the trademark symbol (™). The 'tm' we see is a further transliteration of that trademark symbol into ASCII t and ASCII m, 0x74 0x6D, making our final corrupted sequence of bytes 0xE2 0x80 0x74 0x6D.

Chances are that the actual representation of a+hat-euro-t-m is already in UTF-8. That is, a+hat is stored as a UTF-8 sequence and the Euro symbol is also stored as a UTF-8 sequence, because someone copied from a Windows-1252 document that was already improperly encoded and pasted into a UTF-8 document. You'll find it's plenty more bytes than just the four from the original corruption.
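To make the byte counts concrete, here is a minimal sketch that reproduces the corruption. It assumes PHP 7 or later with the iconv extension, and the final "tm" step is simulated by hand with str_replace, since the source does not say which tool actually did the transliteration.

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK, stored as UTF-8.
$quote = "\u{2019}";
echo bin2hex($quote), "\n";    // e28099

// Misread those three bytes as Windows-1252 and re-encode them as UTF-8.
$mojibake = iconv('Windows-1252', 'UTF-8', $quote);
echo $mojibake, "\n";          // â€™
echo bin2hex($mojibake), "\n"; // c3a2e282ace284a2

// Hand-simulated step (assumption): something later flattens ™ to ASCII "tm".
$damaged = str_replace("\u{2122}", 'tm', $mojibake);
echo $damaged, "\n";           // â€tm
echo strlen($damaged), "\n";   // 7 bytes, not the 4 of the raw corruption
```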

One way to solve this would be to first turn the UTF-8 encoding of those characters back into Windows-1252, then treat the resulting Windows-1252 string as UTF-8 when writing it back out.

You can use iconv (php.net/manual/en/function.iconv.php) with the //TRANSLIT flag for this purpose:

$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad);

This tells iconv to try turning any characters that can't be represented in Windows-1252 into something similar. This transliteration is imperfect and will destroy any legitimate UTF-8 characters that aren't representable in Windows-1252.
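As a rough sketch of that round trip on an undamaged sample (the sample string is an assumption for illustration, and //TRANSLIT is deliberately left out so that nothing gets approximated):

```php
<?php
// Intact mojibake for U+2019: the characters â, €, ™ encoded as UTF-8.
$bad = "\u{00E2}\u{20AC}\u{2122}";

// Step 1: map each character back to its single Windows-1252 byte.
$bytes = iconv('UTF-8', 'Windows-1252', $bad);

// Step 2: no further conversion; the bytes are simply read as UTF-8.
var_dump(bin2hex($bytes));       // string(6) "e28099"
var_dump($bytes === "\u{2019}"); // bool(true)
```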

Once you have the Windows-1252 string, save it back out and serve it up as UTF-8. If all went well, the corruption should be gone, and you shouldn't have any problems.

Yeah, right.

In this specific case, the final byte of the proper sequence, 0x99, has been munged into two bytes by a bad copy/paste. You aren't going to get it back through character set encoding hoop jumping.
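A small sketch of why the round trip fails here, assuming the damaged input is exactly â, €, t, m stored as UTF-8. The validity check uses preg_match with the /u modifier, which is one of several ways to test for well-formed UTF-8 in PHP:

```php
<?php
// Damaged mojibake from the question: â, €, then ASCII "t" and "m".
$bad = "\u{00E2}\u{20AC}tm";

// Same first step as the round trip: back to Windows-1252 bytes.
$bytes = iconv('UTF-8', 'Windows-1252', $bad);
echo bin2hex($bytes), "\n"; // e280746d

// 0xE2 starts a three-byte sequence, but 0x74 is not a continuation byte.
// preg_match() with /u returns false when the subject is malformed UTF-8.
$isValidUtf8 = preg_match('//u', $bytes) === 1;
var_dump($isValidUtf8);     // bool(false)
```

A check like this can at least flag strings where the round trip produced garbage, so they are not written back out as if they were repaired.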

While the hoop jumping could work for some documents, you will surely find many things that are even more poorly re-encoded. Your best bet is going to be a byte-level search and replace operation, looking for incorrectly encoded sequences and replacing them with a plain-ASCII or properly UTF-8 encoded alternative. There are lots of ways the encoding could be wrong. For example, if the corruption source was in the ISO-8859 family, the final corrupted sequence would have been different, or perhaps the final ™ might not have been munged into t and m in certain places.

A byte-level search and replace is guaranteed to affect only the incorrectly re-encoded sequences, and carries no risk of munching on single-encoded UTF-8 characters that can't be represented in inferior character sets. It's safer and faster.
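A minimal sketch of such a search and replace follows. The function name repairMojibake and the three table entries are hypothetical, chosen only to illustrate the shape; a real table would be built from the corrupt sequences actually seen in the source material.

```php
<?php
// Hypothetical replacement table: corrupt byte sequence => intended character.
$fixes = [
    "\u{00E2}\u{20AC}\u{2122}" => "\u{2019}", // â€™  -> ’
    "\u{00E2}\u{20AC}tm"       => "\u{2019}", // â€tm -> ’
    "\u{00E2}\u{20AC}\u{0153}" => "\u{201C}", // â€œ  -> “
];

function repairMojibake(string $text, array $fixes): string
{
    // strtr() tries the longest keys first and never rescans its own output,
    // so one replacement cannot be corrupted by another.
    return strtr($text, $fixes);
}

echo repairMojibake("It\u{00E2}\u{20AC}tms fine, na\u{00EF}ve too", $fixes), "\n";
// It’s fine, naïve too
```

The legitimate ï in the sample passes through untouched, because only exact matches from the table are rewritten.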

Edit: I totally didn't catch that you were already planning on doing this. ;) Unfortunately, I've never seen such a handy list. Perhaps you should publish and publicize your work so that others may benefit. yourcharacterencodingsucks.com is available!
