Any way to detect and remove (or fix) bad characters resulting from bad encoding conversions

Views: 194 Published: 2016/11/19 14:46:25 Tags: php character-encoding

Question

I am writing a parser. I have taken care of all the encoding conversion so that the output is correct UTF-8, but sometimes the source material itself is incorrect. It contains things such as ☐ or â€tm, which are the results of bad encoding conversions.

I know this is a long shot, but does anyone know of a list of common strings resulting from bad character conversions, or anything similar, so that I don't have to build my own list?

Yes, I know I am being lazy, but I read somewhere that that makes me a good programmer?

Recommended answer

tl;dr: See the last two paragraphs.

I hate/love encoding issues.

We're looking at a mutated copy of the Unicode character 'RIGHT SINGLE QUOTATION MARK' (U+2019). The UTF-8 byte sequence for that character is 0xE2 0x80 0x99. Read as Windows-1252, those three bytes correspond to â (a with circumflex), the Euro sign (€), and the trademark symbol (™). The 'tm' we see is a further transliteration of that trademark symbol into ASCII t and ASCII m, 0x74 0x6D, making our final corrupted sequence of bytes 0xE2 0x80 0x74 0x6D.

Chances are that the actual representation of â€tm is already in UTF-8. That is, the â is a UTF-8 sequence and the Euro sign is also a UTF-8 sequence, because someone copied from a Windows-1252 document that was already improperly encoded and pasted into a UTF-8 document. You'll find it's plenty more bytes than just the four from the original corruption.
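As a rough illustration of that double encoding, the sketch below takes the three UTF-8 bytes of U+2019, misreads them as Windows-1252, and re-encodes the result as UTF-8. The variable names are made up for this example, and it assumes the iconv extension knows the Windows-1252 charset name.

```php
<?php
// UTF-8 bytes of U+2019 RIGHT SINGLE QUOTATION MARK.
$quote = "\xE2\x80\x99";

// Misread those three bytes as Windows-1252 characters (â, €, ™)
// and encode each of those characters as UTF-8.
$doubleEncoded = iconv('Windows-1252', 'UTF-8', $quote);

echo bin2hex($quote), "\n";         // e28099
echo bin2hex($doubleEncoded), "\n"; // c3a2e282ace284a2
echo strlen($doubleEncoded), "\n";  // 8
```

The single three-byte character has become three characters occupying eight bytes, before any further damage such as the ™ to "tm" transliteration.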

One way to solve this would be to first turn the UTF-8 encoding of those characters back into Windows-1252, then treat that Windows-1252 string as UTF-8 when writing it back out.

You can use iconv with the //TRANSLIT flag for this purpose:
$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad);

This tells iconv to try turning any characters that can't be represented in Windows-1252 into something similar. This translation is imperfect and will destroy any legitimate UTF-8 characters that aren't representable in Windows-1252.

Once you have the Windows-1252 string, save it back out and serve it up as UTF-8. If all went well, the corruption should be gone, and you shouldn't have any problems.

Yeah, right.

In this specific case, the final byte of the proper sequence, 0x99, has been munged into two bytes by a bad copy/paste. You aren't going to get it back through character set encoding hoop jumping.
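
To make both the round trip and its limit concrete, here is a minimal sketch. The helper name undoDoubleEncoding is hypothetical, not from the answer above. It also leaves out the //TRANSLIT flag, since every character in these inputs exists in Windows-1252 and the flag's behavior can differ between iconv implementations; instead it rejects any input that fails to convert.

```php
<?php
// Hypothetical helper: returns the repaired string, or null when the
// round trip does not produce valid UTF-8.
function undoDoubleEncoding(string $bad): ?string
{
    // Step 1: turn the UTF-8 characters back into single Windows-1252 bytes.
    $bytes = @iconv('UTF-8', 'Windows-1252', $bad);
    if ($bytes === false) {
        return null; // input had characters Windows-1252 cannot represent
    }

    // Step 2: keep the result only if those bytes form valid UTF-8.
    return preg_match('//u', $bytes) === 1 ? $bytes : null;
}

$clean  = "it\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2s"; // "itâ€™s", cleanly double-encoded
$munged = "it\xC3\xA2\xE2\x82\xACtms";           // "itâ€tms", ™ already turned into "tm"

var_dump(undoDoubleEncoding($clean) === "it\xE2\x80\x99s"); // bool(true)
var_dump(undoDoubleEncoding($munged));                       // NULL
```

The cleanly double-encoded string comes back as a proper U+2019, while the "tm" variant converts to the bytes 0xE2 0x80 0x74 0x6D, which are not valid UTF-8, so the helper gives up on it.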

While the hoop jumping could work for some documents, you will surely find many things that are even more poorly re-encoded. Your best bet is going to be conducting a byte-level search and replace operation, looking for incorrectly encoded sequences and replacing them with a plain-ASCII or properly encoded UTF-8 alternative. There are lots of ways that the encoding could be wrong. For example, if the corruption source was in the ISO-8859 family, the final corrupted sequence would have been different, or perhaps the final ™ might not be munched into t and m in certain places.
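
A small sketch of how the source charset changes the damage: the same three bytes of U+2019 are misread once as Windows-1252 and once as ISO-8859-1 (picked here as one member of the ISO-8859 family), then re-encoded as UTF-8. Variable names are illustrative only.

```php
<?php
// UTF-8 bytes of U+2019 RIGHT SINGLE QUOTATION MARK.
$quote = "\xE2\x80\x99";

// The same bytes, misread under two different single-byte charsets.
$viaCp1252 = iconv('Windows-1252', 'UTF-8', $quote);
$viaLatin1 = iconv('ISO-8859-1', 'UTF-8', $quote);

echo bin2hex($viaCp1252), "\n"; // c3a2e282ace284a2
echo bin2hex($viaLatin1), "\n"; // c3a2c280c299
```

In ISO-8859-1 the bytes 0x80 and 0x99 are control characters rather than € and ™, so a replacement list built only from Windows-1252 damage would miss this variant.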

A byte-level search and replace is guaranteed to affect only the incorrectly re-encoded sequences, and it carries no risk of munching on single-encoded UTF-8 characters that can't be represented in inferior character sets. It's safer and faster.
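
A byte-level search and replace along these lines could be sketched with strtr and a lookup table. The function name repairMojibake and the table contents are assumptions made for this example: the handful of entries covers only a few Windows-1252 cases and is nowhere near a complete list.

```php
<?php
// Hypothetical helper: replaces known corrupted byte sequences with the
// UTF-8 bytes of the character they were meant to be.
function repairMojibake(string $text): string
{
    // Illustrative table only; a real one would need many more entries.
    $replacements = [
        "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2" => "\xE2\x80\x99", // â€™  -> U+2019 ’
        "\xC3\xA2\xE2\x82\xACtm"           => "\xE2\x80\x99", // â€tm -> U+2019 ’
        "\xC3\xA2\xE2\x82\xAC\xC5\x93"     => "\xE2\x80\x9C", // â€œ  -> U+201C “
        "\xC3\xA2\xE2\x82\xAC\xE2\x80\x9C" => "\xE2\x80\x93", // â€“  -> U+2013 –
        "\xC3\x83\xC2\xA9"                 => "\xC3\xA9",     // Ã©   -> U+00E9 é
    ];

    // strtr() with an array tries the longest keys first and never
    // rescans text it has already replaced.
    return strtr($text, $replacements);
}

$input = "It\xC3\xA2\xE2\x82\xACtms a caf\xC3\x83\xC2\xA9";
echo bin2hex(repairMojibake($input)), "\n";
```

Because only the listed byte sequences are touched, correctly encoded text elsewhere in the string passes through unchanged.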

Edit: I totally didn't catch that you were already planning on doing this. ;) Unfortunately, I've never seen such a handy list. Perhaps you should publish and publicize your work so that others may benefit. yourcharacterencodingsucks.com is available!
