How to test an application for correct encoding (e.g. UTF-8)


Problem description

Encoding issues are among the topics that have bitten me most often during development. Every platform insists on its own encoding, and most likely some non-UTF-8 defaults are in play. (I usually work on Linux, which defaults to UTF-8; my colleagues mostly work on German Windows, which defaults to ISO-8859-1 or a similar Windows code page.)

I believe that UTF-8 is a suitable standard for developing an i18nable application. However, in my experience encoding bugs are usually discovered late (even though I'm located in Germany, where we have some special characters that, together with ISO-8859-1, produce detectable differences).

I believe that developers with a completely non-ASCII character set (or those who know a language that uses such a character set) have a head start in providing test data. But there must be a way to ease this for the rest of us as well.

What [technique|tool|incentive] are people here using? How do you get your co-developers to care about these issues? How do you test for compliance? Are those tests conducted manually or automatically?

Adding one possible answer upfront:

I've recently discovered fliptitle.com (it provides an easy way to get text written "uʍop ǝpısdn" *), and I'm planning to use it to produce easily verifiable UTF-8 test strings (as most of the characters used there sit at unusual positions in the encoding), but there surely must be more systematic tests, patterns or techniques for ensuring UTF-8 compatibility/usage.
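
A minimal sketch of how such strings could drive an automated round-trip check, written in Python. The names `SAMPLES` and `round_trip` are hypothetical, and the temporary file is only a stand-in for whatever I/O boundary the application really has (database, HTTP form, config file).

```python
import os
import tempfile

# Sample strings whose characters need 2, 3 and 4 bytes in UTF-8.
# The selection is an assumption for illustration, not an exhaustive set.
SAMPLES = ["uʍop ǝpısdn", "Grüße aus Köln", "日本語", "😀"]


def round_trip(text, write_encoding="utf-8", read_encoding="utf-8"):
    """Write text to a temporary file and read it back with the given encodings."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # newline="" disables newline translation so the bytes pass through unchanged
        with open(path, "w", encoding=write_encoding, newline="") as f:
            f.write(text)
        with open(path, "r", encoding=read_encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)


def find_broken_samples(read_encoding="utf-8"):
    """Return the samples that do not survive the round trip unchanged."""
    return [s for s in SAMPLES if round_trip(s, read_encoding=read_encoding) != s]
```

With matching encodings `find_broken_samples()` comes back empty. Reading the same bytes as ISO-8859-1 never raises an error but changes every sample, which is the kind of silent mismatch that pure-ASCII test data would not reveal.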

Note: Even though there's an accepted answer, I'd like to know of more techniques and patterns if there are some. Please add more answers if you have more ideas. And it has not been easy choosing only one answer for acceptance. I've chosen the regexp answer for the least expected angle to tackle the problem although there would be reasons to choose other answers as well. Too bad only one answer can be accepted.

Thank you for your input.

*) that's "upside down" written "upside down" for those that cannot see those characters due to font problems

Solution

There is a regular expression to test if a string is valid UTF-8:

$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
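
For comparison, here is a sketch of the same check in Python. It is an illustrative port, not part of the original pattern's source. The function names are hypothetical, and the second function relies on Python's built-in strict UTF-8 decoder instead of a regex.

```python
import re

# Byte-level port of the pattern above. re.VERBOSE ignores the layout
# whitespace and allows the trailing comments; \Z anchors at the very end.
UTF8_PATTERN = re.compile(rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
      | \xE0[\xA0-\xBF][\x80-\xBF]           # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]           # excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*\Z
""", re.VERBOSE)


def is_valid_utf8_regex(data: bytes) -> bool:
    """True if data matches the pattern (printable ASCII, tab, LF, CR, multi-byte)."""
    return UTF8_PATTERN.match(data) is not None


def is_valid_utf8_decode(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

The two functions agree on overlong forms, surrogates and stray high bytes. They differ on control characters: the pattern's ASCII branch only admits tab, LF, CR and 0x20-0x7E, so a NUL byte fails the regex even though it is well-formed UTF-8 and decodes fine.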

But this doesn’t ensure that the text actually is UTF-8.

An example: the UTF-8 byte sequence for the letter ö (U+00F6) is 0xC3 0xB6.
So when you get 0xC3 0xB6 as input, you can say that it is valid UTF-8. But you cannot say for certain that the letter ö has been submitted.
That is because the sender may have used ISO 8859-1 instead of UTF-8. In ISO 8859-1, the bytes 0xC3 and 0xB6 represent the characters Ã and ¶ respectively.
So the sequence 0xC3 0xB6 can represent either ö in UTF-8 or Ã¶ in ISO 8859-1 (although the latter is rather unusual).
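
The ambiguity can be reproduced in a few lines of Python. This is only a sketch: `plausible_encodings` is a hypothetical helper, and the candidate list is an assumption limited to the two encodings discussed here.

```python
raw = bytes([0xC3, 0xB6])  # the two bytes from the example above

as_utf8 = raw.decode("utf-8")         # one character: 'ö' (U+00F6)
as_latin1 = raw.decode("iso-8859-1")  # two characters: 'Ã¶' (U+00C3, U+00B6)


def plausible_encodings(data, candidates=("utf-8", "iso-8859-1")):
    """Return every candidate encoding under which data decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok
```

For `raw` both candidates succeed, so decoding alone cannot tell which one the sender meant. ISO 8859-1 assigns a character to every byte value, so it never fails to decode and can never be ruled out this way.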

So in the end it’s only guessing.
