Is it safe to use random Unicode for complex delimiter sequences in strings?


Problem Description



Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use characters like ¦ or § for complex delimiter sequences in strings? Can I reliably trust that a program reading these won't misinterpret them?


I am working on a system, written in C#, in which I have to store a fairly complex set of information within a single string. The readability of this string only matters on the computer side; end users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify which parts of the string correspond to a particular tier of organization. There are enough cases that the standard set of ;, |, and characters of similar ilk has been exhausted. I considered two-character delimiters, such as ;# or ;|, but that felt inefficient. There probably isn't a large performance difference between storing one character versus two, but when the smaller option is available, picking the larger one just feels wrong.
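The tiered-delimiter scheme described above can be sketched as a simple join/split round trip. This is an illustrative sketch in Python rather than the author's C# code, and the choice of § as the record separator and ‡ as the field separator is an assumption for demonstration:

```python
# Hypothetical sketch: two tiers of variable-size collections packed into one
# string, using single-character Unicode delimiters that are assumed never to
# appear in the payload text itself.
RECORD_SEP = "\u00a7"  # SECTION SIGN (assumed outer delimiter)
FIELD_SEP = "\u2021"   # DOUBLE DAGGER (assumed inner delimiter)

def encode(records):
    """Pack a list of field lists into a single delimited string."""
    return RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)

def decode(text):
    """Recover the list of field lists from the delimited string."""
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]

data = [["alice", "admin"], ["bob", "user", "audit"]]
packed = encode(data)
assert decode(packed) == data  # round trip is lossless
```

A real implementation would also want to reject (or escape) payload text that happens to contain a delimiter, rather than relying purely on the assumption that it cannot occur.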

So finally, I considered using characters like the double dagger (‡) and the section sign (§). They each take up only one char, and they are definitely never going to show up in the actual text I'll be storing, so they can't be confused with anything.

But character encoding is finicky. While visibility to the end user is irrelevant (since they, in fact, won't see it), I recently became concerned about how the programs in the system will read these strings. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. And if something that is expected to be written one way is possibly written another, then the whole system could fail, and I really can't let that happen. So is it safe to use these kinds of characters as background delimiters?

Solution

There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.

The main characters that can be altered in a text transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.

After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.
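The effect of canonical normalization can be observed directly with Python's standard `unicodedata` module (shown here in Python for illustration, outside the question's C# context): a combining sequence is rewritten by NFC, while a character with no canonical decomposition, such as the section sign, passes through untouched.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT composes to the single code point
# U+00E9 under NFC, and U+00E9 splits back apart under NFD.
decomposed = "e\u0301"
assert unicodedata.normalize("NFC", decomposed) == "\u00e9"
assert unicodedata.normalize("NFD", "\u00e9") == decomposed

# SECTION SIGN has no canonical decomposition, so neither canonical form
# changes it -- this is what makes it a stable delimiter candidate.
for form in ("NFC", "NFD"):
    assert unicodedata.normalize(form, "\u00a7") == "\u00a7"
```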

That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.

Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.
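Checking the Unicode Character Database properties mentioned above is straightforward in Python's `unicodedata` (again an illustrative sketch, not C#): a symbol with a compatibility decomposition, like NUMERO SIGN, is rewritten by NFKC, whereas the section sign and double dagger have no decomposition at all and survive every normalization form.

```python
import unicodedata

# NUMERO SIGN (U+2116) carries a compatibility decomposition, so NFKC
# rewrites it into the letters "No" -- exactly the kind of silent change
# that would corrupt a delimiter.
assert unicodedata.normalize("NFKC", "\u2116") == "No"

# SECTION SIGN and DOUBLE DAGGER have an empty decomposition mapping in
# the Unicode Character Database, so even NFKC leaves them unchanged.
for ch in ("\u00a7", "\u2021"):
    assert unicodedata.decomposition(ch) == ""
    assert unicodedata.normalize("NFKC", ch) == ch
```

The same check (an empty result from `unicodedata.decomposition`) is a quick way to vet any other candidate delimiter character before committing to it.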

In the Unicode Code Charts, canonical decompositions are indicated by "≡" and compatibility decompositions are indicated by "≈".
