Design decision: Matching cyrillic chars in JSON with PHP

Problem description

I'm developing a plugin for a CMS and have an unanticipated problem: because the plugin is multilang-enabled, input can be of any of the unicode character sets. The plugin saves data in json format, and contains objects with properties value and lookup. For value everything is fine, but the lookup property is used by PHP to retrieve these entities, and at certain points through regexes (content filters). The problems are:


  1. For non-latin characters (eg. Экспорт), the \w (word-char) in a regex matches nothing. Is there any way to recognize cyrillic chars as word chars? Any other hidden catches?
  2. The data format being JSON, non-latin characters are converted to JS unicodes, eg for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)

And the big 'design' question I have stems from the previous 2 problems:

Should I either allow users with non-Latin alphabet languages to use their own chars for the lookup properties, or should I force them to traditional 'word' chars, that is a, b, c etc. + underscore (thus an alphabet from another language)? I'd welcome technical advice to guide this decision (not a UX one).

Recommended answer

First question


For non-latin characters (eg. Экспорт), the \w (word-char) in a regex matches nothing. Is there any way to recognize cyrillic chars as word chars? Any other hidden catches?

You simply need to turn on the u modifier:

preg_match("#^\w+$#u", $str);

Demo (ideone)
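
For instance, a minimal sketch of the difference (the Cyrillic sample string comes from the question; the expected results are shown as comments, assuming a PHP build with Unicode support):

$str = "Экспорт";

// Without the u modifier \w only covers ASCII, so nothing matches.
var_dump(preg_match('#^\w+$#', $str));   // int(0)

// With the u modifier the pattern and subject are treated as UTF-8
// and Cyrillic letters are classified as word characters.
var_dump(preg_match('#^\w+$#u', $str));  // int(1)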

The PHP documentation is misleading here:


u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

I say it's misleading because from the ideone test above, it not only enables PCRE_UTF8 but also PCRE_UCP (Unicode Character Properties) which is the behavior you want here.

Here's what the PCRE docs say about it:


PCRE_UTF8
This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

If you want to make it obvious at first sight that the PCRE_UCP flag will be set, you can insert it into the pattern itself, at the start, like this:

preg_match("#(*UCP)^\w+$#u", $str);

Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.

Second question


The data format being JSON, non-latin characters are converted to JS unicodes, eg for the above: \u042D\u043A\u0441\u043F\u043E\u0440\u0442. Is it safe not to do this? (server restrictions etc.)

It's safe not to do this as long as your Content-Type header defines the right encoding.

So you may want to use something like:

header('Content-Type: application/json; charset=utf-8');

And make sure you actually send it in UTF8.

However, encoding these characters in escape sequences makes the whole thing ASCII compatible, so you basically eliminate the problem altogether in this way.
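
To illustrate both options, here is a small sketch (the entity data is hypothetical; JSON_UNESCAPED_UNICODE requires PHP 5.4+):

$data = array('value' => 'Экспорт', 'lookup' => 'Экспорт');  // hypothetical entity

// Default: non-ASCII characters become \uXXXX escapes, so the output is pure ASCII.
echo json_encode($data), "\n";
// {"value":"\u042d\u043a\u0441\u043f\u043e\u0440\u0442","lookup":"\u042d\u043a\u0441\u043f\u043e\u0440\u0442"}

// JSON_UNESCAPED_UNICODE keeps the characters as raw UTF-8, which is fine
// as long as the charset header shown above is sent.
echo json_encode($data, JSON_UNESCAPED_UNICODE), "\n";
// {"value":"Экспорт","lookup":"Экспорт"}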


Should I either allow users with non-Latin alphabet languages to use their own chars for the lookup properties, or should I force them to traditional 'word' chars, that is a, b, c etc. + underscore (thus an alphabet from another language)? I'd welcome technical advice to guide this decision (not a UX one).

Technically, as long as your whole stack supports Unicode (Browser, PHP, Database etc) I see no problem with this approach. Just make sure to test it well and to use Unicode-enabled column types in your DB.
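
With MySQL, for example, that typically means utf8mb4 columns plus a matching connection charset. A sketch only, with placeholder DSN and credentials:

// utf8mb4 covers the full Unicode range (MySQL's older utf8 type is limited
// to 3-byte characters), so lookups like "Экспорт" round-trip intact.
$pdo = new PDO(
    'mysql:host=localhost;dbname=cms;charset=utf8mb4',   // placeholder DSN
    'user',                                               // placeholder credentials
    'secret',
    array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION)
);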

Be careful, PHP is a terrible language for string support, so you have to make sure you use the right functions (avoid non-Unicode aware ones like strlen etc unless you really want the byte count).
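
For example (a sketch assuming the mbstring extension is available):

$lookup = "Экспорт";

// strlen() counts bytes: these 7 Cyrillic characters take 14 bytes in UTF-8.
echo strlen($lookup), "\n";              // 14

// mb_strlen() counts characters once the encoding is given.
echo mb_strlen($lookup, 'UTF-8'), "\n";  // 7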

It may be a bit more work to make sure everything works like it's supposed to, but if that's something you want to support there's no problem with that.
