Will [a-z] ever match accented characters in PREG/PCRE?

Question

I'm already aware that \w in PCRE (particularly PHP's implementation) can sometimes match some non-ASCII characters depending on the locale of the system, but what about [a-z]?

I wouldn't think so, but I noticed these lines in one of Drupal's core files (includes/theme.inc, simplified):

// To avoid illegal characters in the class,
// we're removing everything disallowed. We are not using 'a-z' as that might leave
// in certain international characters (e.g. German umlauts).
$body_classes[] = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

Is this true, or did someone simply get [a-z] confused with \w?

Recommended answer

Long story short: maybe. It depends on the system the app is deployed to and on how PHP was compiled. Welcome to the CF of localization and internationalization.

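The underlying PCRE engine takes locale into account when determining what "a-z" means. In a Spanish-based locale, ñ would be caught by a-z. The semantic meaning of a-z is "all the letters between a and z", and ñ is a separate letter in Spanish.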

However, because PHP blindly handles strings as collections of bytes rather than as collections of Unicode code points, you can end up in a situation where a-z MIGHT match an accented character. Given the variety of different systems Drupal gets deployed to, it makes sense that they would choose to be explicit about the allowed characters rather than just trust a-z to do the right thing.
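
To make the byte-oriented handling concrete, here is a minimal sketch that runs the Drupal pattern and an a-z variant over the same input. It assumes UTF-8 input and no /u modifier; the sample string and the variable names are made up for illustration and are not taken from Drupal.

```php
<?php
// Hypothetical sample input: "müller-straße" encoded as UTF-8.
// Written with byte escapes so the source file encoding does not matter.
$class = "m\xC3\xBCller-stra\xC3\x9Fe";

// The explicit alphabet, copied from the Drupal snippet in the question.
$explicit = preg_replace('![^abcdefghijklmnopqrstuvwxyz0-9-_]+!s', '', $class);

// The same filter written with a range (an assumed equivalent, for comparison).
$ranged = preg_replace('![^a-z0-9_-]+!s', '', $class);

// Without /u the subject is scanned byte by byte. The two bytes that make up
// "ü" (0xC3 0xBC) and "ß" (0xC3 0x9F) all lie outside 0x61..0x7A, so both
// patterns strip them.
echo $explicit, "\n"; // mller-strae
echo $ranged, "\n";   // mller-strae
```

On this input the two patterns give the same result. That is only one data point, so it does not settle how a-z behaves on every build and locale.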

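I'd also venture a guess that the presence of this particular regex is the result of a bug report being filed about German umlauts not being filtered.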

Update in 2014: Per JimmiTh's answer, it looks like (despite some "confusing-to-non-pcre-core-developers" documentation) [a-z] will only match the characters abcdefghijklmnopqrstuvwxyz a proverbial 99% of the time. That said, framework developers tend to get twitchy about vagueness in their code, especially when the code relies on systems (locale-specific strings) that PHP doesn't handle as gracefully as you'd like, and on servers the developers have no control over. The anonymous Drupal developer's comment is incorrect, but it wasn't a matter of "getting [a-z] confused with \w". Rather, a Drupal developer was unsure how PCRE handled [a-z] and chose the more specific form abcdefghijklmnopqrstuvwxyz to ensure the specific behavior they wanted.
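
If you would rather check this on your own server than trust the documentation, one rough way is to try every possible byte value against the class. This is a sketch only: bytes_matching() is a hypothetical helper written for this example, and it covers single bytes in non-UTF-8 mode without the /i modifier, under whatever locale the script happens to run in.

```php
<?php
// Hypothetical helper: returns every single byte (0..255) that the
// given pattern accepts, concatenated into one string.
function bytes_matching(string $pattern): string
{
    $matched = '';
    for ($byte = 0; $byte <= 255; $byte++) {
        $char = chr($byte);
        if (preg_match($pattern, $char) === 1) {
            $matched .= $char;
        }
    }
    return $matched;
}

// \A and \z anchor the whole subject, so exactly one byte must match.
$range_matches = bytes_matching('/\A[a-z]\z/');

echo $range_matches, "\n";
echo strlen($range_matches), " bytes matched\n";
```

If the output is anything other than the 26 ASCII lowercase letters, that build is in the remaining 1% and the explicit alphabet is the safer choice.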
