匹配任何语言的字母 [英] Match letter in any language

查看:45
本文介绍了匹配任何语言的字母的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如何在 python 3 中使用正则表达式匹配来自任何语言的字母?

How can I match a letter from any language using a regex in python 3?

re.match([a-zA-Z]) 将匹配英语字符,但我希望同时支持所有语言.

re.match([a-zA-Z]) will match the english language characters but I want all languages to be supported simultaneously.

我不希望匹配 can't 中的 ' 或下划线或任何其他类型的格式.我希望我的正则表达式匹配:cantÅé.

I don't wish to match the ' in can't or underscores or any other type of formatting. I do wish my regex to match: c, a, n, t, Å, é, and .

推荐答案

对于 Python 中的 Unicode 正则表达式工作,我强烈推荐以下内容:

For Unicode regex work in Python, I very strongly recommend the following:

  1. 使用 Matthew Barnett 的 regex 而不是标准re,这不太适合 Unicode 正则表达式.
  2. 仅使用 Python 3,切勿使用 Python 2.您希望所有字符串都是 Unicode 字符串.
  3. 仅使用带有逻辑/抽象 Unicode 代码点的字符串文字,而不是编码的字节字符串.
  4. 在您的流上设置您的编码并忘记它.如果您发现自己曾经手动调用 .encode 等,那么您几乎肯定做错了.
  5. 仅使用代码点和代码单元相同的广泛构建,永远不要使用狭窄构建 - 您最好考虑弃用 Unicode 稳健性.
  6. 在输入时将所有传入字符串标准化为 NFD,然后在输出时标准化为 NFC.否则,您将无法获得可靠的行为.
  1. Use Matthew Barnett’s regex library instead of standard re, which is not really suitable for Unicode regular expressions.
  2. Use only Python 3, never Python 2. You want all your strings to be Unicode strings.
  3. Use only string literals with logical/abstract Unicode codepoints, not encoded byte strings.
  4. Set your encoding on your streams and forget about it. If you find yourself ever manually calling .encode and such, you’re almost certainly doing something wrong.
  5. Use only a wide build where code points and code units are the same, never ever ever a narrow one — which you might do well to consider deprecated for Unicode robustness.
  6. Normalize all incoming strings to NFD on the way in and then NFC on the way out. Otherwise you can’t get reliable behavior.

一旦你这样做,你就可以安全地编写包含 \w\p{script=Latin}\p{alpha} 的模式code> 和 \p{lower} 等,并且知道这些都会做 Unicode 标准说他们应该.我更详细地解释了 Python Unicode 正则表达式业务的所有这些业务 在这个答案中.简短的故事是始终使用 regex 而不是 re.

Once you do this, you can safely write patterns that include \w or \p{script=Latin} or \p{alpha} and \p{lower} etc and know that these will all do what the Unicode Standard says they should. I explain all of this business of Python Unicode regex business in much more detail in this answer. The short story is to always use regex not re.

对于一般的 Unicode 建议,我还有 上次 OSCON 的几场演讲关于 Unicode 常规表达式,除了第 3 次演讲之外,大部分内容都不是关于 Python,但其中大部分内容是可调整的.

For general Unicode advice, I also have several talks from last OSCON about Unicode regular expressions, most of which apart from the 3rd talk alone is not about Python, but much of which is adaptable.

最后,总会有这个答案 把对上帝(或至少是对 Unicode)的敬畏放在心中.

Finally, there’s always this answer to put the fear of God (or at least, of Unicode) in your heart.

这篇关于匹配任何语言的字母的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆