日语正则表达式(regex) [英] Regular expressions (regex) in Japanese

查看:1291
本文介绍了日语正则表达式(regex)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在学习英语的正则表达式(regex),尽管某些概念似乎适用于日文等其他语言,但我觉得很多其他概念都不适用.例如,正则表达式的常见用法是查找单词是否具有非字母数字字符.我不知道这种技术以及其他技术如何在日语中使用,因为不仅有三种书写系统,而且汉字也非常复杂,并且其范围比字母数字字符大得多.尽管我参加了许多日语课程,但我对这个主题的了解很少,所以我希望能获得与该主题有关的任何信息以及需要深入研究的领域.如果可能的话,我希望您能使用python和Java,因为这些是我很喜欢的语言.谢谢您的帮助.

I am learning about Regular expressions (regex) for English and although some of the concepts seem like they would apply to other languages such as Japanese, I feel as if many others would not. For example, a common use of regex is to find if a word has non alphanumeric characters. I don't see how this technique as well as others would work for Japanese as there are not only three writing systems, but kanji are also very complex and span a much greater range than alpha numeric characters do. I would appreciate any information on this topic as well as areas to look into more as I have very little knowledge on the subject although I have taken many Japanese courses. If at all possible, I would like your answers to use python and Java as those are the languages I am comfortable with. Thank you for your help.

推荐答案

Python regexes对Unicode功能提供了有限的支持. Java更好,特别是Java 7.

Python regexes offer limited support for Unicode features. Java is better, particularly Java 7.

Java支持Unicode类别.例如,\p{L}(及其缩写,\pL)与任何语言的任何字母匹配.这包括日语表意字符.

Java supports Unicode categories. E.g., \p{L} (and its shorthand, \pL) matches any letter in any language. This includes Japanese ideographic characters.

Java 7支持Unicode脚本,包括通常由日语文本组成的平假名,片假名,韩语和拉丁语脚本.您可以使用\p{Han}\p{Hiragana}\p{Katakana}\p{Latin}在这些脚本之一中匹配任何字符.您可以将它们组合为字符类,例如[\p{Han}\p{Hiragana}\p{Katakana}].您可以使用大写字母P(如\P{Han})来匹配除Han脚本中的字符以外的任何字符.

Java 7 supports Unicode scripts, including the Hiragana, Katakana, Han, and Latin scripts that Japanese text is typically composed of. You can match any character in one of these scripts using \p{Han}, \p{Hiragana}, \p{Katakana}, and \p{Latin}. You can combine them in a character class such as [\p{Han}\p{Hiragana}\p{Katakana}]. You can use an uppercase P (as in, \P{Han}) to match any character except those in the Han script.

Java 7支持Unicode块.除非在Android中运行代码(脚本不可用),否则通常应避免使用块,因为它们不如Unicode脚本有用和准确.与日语文本相关的块有很多,包括\p{InHiragana}\p{InKatakana}\p{InCJK_Unified_Ideographs}\p{InCJK_Symbols_and_Punctuation}等.

Java 7 supports Unicode blocks. Unless running your code in Android (where scripts are not available), you should generally avoid blocks, since they are less useful and accurate than Unicode scripts. There are a variety of blocks related to Japanese text, including \p{InHiragana}, \p{InKatakana}, \p{InCJK_Unified_Ideographs}, \p{InCJK_Symbols_and_Punctuation}, etc.

Java和Python都可以使用\uFFFF引用各个代码点,其中FFFF是任何四位数的首字母数字. Java 7可以使用以下方式引用任何Unicode代码点,包括基本多语言平面之外的代码点. \x{10FFFF}. Python正则表达式不支持21位Unicode,但Python字符串支持,因此您可以使用以下代码在正则表达式中嵌入代码点: \U0010FFFF(大写的U后跟八个十六进制数字).

Both Java and Python can refer to individual code points using \uFFFF, where FFFF is any four-digit headecimal number. Java 7 can refer to any Unicode code point, including those beyond the Basic Multilingual Plane, using e.g. \x{10FFFF}. Python regexes don't support 21-bit Unicode, but Python strings do, so you can embed a a code point in a regex using e.g. \U0010FFFF (uppercase U followed by eight hex digits).

Java 7 (?U)UNICODE_CHARACTER_CLASS标志使字符类速记(例如\w\d)能够识别Unicode,因此它们将匹配日文表意字符等(但请注意,\d仍将不匹配)汉字(例如一二三四). Python 3默认使速记类能够识别Unicode.在Python 2中,当您使用re.UNICODEre.U标志时,速记类可识别Unicode.

The Java 7 (?U) or UNICODE_CHARACTER_CLASS flag makes character class shorthands like \w and \d Unicode aware, so they will match Japanese ideographic characters, etc. (but note that \d will still not match kanji for numbers like 一二三四). Python 3 makes shorthand classes Unicode aware by default. In Python 2, shorthand classes are Unicode aware when you use the re.UNICODE or re.U flag.

您是对的,并不是所有的正则表达式都可以很好地适用于所有脚本.有些东西(例如字母大写)对于日语文本是没有意义的.

You're right that not all regex ideas carry over equally well to all scripts. Some things (such as letter casing) just don't make sense with Japanese text.

这篇关于日语正则表达式(regex)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆