Regular Expression Doesn't Work Properly With Turkish Characters
Problem description
I wrote a regex that should extract the following patterns:
- çççoookkk gggüüüzzzeeelll (this means vvveeerrryyy gggoooddd, written with the Turkish characters ç and ü)
- ccccoookkk ggguuuzzzeeelll (the same phrase, written with the English characters c and u)
Here are the regular expressions I'm trying:
- "\b[çc]+o+k+\sg+[üu]+z+e+l+\b": this works with the English characters but not with the Turkish ones.
- "çok": finds "çok", but "ç+o+k+" doesn't work for "çççoookkk"; it finds "çoookkk".
- "güzel": finds "güzel", but "g+ü+z+e+l+" doesn't work for "gggüüüzzzeeelll".
- "\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly.
- "[çc]ok\sg[uü]zel": I also tried this to match the "çok güzel" pattern, but it doesn't work either.
I think the problem might be in using regex operators with Turkish characters. I don't know how to solve this.
I am using http://www.myregextester.com to check whether my regular expressions are correct.
I am using the PHP programming language to extract a specific pattern from tweets returned by a search through the Twitter REST API.
Thanks,
Recommended answer
You have not specified which programming language you are using, but in many of them the \b word-boundary assertion only works with plain ASCII text.
Internally, \b is processed as a boundary between the \w and \W sets. In turn, in such engines \w is equal to [a-zA-Z0-9_], so ç and ü count as non-word characters and no boundary is found in front of a word that starts with ç.
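The effect can be reproduced with a short sketch. It uses Python rather than the asker's language, purely for illustration: Python's re.ASCII flag restricts \w (and therefore \b) to ASCII, which imitates an engine without Unicode word boundaries. The variable names are made up for this example.

```python
import re

# The first pattern from the question, unchanged.
PATTERN = r"\b[çc]+o+k+\sg+[üu]+z+e+l+\b"

# re.ASCII makes \w, and therefore \b, ASCII-only (imitating such engines).
ascii_re = re.compile(PATTERN, re.ASCII)
# Python 3 str patterns are Unicode-aware by default: ç and ü are word characters.
unicode_re = re.compile(PATTERN)

turkish = "çççoookkk gggüüüzzzeeelll"
english = "cccoookkk ggguuuzzzeeelll"

print(ascii_re.search(english) is not None)    # True
print(ascii_re.search(turkish) is not None)    # False: no \b in front of ç
print(unicode_re.search(turkish) is not None)  # True
```

With the ASCII-only definition, the position before the first ç has a non-word character on both sides, so \b cannot match there and the whole pattern fails.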
If you are not using any fancy space marks (you shouldn't), then consider using the regular whitespace character class (\s) instead.
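One way to follow this advice is to delimit the phrase by whitespace rather than by \b. The sketch below is one possible reading of the suggestion, not the answerer's exact pattern: it uses the lookarounds (?<!\S) and (?!\S), which assume an engine that supports lookbehind, and find_phrase is a hypothetical helper name. re.ASCII is kept on to show that the result no longer depends on a Unicode-aware \w.

```python
import re

# (?<!\S) and (?!\S) mean "not adjacent to a non-whitespace character",
# so the phrase must be delimited by whitespace or by the text boundaries.
PHRASE_RE = re.compile(r"(?<!\S)[çc]+o+k+\s+g+[üu]+z+e+l+(?!\S)", re.ASCII)

def find_phrase(text):
    """Hypothetical helper: return the matched phrase, or None."""
    match = PHRASE_RE.search(text)
    return match.group() if match else None

print(find_phrase("bu film çççoookkk gggüüüzzzeeelll olmuş"))
print(find_phrase("ccccoookkk ggguuuzzzeeelll"))
print(find_phrase("xçok güzel"))  # None: "x" directly precedes the phrase
```

A limitation of this approach is that punctuation attached to the phrase (for example "güzel!") also blocks the match, so the delimiters may need loosening for real tweets.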
See this table (scroll down to the Word Boundaries section) to check whether your language supports Unicode for \b. If it says "ascii", then it does not.
As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
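A minimal Python sketch of the code-point idea follows. The first part is an assumption rather than a confirmed diagnosis: it simulates an engine that sees UTF-8 input as raw bytes, which would be consistent with the "finds çoookkk" symptom in the question. The second part matches by code point (U+00E7 is ç, U+00FC is ü); find_cok_guzel is a hypothetical helper, and the NFC normalization step is an extra precaution that the answer does not mention.

```python
import re
import unicodedata

text = "çççoookkk gggüüüzzzeeelll"

# Simulated byte-oriented matching (an assumption): in UTF-8, "ç" is the two
# bytes C3 A7, so "+" repeats only the second byte, not the whole letter.
byte_match = re.search("ç+o+k+".encode("utf-8"), text.encode("utf-8"))
print(byte_match.group().decode("utf-8"))  # çoookkk

# Code-point matching: \u00e7 is ç and \u00fc is ü.
CODE_POINT_RE = re.compile(r"[\u00e7c]+o+k+\s+g+[\u00fcu]+z+e+l+")

def find_cok_guzel(s):
    """Hypothetical helper: NFC-normalize, then search by code point."""
    normalized = unicodedata.normalize("NFC", s)
    match = CODE_POINT_RE.search(normalized)
    return match.group() if match else None

print(find_cok_guzel(text))  # çççoookkk gggüüüzzzeeelll
# Decomposed input (c + combining cedilla, u + combining diaeresis) also works.
print(find_cok_guzel("c\u0327ok gu\u0308zel"))  # çok güzel
```

Writing the escapes out keeps the pattern independent of the source file's encoding, which is one reason the side note may be worth following.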
See also: utf-8 word boundary regex in javascript
Further reading:
- An excellent article about using Unicode characters in regular expressions (http://www.regular-expressions.info/unicode.html)
- An article about word boundaries
- List of Turkish Unicode code points