Regular Expression Doesn't Work Properly With Turkish Characters

Question

I am writing a regex that should extract the following patterns:

  • çççoookkk gggüüüzzzeeelll (this means "vvveeerrryyy gggoooddd", written with the Turkish characters ç and ü)
  • ccccoookkk ggguuuzzzeeelll (the same phrase, written with the English characters c and u)

Here are the regular expressions I am trying:

  • "\b[çc]+o+k+\sg+[üu]+z+e+l+\b": this works with English characters but not with Turkish characters.
  • "çok": finds "çok", but "ç+o+k+" does not work for "çççoookkk"; it finds "çoookkk".
  • "güzel": finds "güzel", but "g+ü+z+e+l+" does not work for "gggüüüzzzeeelll".
  • "\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": does not work properly.
  • "[çc]ok\sg[uü]zel": I also tried this to match the "çok güzel" pattern, but it does not work either.

I think the problem might be using regex operators with Turkish characters. I don't know how to solve this.

I am using http://www.myregextester.com to check whether my regular expressions are correct.

I am using the PHP programming language to get a specific pattern from searched tweets via the Twitter REST API.

Thanks,

Recommended answer

You have not specified which programming language you are using, but in many of them the \b word-boundary assertion only works with plain ASCII characters.

Internally, \b is processed as a boundary between the \w and \W sets.
In such engines, \w is in turn equal to [a-zA-Z0-9_], so ç and ü count as non-word characters.
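
The effect of an ASCII-only \w on \b can be reproduced with a small sketch. It is written in Python rather than PHP purely for illustration, and it assumes that Python's `re.ASCII` flag is a fair stand-in for an engine whose \w is [a-zA-Z0-9_]. The pattern is the first one from the question; the variable names are made up for this example.

```python
import re

# First pattern from the question.
PATTERN = r"\b[çc]+o+k+\sg+[üu]+z+e+l+\b"

turkish = "çççoookkk gggüüüzzzeeelll"
english = "cccoookkk ggguuuzzzeeelll"

# re.ASCII restricts \w (and therefore \b) to [a-zA-Z0-9_].
# This is an assumption used to imitate an ASCII-only engine.
ascii_re = re.compile(PATTERN, re.ASCII)

# Without the flag, Python str patterns use Unicode-aware \w and \b.
unicode_re = re.compile(PATTERN)

# 'ç' is a non-word character in ASCII mode, so there is no boundary
# between the start of the string and the first 'ç': no match.
ascii_turkish = ascii_re.search(turkish)

# The all-ASCII spelling is unaffected.
ascii_english = ascii_re.search(english)

# With Unicode-aware classes, 'ç' and 'ü' are word characters.
unicode_turkish = unicode_re.search(turkish)

print(ascii_turkish, ascii_english is not None, unicode_turkish is not None)
```

In ASCII mode the only boundaries inside the Turkish text fall between a Turkish letter and an ASCII letter, which is never where the pattern wants them.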

If you are not using any fancy space marks (you shouldn't), then consider using the regular whitespace character class (\s) instead of \b.
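
One way to apply the whitespace idea is to replace each \b with a lookaround that only checks for surrounding whitespace. The sketch below is one possible reading of that advice, not the answerer's own code. It is in Python for illustration, `re.ASCII` is again an assumed stand-in for an ASCII-only engine, and the sample tweet is invented.

```python
import re

# Question's pattern with \b replaced by whitespace-based lookarounds:
#   (?<!\S)  the match is not preceded by a non-whitespace character
#   (?!\S)   the match is not followed by a non-whitespace character
PATTERN = r"(?<!\S)[çc]+o+k+\sg+[üu]+z+e+l+(?!\S)"

# re.ASCII imitates an engine whose \w, \b and \s are ASCII-only.
# The lookarounds do not depend on \w, so Turkish letters are fine.
space_delimited = re.compile(PATTERN, re.ASCII)

tweet = "bu film çççoookkk gggüüüzzzeeelll olmuş"  # hypothetical input
found = space_delimited.findall(tweet)

# Glued to another letter, the phrase is rejected.
glued = space_delimited.search("xçok güzel")

# Trade-off: adjacent punctuation also blocks the match.
with_punctuation = space_delimited.search("çok güzel!")

print(found, glued, with_punctuation)
```

The trade-off is that only whitespace and the string edges count as delimiters, so a phrase followed directly by punctuation is skipped unless the lookahead is widened to allow it.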

See this table (scroll down to the Word Boundaries section) to check whether your language supports Unicode for \b. If it says "ascii", then it does not.

As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
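
A likely reason code points matter: in UTF-8, ç is the two bytes C3 A7, so an engine that matches byte by byte applies `+` to the trailing byte only. The Python sketch below imitates such an engine with a bytes pattern, which is an assumption made for illustration, and then shows the same pattern written with code-point escapes. In PHP, the `u` pattern modifier of the preg_* functions is the usual way to request UTF-8 handling; that is not shown here.

```python
import re

text = "\u00e7" * 3 + "oookkk"  # "çççoookkk"

# Byte-oriented matching: the pattern becomes C3 A7+ o+ k+, so the
# quantifier repeats only the A7 byte and just one 'ç' can be matched.
byte_pattern = "\u00e7+o+k+".encode("utf-8")
byte_match = re.search(byte_pattern, text.encode("utf-8"))
byte_result = byte_match.group(0).decode("utf-8")

# Code-point matching: \u00e7 is a single character, so '+' repeats it.
cp_match = re.search(r"\u00e7+o+k+", text)
cp_result = cp_match.group(0)

# The full phrase with both spellings, using escapes for ç and ü.
phrase = re.compile(r"[\u00e7c]+o+k+\sg+[\u00fcu]+z+e+l+")
full = phrase.fullmatch("\u00e7\u00e7\u00e7oookkk ggg\u00fc\u00fc\u00fczzzeeelll")

print(byte_result, cp_result, full is not None)
```

The byte-oriented result is the same "çoookkk" that the question reports for "ç+o+k+", which suggests the tester was not treating the input as UTF-8.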

See also: utf-8 word boundary regex in javascript

Further reading:

  • An excellent article about using Unicode characters in regular expressions
  • An article for word boundaries
  • List of Turkish Unicode code points
