VBA: match Greek words as whole words using a regular expression
Problem description
I am trying to match Greek characters with a regex pattern. VBA's Unicode support seems quite limited, but I can work with ASCII if possible. Here is some sample code:
Sub TestGreekRegEx()
    Dim str
    str = "αυτό είναι ένα ελληνικό κείμενο"
    Set regEx = CreateObject("vbscript.regexp")
    regEx.Pattern = "\b[\xe1-\xfe]+\b"
    Set Matches = regEx.Execute(str)
    For Each Match In Matches
        MsgBox Match
    Next
End Sub
This returns no matches at all. Also, if I loop over the characters of str, the character codes I get are all within the range \xE1 to \xFE.
Thanks
Recommended answer
Leaving the Unicode range of the Greek characters aside, you have another problem: \b. In the ECMAScript 5 standard, \b only recognizes boundaries around ASCII word characters ([A-Za-z0-9_]), so no word boundary is found next to a Greek letter.
Thus, whatever Greek word pattern works for you, [\u00E1-\u03CE]+ or [\xE1-\xFE]+, you won't get a match if you add \b on both ends.
So, what you need to do is build custom boundaries using a group (on the left) and a lookahead (on the right). To extract the words, you will need to access the .Submatches property of each match.
I do not have Greek language settings set for non-Unicode files, so let me imagine your word pattern is [\xE1-\xFE]+. Then your extraction regex will look like
(?:^|[^_0-9\xE1-\xFE])([\xE1-\xFE]+)(?![_0-9\xE1-\xFE])
With the [\u00E1-\u03CE]+ pattern, it will look like
(?:^|[^_0-9\u00E1-\u03CE])([\u00E1-\u03CE]+)(?![_0-9\u00E1-\u03CE])
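To decide which of the two ranges fits your data, it may help to print the character codes yourself. The sketch below (procedure name hypothetical) prints AscW, which returns the UTF-16 code unit, next to Asc, which returns the code in the system's ANSI code page. As far as I know the regex engine works on the Unicode string, so the AscW values are the ones to compare against the pattern, but treat that as an assumption to verify on your machine.

```vba
Sub DumpCharacterCodes()
    Dim s As String
    Dim i As Long

    ' Alpha and omega, built from code points (U+03B1, U+03C9).
    s = ChrW(&H3B1) & ChrW(&H3C9)

    For i = 1 To Len(s)
        ' First column: UTF-16 code unit in hex (3B1 and 3C9 here).
        ' Second column: ANSI code, which depends on the system locale.
        Debug.Print Hex(AscW(Mid$(s, i, 1))), Asc(Mid$(s, i, 1))
    Next i
End Sub
```

On a system with a Greek ANSI code page the second column is expected to fall in the \xE1 to \xFE range mentioned in the question, while the first column stays the same everywhere.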
Note that I am imitating \b word boundaries with (?:^|[^_0-9\xE1-\xFE]) on the left (it matches the start of the string or any character except _, a digit, or a letter from your character range) and (?![_0-9\xE1-\xFE]) on the right (no digit, _, or character from your range is allowed immediately after the word-matching pattern). Note that the word-matching pattern is wrapped in parentheses to capture it into a group. The "problem" is that the text matched by the non-capturing group (?:^|[^_0-9\xE1-\xFE]) also lands in the overall match. That is why we need to access .Submatches:
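Sub TestGreekRegEx()
    Dim str As String
    Dim regEx As Object, Matches As Object, Match As Object
    str = "YOUR_NON_ASCII_STRING_HERE"
    Set regEx = CreateObject("vbscript.regexp")
    regEx.Global = True ' without this, Execute stops after the first match
    regEx.Pattern = "(?:^|[^_0-9\xE1-\xFE])([\xE1-\xFE]+)(?![_0-9\xE1-\xFE])"
    Set Matches = regEx.Execute(str)
    For Each Match In Matches
        MsgBox Match.Submatches(0) ' <--- See here: the captured word only
    Next
End Sub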
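The same idea can be wrapped in a reusable function. This is a minimal sketch, not a definitive implementation: GreekWholeWords and DemoGreekWholeWords are hypothetical names, and the range \u03B1-\u03CE is an assumption that covers lowercase alpha to omega with tonos but not every accented or uppercase Greek letter, so adjust it to your text.

```vba
' Hypothetical helper: collects every run of characters from the assumed
' range that is delimited by the custom boundaries.
Function GreekWholeWords(ByVal text As String) As Collection
    Const GREEK As String = "\u03B1-\u03CE"   ' assumed character range
    Dim regEx As Object
    Dim m As Object
    Dim result As New Collection

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Global = True
    ' Left boundary: start of string or a non-word character (consumed).
    ' Right boundary: negative lookahead (nothing consumed).
    regEx.Pattern = "(?:^|[^_0-9" & GREEK & "])([" & GREEK & "]+)(?![_0-9" & GREEK & "])"

    For Each m In regEx.Execute(text)
        result.Add m.SubMatches(0)   ' the captured word, without the left boundary
    Next m

    Set GreekWholeWords = result
End Function

Sub DemoGreekWholeWords()
    Dim word As String
    Dim words As Collection

    ' The word "αυτό", assembled from its UTF-16 code units.
    word = ChrW(&H3B1) & ChrW(&H3C5) & ChrW(&H3C4) & ChrW(&H3CC)

    ' The middle occurrence is followed by a digit, so it is not a whole word.
    Set words = GreekWholeWords(word & " abc " & word & "1 " & word)
    Debug.Print words.Count
End Sub
```

Because the right boundary is a lookahead, the separator after a word is not consumed and remains available as the left boundary of the next word, which is why two words separated by a single space are both found.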