Regex to match specific functions and their arguments in files
Question
I'm working on a gettext JavaScript parser and I'm stuck on the parsing regex.
I need to catch every argument passed to the specific method calls _n( and _(. For example, if I have these in my JavaScript files:
_("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
This refers to this documentation: http://poedit.net/trac/wiki/Doc/Keywords
I plan to do it in two passes (with two regexes):
- catch all function arguments for _n( or _( method calls
- catch the stringy ones only
Basically, I'd like a regex that says "catch everything after _n( or _( and stop at the last parenthesis ), the one that actually ends the function call". I don't know whether this is possible with a regex alone, without a JavaScript parser.
Another option would be to "catch every "string" or 'string' after _n( or _( and stop at the end of the line OR at the beginning of a new _n( or _(".
In everything I've tried, I get stuck either on _( "one (optional)" ); with its inner parentheses, or on apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) with two calls on the same line.
Here is what I have implemented so far, with imperfect regexes: a generic parser and the JavaScript one or the Handlebars one
Recommended answer
Note: Read this answer if you're not familiar with recursion.
Part 1: match specific functions
Who said that regex can't be modular? PCRE regex to the rescue!
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
The s modifier makes . match newlines as well, and the x modifier allows this fancy spacing and commenting of our regex.
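To see the recursion at work on the two cases that caused trouble in the question, here is a condensed sketch. It is not the exact pattern above: the two string definitions are folded into one hypothetical str group written with negated character classes, and $source and $calls are illustrative names.

```php
<?php
// Condensed variant of the pattern above (an illustration, not the original).
// Quoted strings are consumed as whole units, so parentheses inside them
// never affect the bracket balance.
$pattern = <<<'regex'
~
(?(DEFINE)
    (?P<str> "(?:[^"\\]|\\.)*" | '(?:[^'\\]|\\.)*' )   # a quoted string
    (?P<brackets> \( (?: (?&str) | [^()"'] | (?&brackets) )* \) )
)
_n?\s*(?P<results>(?&brackets))
~sx
regex;

$source = <<<'js'
_( "one (optional)" );
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples)
js;

preg_match_all($pattern, $source, $m);
$calls = $m['results'];

// Three calls are found, each with its outer brackets still attached:
//   ( "one (optional)" )
//   ( "No apples" )
//   ("%1 apple", "%1 apples", apples)
print_r($calls);
```

Without the x modifier the whitespace and # comments would be treated as literal characters to match, so the modifiers are what make this layout possible.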
Since our regex also captures the opening and closing brackets (), we might need to filter them out. We will use preg_replace() on the results:
~ # Delimiter
^ # Assert begin of string
\( # Match an opening bracket
\s* # Match optional whitespaces
| # Or
\s* # Match optional whitespaces
\) # Match a closing bracket
$ # Assert end of string
~x
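As a quick illustration of what this filter does, the sketch below applies the same expression, written on one line without the x modifier, to a few captured groups. The sample values and the variable names $captured and $stripped are assumptions made for the example.

```php
<?php
// Sample captures as the function pattern would return them (illustrative values).
$captured = ['( "one (optional)" )', '("bar", "baz", 42)', '()'];

// Remove the leading "(" and the trailing ")" together with adjacent whitespace.
// The ^ and $ anchors keep parentheses in the middle of the string untouched.
$stripped = preg_replace('~^\(\s*|\s*\)$~', '', $captured);

// $stripped now holds: "one (optional)" / "bar", "baz", 42 / (empty string)
print_r($stripped);
```

preg_replace() accepts an array as its subject and returns an array, which is why the final code can pass all captured groups in a single call.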
So here's another modular regex; you could even add your own grammar:
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable) # Tried last: it also accepts quote characters
~xis
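The order of the alternatives matters here, because the variable class [^\s,()]+ also accepts quote characters. The sketch below uses two reduced patterns (no array or brackets definitions, and simplified string patterns, both assumptions made for brevity) to compare trying strings first against trying the variable first.

```php
<?php
// Reduced patterns for illustration only: strings first vs. variable first.
$stringsFirst = <<<'regex'
~ "(?:[^"\\]|\\.)*" | '(?:[^'\\]|\\.)*' | [^\s,()]+ ~x
regex;
$variableFirst = <<<'regex'
~ [^\s,()]+ | "(?:[^"\\]|\\.)*" | '(?:[^'\\]|\\.)*' ~x
regex;

$arguments = '"one (optional)", count';

preg_match_all($stringsFirst, $arguments, $good);
preg_match_all($variableFirst, $arguments, $bad);

// Strings first keeps the quoted argument intact:
//   "one (optional)" / count
print_r($good[0]);

// Variable first stops at the first space and breaks the string apart:
//   "one / optional / " / count
print_r($bad[0]);
```

PCRE takes the first alternative that matches at a given position, so the most specific patterns should be listed before the catch-all one.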
We will loop and use preg_match_all(). The final code would look like this:
$functionPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
_n? # Match _ or _n
\s* # Optional white spaces
(?P<results>(?&brackets)) # Recurse/use the brackets pattern and put it in the results group
~sx
regex;
$argumentsPattern = <<<'regex'
~ # Delimiter
(?(DEFINE) # Start of definitions
(?P<str_double_quotes>
(?<!\\) # Not escaped
" # Match a double quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
" # Match the ending double quote
)
(?P<str_single_quotes>
(?<!\\) # Not escaped
' # Match a single quote
(?: # Non-capturing group
[^\\] # Match anything not a backslash
| # Or
\\. # Match a backslash and a single character (ie: an escaped character)
)*? # Repeat the non-capturing group zero or more times, ungreedy/lazy
' # Match the ending single quote
)
(?P<array>
Array\s*
(?&brackets)
)
(?P<variable>
[^\s,()]+ # I don't know the exact grammar for a variable in ECMAScript
)
(?P<brackets>
\( # Match an opening bracket
(?: # A non capturing group
(?&str_double_quotes) # Recurse/use the str_double_quotes pattern
| # Or
(?&str_single_quotes) # Recurse/use the str_single_quotes pattern
| # Or
(?&array) # Recurse/use the array pattern
| # Or
(?&variable) # Recurse/use the variable pattern
| # Or
[^()] # Anything not a bracket
| # Or
(?&brackets) # Recurse the bracket pattern
)*
\)
)
) # End of definitions
# Let's start matching for real now:
(?&array)
|
(?&str_double_quotes)
|
(?&str_single_quotes)
|
(?&variable)
~six
regex;
$input = <<<'input'
_ ("foo") // want "foo"
_n("bar", "baz", 42); // want "bar", "baz", 42
_n(domain, "bux", var); // want domain, "bux", var
_( "one (optional)" ); // want "one (optional)"
apples === 0 ? _( "No apples" ) : _n("%1 apple", "%1 apples", apples) // could have on the same line two calls..
// misleading cases
_n("foo (")
_n("foo (\)", 'foo)', aa)
_n( Array(1, 2, 3), Array(")", '(') );
_n(function(foo){return foo*2;}); // Is this even valid?
_n (); // Empty
_ (
"Foo",
'Bar',
Array(
"wow",
"much",
'whitespaces'
),
multiline
); // PCRE is awesome
input;
if (preg_match_all($functionPattern, $input, $m)) {
    // Part 2: get rid of the opening & closing brackets
    $filtered = preg_replace(
        '~ # Delimiter
        ^ # Assert begin of string
        \( # Match an opening bracket
        \s* # Match optional whitespaces
        | # Or
        \s* # Match optional whitespaces
        \) # Match a closing bracket
        $ # Assert end of string
        ~x', // Regex
        '', // Replace with nothing
        $m['results'] // Subject
    );

    // Part 3: extract arguments:
    $parsedTree = array();
    foreach ($filtered as $arguments) { // Loop
        if (preg_match_all($argumentsPattern, $arguments, $m)) { // If there's a match
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => $m[0]
            ); // Add an array to our tree and fill it
        } else {
            $parsedTree[] = array(
                'all_arguments' => $arguments,
                'branches' => array()
            ); // Add an array with empty branches
        }
    }
    print_r($parsedTree); // Let's see the results
} else {
    echo 'no matches';
}
You might want to create a recursive function to generate a full tree. See this answer.
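One possible shape for such a recursive function is sketched below. It is only an illustration under stated assumptions: parseArguments() is a hypothetical helper, the pattern is a condensed variant of the arguments pattern rather than the original, and only Array(...) arguments are expanded into sub-trees.

```php
<?php
// Condensed arguments pattern (an illustrative variant, not the original).
$argumentPattern = <<<'regex'
~
(?(DEFINE)
    (?P<str> "(?:[^"\\]|\\.)*" | '(?:[^'\\]|\\.)*' )
    (?P<brackets> \( (?: (?&str) | [^()"'] | (?&brackets) )* \) )
)
(?P<array> Array \s* (?&brackets) )
| (?&str)
| [^\s,()]+
~sx
regex;

// Hypothetical helper: split an argument list into branches and
// recurse into every Array(...) branch to build a nested tree.
function parseArguments(string $arguments, string $pattern): array
{
    $tree = [];
    preg_match_all($pattern, $arguments, $m);
    foreach ($m[0] as $branch) {
        if (preg_match('~^Array\s*\((.*)\)$~s', $branch, $inner)) {
            // The greedy (.*) stops at the last ")", i.e. the array's own closing bracket.
            $tree[] = ['Array' => parseArguments($inner[1], $pattern)];
        } else {
            $tree[] = $branch;
        }
    }
    return $tree;
}

$tree = parseArguments('"Foo", Array("wow", Array(1, 2)), multiline', $argumentPattern);
print_r($tree);
```

Each Array(...) argument becomes a nested 'Array' entry whose value is the parsed list of its own arguments, while strings and bare identifiers stay as plain leaves.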
You might notice that the function(){} part isn't parsed correctly. I will leave that as an exercise for the reader :)