Regex to parse define() contents, possible?

Question

I am very new to regex, and this is way too advanced for me, so I am asking the experts here.

Problem: I would like to retrieve the constant names and values from a PHP define() call:

DEFINE('TEXT', 'VALUE');

Basically, I would like a regex that returns the name of the constant and the value of the constant from the line above, that is, just TEXT and VALUE. Is this even possible?

Why do I need it? I am dealing with a language file and I want to get all the (name, value) pairs and put them in an array. I managed to do it with str_replace(), trim() and so on, but that approach is long-winded and I am sure a single-line regex would make it easier.

Note: the VALUE may contain escaped single quotes as well. Example:

DEFINE('TEXT', 'J\'ai');

I hope I am not asking for something too complicated. :)

Regards

Recommended answer

For any kind of grammar-based parsing, regular expressions are usually an awful solution. Even simple grammars (like arithmetic) have nesting, and nesting in particular is where regular expressions fall over.
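To make the limitation concrete, here is a minimal sketch of a regex that does cope with the flat case from the question, including escaped single quotes. The pattern and the names DEFINE_PATTERN and regex_defines() are illustrative assumptions, not part of the original answer. It only recognises two single-quoted literals, so a comment between the tokens or a numeric value is silently missed.

```php
<?php
// Assumed pattern: define( 'NAME' , 'VALUE' ) with optional whitespace.
// (?:[^'\\]|\\.)* matches a single-quoted body and allows escapes such as \'.
const DEFINE_PATTERN = <<<'RE'
/\bdefine\s*\(\s*'((?:[^'\\]|\\.)*)'\s*,\s*'((?:[^'\\]|\\.)*)'\s*\)/is
RE;

// Hypothetical helper: returns name => raw value for every match.
function regex_defines(string $code): array
{
    $pairs = [];
    preg_match_all(DEFINE_PATTERN, $code, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // $m[2] is the raw source text, so J\'ai keeps its backslash.
        $pairs[$m[1]] = $m[2];
    }
    return $pairs;
}

$sample = <<<'SRC'
DEFINE('TEXT', 'J\'ai');
define ( /* comment */ 'foo', 'bar');
define('CONST7', 3.14);
SRC;

// Only TEXT is found: the comment and the number do not fit the pattern.
print_r(regex_defines($sample));
```

Each extra shape (comments, double quotes, numbers, nested calls) needs another alternative in the pattern, which is where this approach stops scaling.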

Fortunately, PHP provides a far better solution: the token_get_all() function gives you access to the same lexical analyzer that the PHP interpreter uses. Give it a character stream of PHP code and it will split it into tokens ("lexemes"), which you can then parse with a pretty simple finite state machine.
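As a quick illustration of what the lexer hands back, the sketch below tokenizes the single statement from the question and labels each token. The helper name describe_tokens() is made up for this example; token_get_all() and token_name() are the standard PHP functions.

```php
<?php
// Hypothetical helper: turn the token stream into readable "NAME text" lines.
function describe_tokens(string $code): array
{
    $lines = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array tokens are [id, text, line number].
            $lines[] = token_name($token[0]) . ' ' . $token[1];
        } else {
            // Single-character symbols such as ( , ) ; come back as plain strings.
            $lines[] = 'SYMBOL ' . $token;
        }
    }
    return $lines;
}

// The nowdoc keeps the backslash in J\'ai exactly as it appears in source.
$code = <<<'SRC'
<?php define('TEXT', 'J\'ai');
SRC;

echo implode("\n", describe_tokens($code)), "\n";
```

The escaped quote stays inside one T_CONSTANT_ENCAPSED_STRING token, so the state machine never has to reason about quoting itself.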

Run this program (it is saved as test.php, so it parses itself). The file is deliberately badly formatted so you can see that it handles that with ease.

<?php
    define('CONST1', 'value'   );
define   (CONST2, 'value2');
define(   'CONST3', time());
  define('define', 'define');
    define("test", VALUE4);
define('const5', //

'weird declaration'
)    ;
define('CONST7', 3.14);
define ( /* comment */ 'foo', 'bar');
$defn = 'blah';
define($defn, 'foo');
define( 'CONST4', define('CONST5', 6));

header('Content-Type: text/plain');

$defines = array();
$state = 0;
$key = '';
$value = '';

$file = file_get_contents('test.php');
$tokens = token_get_all($file);
$token = reset($tokens);
while ($token) {
//    dump($state, $token);
    if (is_array($token)) {
        if ($token[0] == T_WHITESPACE || $token[0] == T_COMMENT || $token[0] == T_DOC_COMMENT) {
            // do nothing
        } else if ($token[0] == T_STRING && strtolower($token[1]) == 'define') {
            $state = 1;
        } else if ($state == 2 && is_constant($token[0])) {
            $key = $token[1];
            $state = 3;
        } else if ($state == 4 && is_constant($token[0])) {
            $value = $token[1];
            $state = 5;
        }
    } else {
        $symbol = trim($token);
        if ($symbol == '(' && $state == 1) {
            $state = 2;
        } else if ($symbol == ',' && $state == 3) {
            $state = 4;
        } else if ($symbol == ')' && $state == 5) {
            $defines[strip($key)] = strip($value);
            $state = 0;
        }
    }
    $token = next($tokens);
}

foreach ($defines as $k => $v) {
    echo "'$k' => '$v'\n";
}

function is_constant($token) {
    return $token == T_CONSTANT_ENCAPSED_STRING || $token == T_STRING ||
        $token == T_LNUMBER || $token == T_DNUMBER;
}

function dump($state, $token) {
    if (is_array($token)) {
        echo "$state: " . token_name($token[0]) . " [$token[1]] on line $token[2]\n";
    } else {
        echo "$state: Symbol '$token'\n";
    }
}

function strip($value) {
    return preg_replace('!^([\'"])(.*)\1$!', '$2', $value);
}
?>

Output:

'CONST1' => 'value'
'CONST2' => 'value2'
'CONST3' => 'time'
'define' => 'define'
'test' => 'VALUE4'
'const5' => 'weird declaration'
'CONST7' => '3.14'
'foo' => 'bar'
'CONST5' => '6'

This is basically a finite state machine that looks for the pattern:

function name ('define')
open parenthesis
constant
comma
constant
close parenthesis

in the lexical stream of a PHP source file, and treats the two constants as a (name, value) pair. In doing so it handles nested define() statements (as the results show), ignores whitespace and comments, and works across multiple lines.
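The same state machine can be sketched as a function that takes source code as a string, which avoids executing the sample define() calls (on PHP 8, evaluating a bare undefined constant such as CONST2 throws an Error). The function name extract_defines(), the helper strip_quotes() and the S_* constant names are assumptions introduced here to label the states 0 to 5; the logic follows the program above.

```php
<?php
const S_IDLE = 0;        // waiting for the word "define"
const S_SAW_DEFINE = 1;  // saw define, expecting (
const S_WANT_NAME = 2;   // saw (, expecting the name
const S_WANT_COMMA = 3;  // have the name, expecting ,
const S_WANT_VALUE = 4;  // saw the comma, expecting the value
const S_WANT_CLOSE = 5;  // have the value, expecting )

// Remove one pair of matching outer quotes, as strip() does above.
function strip_quotes(string $text): string
{
    return preg_replace('!^([\'"])(.*)\1$!s', '$2', $text);
}

function extract_defines(string $code): array
{
    $literals = [T_CONSTANT_ENCAPSED_STRING, T_STRING, T_LNUMBER, T_DNUMBER];
    $skipped = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT];
    $defines = [];
    $state = S_IDLE;
    $key = $value = '';

    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            [$id, $text] = $token;
            if (in_array($id, $skipped, true)) {
                continue; // whitespace and comments never change the state
            }
            if ($id === T_STRING && strtolower($text) === 'define') {
                $state = S_SAW_DEFINE; // also restarts on a nested define
            } elseif ($state === S_WANT_NAME && in_array($id, $literals, true)) {
                $key = $text;
                $state = S_WANT_COMMA;
            } elseif ($state === S_WANT_VALUE && in_array($id, $literals, true)) {
                $value = $text;
                $state = S_WANT_CLOSE;
            }
        } elseif ($token === '(' && $state === S_SAW_DEFINE) {
            $state = S_WANT_NAME;
        } elseif ($token === ',' && $state === S_WANT_COMMA) {
            $state = S_WANT_VALUE;
        } elseif ($token === ')' && $state === S_WANT_CLOSE) {
            $defines[strip_quotes($key)] = strip_quotes($value);
            $state = S_IDLE;
        }
    }
    return $defines;
}

$sample = <<<'SRC'
<?php
DEFINE('TEXT', 'J\'ai');
define ( /* comment */ 'foo', 'bar');
define('CONST7', 3.14);
define( 'CONST4', define('CONST5', 6));
SRC;

print_r(extract_defines($sample));
```

Values are returned as raw source text, so the escaped quote in J\'ai still carries its backslash; unescaping is a separate step.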

Note: I've deliberately made it ignore the case where functions or variables are used as constant names or values, but you can extend it to handle that if you wish.

It's also worth pointing out that PHP is quite forgiving when it comes to strings. They can be declared with single quotes, double quotes or (in certain circumstances) with no quotes at all. As Gumbo pointed out, an unquoted string can be an ambiguous reference to a constant, and you have no guaranteed way of knowing which it is, giving you the choice of:

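  1. Ignore that style of string (T_STRING);
  2. See whether a constant has already been declared with that name and substitute its value. However, you cannot know which other files have been included, nor can you process defines that are created conditionally, so you cannot say with certainty whether something is a constant or what value it has; or
  3. Live with the (unlikely) possibility that these might be constants and simply treat them as strings.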

Personally I would go for (1) then (3).
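One way to read "(1) then (3)" is to ignore bare words by default and offer an opt-in fallback that treats them as strings. The sketch below applies that reading to a single operand token; operand_value() and its $bareAsString flag are hypothetical names, and the unescaping covers single-quoted strings only.

```php
<?php
// Hypothetical helper: map one operand token to its value.
// Returns null when the operand should be ignored.
function operand_value(array $token, bool $bareAsString = false): ?string
{
    [$id, $text] = $token;
    switch ($id) {
        case T_CONSTANT_ENCAPSED_STRING:
            $body = substr($text, 1, -1); // drop the surrounding quotes
            if ($text[0] === "'") {
                // Single-quoted strings only know two escapes: \\ and \'.
                return strtr($body, ['\\\\' => '\\', '\\\'' => "'"]);
            }
            // Double-quoted escapes (\n, \x41, ...) are left as written here.
            return $body;
        case T_LNUMBER:
        case T_DNUMBER:
            return $text;
        case T_STRING:
            // Option (1): ignore bare words. Option (3): treat them as strings.
            return $bareAsString ? $text : null;
        default:
            return null;
    }
}

// Tokens use the [id, text, line] shape that token_get_all() produces.
var_dump(operand_value([T_CONSTANT_ENCAPSED_STRING, "'J\\'ai'", 1]));
var_dump(operand_value([T_STRING, 'VALUE4', 1]));
var_dump(operand_value([T_STRING, 'VALUE4', 1], true));
```

Option (2) is left out because it depends on which files have already been loaded, which the tokenizer cannot see.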
