PHP mb_split(), capturing delimiters

Question

preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns the captured delimiters in the resulting array. mb_split has no such option.

Is there any way to split a multibyte string (not just UTF-8, but any encoding) and capture the delimiters?

I'm trying to make a multibyte-safe line-break splitter that keeps the line breaks, but I would prefer a more generically usable solution.

Solution: thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/PHP-multibyte-functions/blob/master/functions/mb_explode.php). It accepts all the preg_split flags:

/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {       
    $strlen = strlen($string);      // length in bytes; mb_ereg_search_pos() also reports byte offsets
    mb_ereg_search_init($string);

    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);

        // move position
        $position = $array[0] + $array[1];

        // capture delimiter
        $regs = mb_ereg_search_getregs();           
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

        // Continue on?
        if ($position >= $strlen) {
            break;
        }           
    }

    // Add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);

    // Substrings
    $parts = array();
    $position = 0;      
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter   = $length[1];
        $is_captured    = $length[2];

        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {          
                $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                            ? array(mb_strcut($string, $position), $position)
                            : mb_strcut($string, $position);                
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
               && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[]    = $flags & PREG_SPLIT_OFFSET_CAPTURE
                        ? array(mb_strcut($string, $position, $length[0]), $position)
                        : mb_strcut($string, $position, $length[0]);
        }

        $position += $length[0];
    }

    return $parts;
}

Recommended answer

Capturing delimiters is only possible with preg_split; the other split functions do not offer it.

That leaves three possibilities:

1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.

This is the simplest way, which cannot be said of the second one. (Note that, in general, it is simpler to always work in UTF-8 than to deal with exotic encodings.)
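A minimal sketch of this first approach, assuming the source encoding is known and is one that mb_convert_encoding supports. The helper name split_keep_delimiters is hypothetical and not part of PHP or mbstring:

```php
<?php
// Hypothetical helper (not a PHP built-in): split a string in an arbitrary
// encoding with preg_split by round-tripping through UTF-8.
function split_keep_delimiters($pattern, $string, $encoding)
{
    // 1. Convert the input to UTF-8 so that PCRE's /u mode can handle it.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    // 2. Split; capture groups in $pattern are returned as delimiters.
    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return false; // invalid pattern or invalid UTF-8
    }

    // 3. Convert every piece back to the original encoding.
    return array_map(function ($part) use ($encoding) {
        return mb_convert_encoding($part, $encoding, 'UTF-8');
    }, $parts);
}

// Usage: a Shift_JIS string containing two different line breaks.
$sjis  = mb_convert_encoding("日本語\r\nテキスト\n終わり", 'SJIS', 'UTF-8');
$parts = split_keep_delimiters('/(\r\n|\r|\n)/u', $sjis, 'SJIS');
```

The delimiter has to sit inside a capture group, otherwise PREG_SPLIT_DELIM_CAPTURE has nothing to return. Here $parts should hold five items, with the line breaks at the odd indexes, all still encoded as Shift_JIS.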

2) Instead of a split-like function, use for example mb_ereg_search_regs to collect the matched parts, and build the pattern like this:

delimiter|all_that_is_not_the_delimiter

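(Note that the two branches of the alternation must be mutually exclusive, and take care to write them so that no gaps are possible between results: the first part must start at the beginning of the string, the last part must end at the end of the string, and each part must be contiguous with the previous one.)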
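As a rough illustration of this second approach for the line-break case, the sketch below alternates between "a line break" and "a run of anything that is not a line break". The function name split_lines_keep_breaks is hypothetical, and the code assumes mb_regex_encoding() has been set to the encoding of the input:

```php
<?php
// Hypothetical helper: tokenize a string into line breaks and the text
// between them, using the mbregex search functions.
function split_lines_keep_breaks($string)
{
    // Branch 1-3: a line break. Branch 4: one or more non-break characters.
    // The branches are mutually exclusive and together cover every character,
    // so consecutive matches leave no gaps.
    mb_ereg_search_init($string, '\r\n|\r|\n|[^\r\n]+');

    $tokens = array();
    while (($regs = mb_ereg_search_regs()) !== false) {
        $tokens[] = $regs[0]; // full match: either a break or a text run
    }
    return $tokens;
}

// Usage, assuming UTF-8 input.
mb_regex_encoding('UTF-8');
$text   = "東京\r\nOsaka\n";
$tokens = split_lines_keep_breaks($text);
```

Because every character belongs to exactly one branch, joining the tokens back together should reproduce the original string, which is a convenient sanity check for a pattern of this kind.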

3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they do not match any characters, only positions in the string. So you can use this kind of pattern, which matches the positions before or after the delimiter:

(?=delimiter)|(?<=delimiter)

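(The limitation of this approach is that the subpattern in the lookbehind cannot have a variable length (in other words, you cannot use a quantifier inside it), but it can be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3))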
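A small sketch of the lookaround approach, using only the lookbehind half so that each piece keeps its trailing line break. It assumes UTF-8 input and a single "\n" delimiter; the variable names are illustrative only:

```php
<?php
// Sketch only: split at the position right after each "\n".
// The lookbehind consumes nothing, so the line break stays in the piece
// that precedes the split point.
mb_regex_encoding('UTF-8'); // assumption: the input is UTF-8

$text  = "東京\nOsaka";
$parts = mb_split('(?<=\n)', $text);
```

With this input the result should be two pieces, the first ending in the line break. How mb_split treats a zero-width match at the very end of the string (for example, input that ends with "\n") is worth checking on your own PHP version before relying on it.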
