I have a PHP regex; how do I add a condition for the number of characters?


Question

I have a regular expression that I'm using in PHP:

$word_array = preg_split(
    '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
    urldecode($path),
    -1, // no limit; passing NULL here is deprecated as of PHP 8.1
    PREG_SPLIT_NO_EMPTY
);

It works great. It takes a chunk of URL parameters like:

/2009/06/pagerank-update.html

and returns an array like:

array(4) {
  [0]=>
  string(4) "2009"
  [1]=>
  string(2) "06"
  [2]=>
  string(8) "pagerank"
  [3]=>
  string(6) "update"
}

The only thing I still need is for it not to return strings shorter than 3 characters. So the "06" string is garbage, and I'm currently using an if statement to weed such strings out.

Recommended answer

The magic of the split. My original assumption was technically not correct (although it led to a solution that was easier to reach). So let's look at your split pattern:

(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)

I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:

 html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]

That is just some tidying up front. Let's call this pattern the split pattern, s for short, and define it.

You want to match every part that contains nothing from the split pattern and is at least three characters long.

I could achieve this with the following pattern, which handles the split sequences correctly and supports Unicode:

$pattern    = '/
    (?(DEFINE)
        (?<s> # define subpattern which is the split pattern
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\\/._=?&%+-] # a little bit optimized with a character class
        )
    )
    (?:(?&s))          # consume the subpattern (URL starts with \/)
    \K                 # capture starts here
    (?:(?!(?&s)).){3,} # ensure this is not the split pattern, take 3 characters minimum
/ux';

Or more compact:

$path       = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject    = urldecode($path);
$pattern    = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);

Result:

Array
(
    [0] => 2009
    [1] => pagerank
    [2] => update
    [3] => test
    [4] => testä
)
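To see why the pattern first consumes a split sequence and only then resets the match start with \K, it can help to look at the building block on its own. The following is a minimal sketch that assumes a reduced, hypothetical split pattern (only "-" and "html") and a made-up subject string; neither comes from the question.

```php
<?php
// Hypothetical subject and a reduced split pattern: just "-" and "html".
$subject = 'ab-pagerank-uphtmlxyz';

// Building block alone: three or more characters, none of which begins
// a split sequence. A match may start in the middle of a delimiter.
preg_match_all('/(?:(?!html|-).){3,}/', $subject, $unanchored);

// Same block, but a split sequence must come first; \K drops it from the match.
preg_match_all('/(?:html|-)\K(?:(?!html|-).){3,}/', $subject, $anchored);

print_r($unanchored[0]); // "pagerank" and "tmlxyz"
print_r($anchored[0]);   // "pagerank" and "xyz"
```

Without the leading split sequence, the engine skips the "h" of "html" and then happily starts a match at "tml", so part of a delimiter leaks into the result. Requiring a split sequence first avoids that, at the cost of only finding words that are preceded by one, which is fine here because the URL path starts with a slash.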

The same principle can be used with preg_split as well. It's a little bit different:

$pattern = '/
    (?(DEFINE)       # define subpattern which is the split pattern
        (?<s>
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\/._=?&%+-]
        )
    )
    (?:(?!(?&s)).){3,}(*SKIP)(*FAIL)       # three or more characters: keep (never split here)
    |(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)   # one or two characters: consume as a delimiter
    |(?&s)                                 # otherwise split at the split pattern
/ux';

Usage:

$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);

Result:

Array
(
    [0] => 2009
    [1] => pagerank
    [2] => update
    [3] => test
    [4] => testä
)

These routines work as requested, but they come at a price in performance. The cost is similar to that of the old answer.
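The performance remark can be checked locally. The following is a rough sketch, not a rigorous benchmark: it times the compact single-pattern variant against the two-step split-then-filter approach on the sample path. The helper seconds_for and the run count are assumptions made for illustration, and timings will vary by machine and PHP version.

```php
<?php
$subject = urldecode('/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml');

// Single-pattern variant (the compact pattern from this answer).
$single = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
// Original split pattern from the question, used by the two-step variant.
$split = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';

$onePass = function () use ($subject, $single): array {
    return preg_match_all($single, $subject, $m) ? $m[0] : [];
};
$twoStep = function () use ($subject, $split): array {
    $parts = preg_split($split, $subject, -1, PREG_SPLIT_NO_EMPTY);
    // preg_filter keeps the original keys, so reindex for comparison.
    return array_values(preg_filter('/^.{3,}$/', '$0', $parts));
};

// Hypothetical timing helper: wall-clock seconds for $runs calls of $fn.
function seconds_for(callable $fn, int $runs = 10000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

printf("one pass: %.4f s\n", seconds_for($onePass));
printf("two step: %.4f s\n", seconds_for($twoStep));
```

Both closures return the same list of words for this input, so the timing compares like with like.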


Old answer, doing a two-step processing (first splitting, then filtering)

Because you are using a split routine, it will split regardless of the length of the parts.

So what you can do is filter the result. You can do that with another regular expression (preg_filter), for example one that drops everything shorter than three characters:

$word_array = preg_filter(
    '/^.{3,}$/', '$0',
    preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path),
        -1, // no limit; passing NULL here is deprecated as of PHP 8.1
        PREG_SPLIT_NO_EMPTY
    )
);

Result:

Array
(
    [0] => 2009
    [2] => pagerank
    [3] => update
)
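Note that the keys in this result are 0, 2 and 3, because preg_filter preserves the keys of the entries it keeps. If a gapless list is needed, the two steps can be wrapped in a small function that reindexes the result. The following is a sketch only; the function name split_path_words and its $minLength parameter are hypothetical, and the /u modifier is added on the assumption that lengths should be counted in UTF-8 characters rather than bytes.

```php
<?php
// Hypothetical helper, not part of the original answer:
// split a URL path into words and drop the ones that are too short.
function split_path_words(string $path, int $minLength = 3): array
{
    $parts = preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // /u counts UTF-8 characters instead of bytes, /s lets "." match newlines.
    // Parts that are not valid UTF-8 make preg_match return false and are dropped.
    $lengthCheck = '/^.{' . $minLength . ',}$/su';
    $long = array_filter($parts, function (string $part) use ($lengthCheck): bool {
        return preg_match($lengthCheck, $part) === 1;
    });

    // array_filter preserves keys as well, so reindex from 0.
    return array_values($long);
}

print_r(split_path_words('/2009/06/pagerank-update.html'));
```

For the sample path this yields 2009, pagerank and update under the keys 0, 1 and 2.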
