I have a PHP regex, how do I add a condition for the number of characters?
Problem description
I have a regular expression that I'm using in PHP:
$word_array = preg_split(
    '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
    urldecode($path), -1, PREG_SPLIT_NO_EMPTY // -1 means no limit; NULL is deprecated for this int parameter since PHP 8.1
);
It works great. It takes a chunk of URL parameters like:
/2009/06/pagerank-update.html
and returns an array like:
array(4) {
[0]=>
string(4) "2009"
[1]=>
string(2) "06"
[2]=>
string(8) "pagerank"
[3]=>
string(6) "update"
}
The only thing I need is for it to also not return strings that are shorter than 3 characters. So the "06" string is garbage, and I'm currently using an if statement to weed them out.
Recommended answer
The magic of the split. My original assumption was technically not correct (although it led to an easier solution). So let's look at your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I rearranged it a bit. The outer parentheses are not necessary, and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
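As a quick sanity check, the rearranged pattern should split exactly like the original. A minimal sketch, using the sample path from the question; the variable names $original and $rearranged are made up for illustration:

```php
<?php
// Original pattern from the question and the rearranged equivalent.
$original   = '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/';
$rearranged = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/';

$subject = urldecode('/2009/06/pagerank-update.html');

$a = preg_split($original, $subject, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($rearranged, $subject, -1, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b); // bool(true): both give 2009, 06, pagerank, update
```

The order of the word alternatives is unchanged, and no word starts with one of the single characters, so moving the characters to the end does not change which alternative wins at any position.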
So much for some tidying up front. Let's call this pattern the split pattern, s for short, and define it.
You want to match every part that contains nothing matched by the split pattern and that is at least three characters long.
I could achieve this with the following pattern, which handles the split sequences correctly and supports Unicode.
$pattern = '/
    (?(DEFINE)
        (?<s>   # define the subpattern s, which is the split pattern
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\\/._=?&%+-]   # single characters, collected in a character class
        )
    )
    (?:(?&s))               # consume one split token first (the URL path starts with a slash)
    \K                      # reset the match start: the reported match begins here
    (?:(?!(?&s)).){3,}      # take characters that do not start a split token, three minimum
/ux';
Or, more compactly:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
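The two building blocks of that pattern may be easier to see in isolation. The following is only an illustrative sketch with made-up subjects, not part of the solution itself: \K discards what has been matched so far from the reported match, and (?:(?!X).)+ consumes characters one at a time for as long as X does not begin at the current position.

```php
<?php
// \K drops everything matched so far from the reported match.
preg_match('/foo\Kbar/', 'foobar', $m);
echo $m[0], "\n"; // bar

// Tempered token: consume one character at a time, but only while the
// stop word (here "html") does not begin at the current position.
preg_match('/(?:(?!html).)+/', 'updatehtmltest', $t);
echo $t[0], "\n"; // update
```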
The same principle can be used with preg_split as well. It's a little bit different:
$pattern = '/
    (?(DEFINE)              # define the subpattern s, which is the split pattern
        (?<s>
            html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
            [\/._=?&%+-]
        )
    )
    (?:(?!(?&s)).){3,}(*SKIP)(*FAIL)        # three or more characters: keep, never split here
    |(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT)    # one or two characters: treat as a delimiter, so it is dropped
    |(?&s)                                  # otherwise split at the split pattern
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
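To see how the (*SKIP)(*FAIL) alternative protects long pieces from being split, here is a simplified analogue. It assumes a plain hyphen as the only delimiter instead of the full split pattern, and it leaves out (*ACCEPT), which this reduced form does not need:

```php
<?php
// Alternative 1: three or more non-hyphen characters are skipped over,
//                so they can never become part of a delimiter.
// Alternative 2: one or two non-hyphen characters count as a delimiter,
//                which removes them from the result.
// Alternative 3: the hyphen itself.
$pattern = '/[^-]{3,}(*SKIP)(*FAIL)|[^-]{1,2}|-/';

$parts = preg_split($pattern, 'ab-cde-f-ghij', -1, PREG_SPLIT_NO_EMPTY);

print_r($parts); // cde, ghij
```

Because the short pieces are matched as delimiters, they leave only empty strings behind, and PREG_SPLIT_NO_EMPTY discards those.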
These routines work as asked for. But this comes at a price in performance. The cost is similar to that of the old answer.
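If you want to measure the difference on your own data, a rough timing sketch could look like the following. The iteration count is arbitrary, hrtime() needs PHP 7.3 or later, and the two-step variant here uses the rearranged split pattern together with preg_grep and array_values as an assumption, rather than the exact code from the old answer. No particular outcome is implied; numbers will vary by machine and PHP version.

```php
<?php
$subject = urldecode('/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml');

$oneStep = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$split   = '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]/u';

$iterations = 10000; // arbitrary; raise it for steadier numbers

// One step: match the wanted parts directly.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $a = preg_match_all($oneStep, $subject, $m) ? $m[0] : [];
}
$oneStepNs = hrtime(true) - $start;

// Two steps: split first, then keep parts of three or more characters.
$start = hrtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $pieces = preg_split($split, $subject, -1, PREG_SPLIT_NO_EMPTY);
    $b = array_values(preg_grep('/^.{3,}$/su', $pieces));
}
$twoStepNs = hrtime(true) - $start;

printf("one step: %.1f ms, two steps: %.1f ms\n", $oneStepNs / 1e6, $twoStepNs / 1e6);
```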
Old answer, doing two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split regardless of the length of the pieces.
So what you can do is filter the result. You can do that with a regular expression again (preg_filter), for example one that drops everything shorter than three characters:
$word_array = preg_filter(
    '/^.{3,}$/', '$0',
    preg_split(
        '/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
        urldecode($path),
        -1, // no limit; NULL is deprecated for this int parameter since PHP 8.1
        PREG_SPLIT_NO_EMPTY
    )
);
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
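Note that preg_filter keeps the original keys, which is why index 1 is missing above. If you need a list indexed from zero, one possible variant (a sketch that swaps in preg_grep and array_values, and uses the rearranged split pattern) is:

```php
<?php
$path = '/2009/06/pagerank-update.html';

$parts = preg_split(
    '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/',
    urldecode($path),
    -1,
    PREG_SPLIT_NO_EMPTY
);

// preg_grep() keeps the entries that match (three or more characters, counted
// as UTF-8 characters because of the u flag); array_values() reindexes from 0.
$word_array = array_values(preg_grep('/^.{3,}$/su', $parts));

print_r($word_array); // 2009, pagerank, update
```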