Matching case-insensitive exact phrases with spaces

Question

If I have a string "Hello I went to the store today" and an array of matches:

$perfectMatches = array("i went","store today");

It should match both of those. (The array can get quite large, so I'd prefer to do it in a single preg_match.)

Got this part working, thanks:

preg_match_all("/\b(" . implode("|", $perfectMatches) . ")\b/i", $string, $match1);
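The pattern above assumes the phrases contain no regex metacharacters. A minimal sketch of a safer variant follows; the helper name buildPhrasePattern is hypothetical and not part of the original question. It escapes each phrase with preg_quote before joining the phrases into one alternation.

```php
<?php
// Hypothetical helper: builds one case-insensitive pattern from a phrase list.
function buildPhrasePattern(array $phrases): string
{
    // Escape regex metacharacters and the "/" delimiter in every phrase.
    $quoted = array_map(function ($p) { return preg_quote($p, '/'); }, $phrases);
    // Word boundaries on both sides keep "store today" from matching inside longer words.
    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

$string = "Hello I went to the store today";
$perfectMatches = array("i went", "store today");

$found = array();
preg_match_all(buildPhrasePattern($perfectMatches), $string, $found);
print_r($found[0]); // the matched phrases, as they appear in $string
```

Note that \b only behaves as expected when each phrase starts and ends with a word character; phrases with leading or trailing punctuation would need different anchoring.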

I also need a separate regular expression that is a bit harder to explain. Say I have an array:

$array = array("birthday party","ice cream");//this can be very long

Is it possible to write a regular expression that matches a string if "birthday" and "party" both appear anywhere in it?

So it should match "Hi, it's my birthday and I'm going to have a party". And can "ice cream" be handled in the same single preg_match?

Thanks

Example...

A user submits a description of an item and I want to check it for spam. I know that most spam posts contain phrases like "personal checks" or "hot deal", so I want to build a list of these phrases and check the description against it. If the description contains any phrase in my list, it will be marked as spam. This scenario applies to the first regular expression.

The second regular expression covers the case where I know that some spam posts contain the words "lose", "weight" and "fast" somewhere in the text. They don't have to be in any particular order, but all 3 words must be in the description. So if I have a list of phrases such as "lose weight fast" and "credit card required" and check it against the description, I can mark the post as spam.

Recommended answer

It sounds like part 1 of your problem is already solved, so this answer focuses only on part 2. As I understand it, you are trying to determine whether a given input message contains every word in a list, in any order.

This can be done with one regex and a single preg_match per message, but it is very inefficient if you have a large list of words. If N is the number of words you are searching for and M is the length of the message, the algorithm should be O(N*M). Note that the regex contains two .* terms for each keyword, and with the lookahead assertions the regex engine has to traverse the message once per keyword. Here is the example code:

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

//build the regex pattern using the array of keywords (implode takes the separator first)
$patt = '/(?=.*\b'. implode('\b.*)(?=.*\b', $keywords) . '\b.*)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (preg_match($patt, $msg1) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (preg_match($patt, $msg2) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (preg_match($patt, $msg3) ? 'is' : 'is not') ." spam\n";
?>

The output is:

The pattern is: '/(?=.*\blose\b.*)(?=.*\bweight\b.*)(?=.*\bfast\b.*)/is'
msg1 is spam
msg2 is spam
msg3 is not spam
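The question also asks about checking several word groups ("birthday party", "ice cream") in one preg_match, which the example above does not cover. One possible extension, offered as a sketch only: build one lookahead chain per phrase and join the chains with alternation. The helper name buildAnyGroupPattern is hypothetical, and splitting phrases on whitespace is an assumption about how the list is stored.

```php
<?php
// Hypothetical helper: one pattern that matches when ANY phrase
// has ALL of its words present somewhere in the message.
function buildAnyGroupPattern(array $phrases): string
{
    $groups = array();
    foreach ($phrases as $phrase) {
        $lookaheads = '';
        // Assumption: words within a phrase are separated by whitespace.
        foreach (preg_split('/\s+/', trim($phrase)) as $word) {
            $lookaheads .= '(?=.*\b' . preg_quote($word, '/') . '\b)';
        }
        $groups[] = $lookaheads;
    }
    // Anchor at the start so the lookaheads are only evaluated from one position.
    return '/^(?:' . implode('|', $groups) . ')/is';
}

$patt = buildAnyGroupPattern(array('birthday party', 'ice cream'));
var_dump(preg_match($patt, "Hi, it's my birthday and I'm going to have a party"));
var_dump(preg_match($patt, "We had cream cheese on ice"));
var_dump(preg_match($patt, "It's my birthday"));
```

The cost concern raised above still applies: each lookahead may scan the whole message, so a long phrase list makes this pattern slow.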

This second solution looks more complex because there is more code, but the regex is much simpler. It has no lookahead assertions and no .* terms. The preg_match function is called in a while loop, but this is not a big deal: each call resumes from the end of the previous match, so each message is traversed only once and the complexity should be O(M). This could also be done with a single preg_match_all call, but then you would have to perform an array_search to get the final count.

<?php

// sample messages
$msg1 = "Lose all the weight all the weight you want.  It's fast and easy!";
$msg2 = 'Are you over weight? lose the pounds fast!';
$msg3 = 'Lose weight slowly by working really hard!';

// spam defining keywords (all required, but any order).
$keywords = array('lose', 'weight', 'fast');

//build the regex pattern using the array of keywords
$patt = '/(\b'. implode('\b|\b', $keywords) .'\b)/is';

echo "The pattern is: '" .$patt. "'\n";
echo 'msg1 '. (matchall($patt, $msg1, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg2 '. (matchall($patt, $msg2, $keywords) ? 'is' : 'is not') ." spam\n";
echo 'msg3 '. (matchall($patt, $msg3, $keywords) ? 'is' : 'is not') ." spam\n";

function matchall($patt, $msg, $keywords)
{
  $offset = 0;
  $matches = array();
  $index = array_fill_keys($keywords, 0);
  while( preg_match($patt, $msg, $matches, PREG_OFFSET_CAPTURE, $offset) ) {
    $offset = $matches[1][1] + strlen($matches[1][0]);
    $index[strtolower($matches[1][0])] += 1;
  }
  return min($index);
}
?>

The output is:

The pattern is: '/(\blose\b|\bweight\b|\bfast\b)/is'
msg1 is spam
msg2 is spam
msg3 is not spam
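The single preg_match_all variant mentioned above could look roughly like the following sketch. The function name containsAllKeywords is hypothetical, and it uses array_diff instead of the array_search the answer mentions, simply because it makes the "is any keyword missing?" check a single expression.

```php
<?php
// Hypothetical helper: true when every keyword occurs at least once in $msg.
function containsAllKeywords(array $keywords, string $msg): bool
{
    $quoted = array_map(function ($k) { return preg_quote($k, '/'); }, $keywords);
    $patt = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Collect every keyword hit in one pass over the message.
    $found = array();
    preg_match_all($patt, $msg, $found);

    // Normalise case, then check that no keyword is missing from the hits.
    $seen = array_unique(array_map('strtolower', $found[0]));
    $wanted = array_map('strtolower', $keywords);
    return count(array_diff($wanted, $seen)) === 0;
}

$keywords = array('lose', 'weight', 'fast');
var_dump(containsAllKeywords($keywords, "Lose all the weight all the weight you want.  It's fast and easy!"));
var_dump(containsAllKeywords($keywords, 'Lose weight slowly by working really hard!'));
```

Unlike the while loop, this version always scans the whole message and stores every hit, which is a reasonable trade for shorter code when messages are small.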
