PHP Stop Word List


Question

I'm experimenting with stop words in my code. I have an array full of words that I'd like to check, and an array of stop words I want to check them against.

At the moment I'm looping through the array one word at a time and removing a word if in_array finds it in the stop word list, but I wonder if there's a better way of doing it. I've looked at array_diff and the like; however, if the first array contains multiple stop words, array_diff only appears to remove the first occurrence.

The focus is on speed and memory usage, but speed more so.

Edit:

The first array holds individual words taken from blog comments (which are usually quite long); the second array holds the individual stop words. Sorry for not making that clear.

Thanks

Recommended answer

Using str_replace...

A simple approach is to use str_replace or str_ireplace, each of which can take an array of 'needles' (things to search for), corresponding replacements, and an array of 'haystacks' (things to operate on).

$haystacks=array(
  "The quick brown fox",
  "jumps over the ",
  "lazy dog"
);

$needles=array(
  "the", "lazy", "quick"
);

$result=str_ireplace($needles, "", $haystacks);

var_dump($result);

This produces:

array(3) {
  [0]=>
  string(11) "  brown fox"
  [1]=>
  string(12) "jumps over  "
  [2]=>
  string(4) " dog"
}

Incidentally, a quick way to clean up the leading and trailing whitespace this leaves is to use array_map to call trim on each element:

$result=array_map("trim", $result);

The drawback of using str_replace is that it replaces matches found within words, rather than just whole words. To address that, we can use regular expressions...
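To make the drawback concrete, here is a minimal sketch that applies str_ireplace to one of the sentences used in this answer. The stop words "for" and "the" are removed from inside "fortify" and "theme" as well as from the standalone word.

```php
<?php
// Stop words are matched as plain substrings, not as whole words.
$needles = array("for", "the");
$haystack = "fortify the general theme";

// "fortify" loses "for", "theme" loses "the", and the standalone "the" is removed too.
$result = str_ireplace($needles, "", $haystack);

var_dump($result); // the result is "tify  general me"
```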

An approach using preg_replace looks very similar to the above, but the needles are regular expressions, and we check for a 'word boundary' at the start and end of the match using \b:

$haystacks=array(
  "For we shall use fortran to",
  "fortify the general theme",
  "of this torrent of nonsense"
);

$needles=array(
  '/\bfor\b/i', 
  '/\bthe\b/i', 
  '/\bto\b/i', 
  '/\bof\b/i'
);

$result=preg_replace($needles, "", $haystacks);
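If the stop words are stored as a plain list rather than as ready-made patterns, one possible variation is to build a single pattern from that list. The sketch below is not part of the original answer. The helper name removeStopWords is hypothetical, and the whitespace clean-up at the end is an added assumption about what the final output should look like.

```php
<?php
// Hypothetical helper, not from the original answer: strips whole-word
// stop words from each string, then tidies the leftover whitespace.
function removeStopWords(array $haystacks, array $stopWords): array
{
    // Escape each word so any regex metacharacters are matched literally.
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $stopWords);

    // One case-insensitive alternation, wrapped in word boundaries.
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // preg_replace accepts an array of subjects and returns an array.
    $stripped = preg_replace($pattern, '', $haystacks);

    // Collapse runs of whitespace left behind and trim both ends.
    return array_map(function ($text) {
        return trim(preg_replace('/\s+/', ' ', $text));
    }, $stripped);
}

$result = removeStopWords(
    array(
        "For we shall use fortran to",
        "fortify the general theme",
        "of this torrent of nonsense"
    ),
    array("for", "the", "to", "of")
);

print_r($result);
// $result holds "we shall use fortran", "fortify general theme",
// and "this torrent nonsense"; "fortran", "fortify", "theme" and
// "torrent" are left intact because of the \b word boundaries.
```

Using one combined pattern means each string is scanned once instead of once per stop word. Whether that is actually faster for a given stop word list would need to be measured.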
