PHP Detect Duplicate Text


Question

I have a site where users can put in a description about themselves.

Most users write something appropriate but some just copy/paste the same text a number of times (to create the appearance of a fair amount of text).

e.g.: "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"

Is there a good method to detect repetitive text with PHP?

The only idea I currently have is to break the text into separate words (delimited by spaces) and then check whether any word is repeated more than a set limit. Note: I'm not 100% sure how I would code this solution.

Thoughts on the best way to detect duplicate text? Or how to code the above idea?

Solution

This is a basic text classification problem. There are lots of articles out there on how to determine whether some text is spam or not, which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.

Granted, one approach would be to reconsider why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.

Here's an outline of what I would do:

  1. Build a histogram of word occurrences for the input string
  2. Study the histograms of some valid and invalid text
  3. Come up with a formula for classifying a histogram as valid or not

This approach requires you to figure out what differs between the two sets. Intuitively, I'd expect spam to have fewer unique words and, if you plot the histogram values, more of the area under the curve concentrated in the top few words.

Here's some sample code to get you going:

$str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';

// Build a histogram mapping words to occurrence counts
$hist = array();

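// Split on any number of consecutive whitespace characters;
// PREG_SPLIT_NO_EMPTY drops the empty tokens that leading or trailing whitespace would produce
foreach (preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY) as $word)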
{
  // Force all words lowercase to ignore capitalization differences
  $word = strtolower($word);

  // Count occurrences of the word
  if (isset($hist[$word]))
  {
    $hist[$word]++;
  }
  else
  {
    $hist[$word] = 1;
  }
}

// Once you're done, extract only the counts
$vals = array_values($hist);
rsort($vals); // Sort max to min

// Now that you have the counts, analyze and decide valid/invalid
var_dump($vals);

When you run this code on some repetitive strings, you'll see the difference. Here's a plot of the $vals array from the example string you gave:

[Plot: $vals for the example string]

Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia:

[Plot: $vals for the Martin Luther King Jr. bio excerpt]

A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
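
One way to turn the "long tail" observation into a number is the ratio of unique words to total words. The following is only a minimal sketch of step 3 from the outline: the function names wordHistogram and looksRepetitive and the 0.3 cutoff are assumptions made up for illustration, not part of the answer above, and the cutoff would need tuning against real valid and invalid bios (longer genuine texts naturally repeat common words more).

```php
<?php
// Hypothetical helpers: names and the default threshold are illustrative assumptions.

// Build a word => count histogram, most frequent word first.
function wordHistogram(string $text): array
{
    // Lowercase, then split on runs of whitespace, dropping empty tokens
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $hist = array_count_values($words);
    arsort($hist);
    return $hist;
}

// Flag text whose share of distinct words falls below $minUniqueRatio.
function looksRepetitive(string $text, float $minUniqueRatio = 0.3): bool
{
    $hist = wordHistogram($text);
    $total = array_sum($hist);
    if ($total === 0) {
        return false; // empty input: nothing to judge
    }
    // unique words / total words: low values mean heavy repetition
    return count($hist) / $total < $minUniqueRatio;
}
```

For the example string from the question there are 4 unique words out of 24, a ratio of about 0.17, so it falls below the assumed cutoff. A very short input such as "love love" has a ratio of 0.5 and would pass, so a real check would probably combine this with a minimum length or another signal.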

FYI, PHP has a stats package you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.
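
If installing an extension is not an option, simple statistics over the $vals counts can be computed in plain PHP. This is a rough sketch under that assumption; countStats is a hypothetical helper name, and it computes the population standard deviation (dividing by n), which is one choice among several.

```php
<?php
// Hypothetical helper: mean and population standard deviation of word counts.
function countStats(array $counts): array
{
    $n = count($counts);
    if ($n === 0) {
        return ['mean' => 0.0, 'stddev' => 0.0];
    }
    // Mean count per unique word = total words / unique words
    $mean = array_sum($counts) / $n;

    // Sum of squared deviations from the mean
    $sumSq = 0.0;
    foreach ($counts as $c) {
        $sumSq += ($c - $mean) ** 2;
    }
    return ['mean' => $mean, 'stddev' => sqrt($sumSq / $n)];
}
```

For the example string, $vals is four counts of 6, so the mean is 6 and the standard deviation is 0: every word is repeated equally often, which is what straight copy/paste produces. How these numbers should feed into a valid/invalid decision still has to come from studying real samples.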
