How would you code an anti-plagiarism site?

Question

First, please note that I am interested in how something like this would work; I am not intending to build it for a client, as I'm sure there may already be open source implementations.

How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out common words like 'the', 'a', etc., and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags an essay as a possible duplicate? Do they use levenshtein()?

My language of choice is PHP.

Update

I'm not thinking of checking for plagiarism globally, but rather within, say, 30 essays uploaded by one class, in case students have worked together on a strictly one-person assignment.

Here is an online site that claims to do so: http://www.plagiarism.org/

Recommended answer

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).

However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot calculate a text's Kolmogorov complexity exactly, but you can approximate it by simply compressing the text.
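For reference, the usual way to write the NCD is shown below. The notation is an assumption made here for illustration: C(x) stands for the length in bytes of the compressed text x, and xy for the two texts concatenated.

```latex
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```

With a real compressor the value typically falls between 0 (near-identical texts) and a little above 1 (texts with nothing in common).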

A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily, PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:

PHP:

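// Normalized Compression Distance: smaller values mean more similar texts.
function ncd($x, $y) {
  $cx = strlen(gzcompress($x));        // compressed size of $x alone
  $cy = strlen(gzcompress($y));        // compressed size of $y alone
  $cxy = strlen(gzcompress($x . $y));  // compressed size of both texts together
  return ($cxy - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test') . PHP_EOL);
print(ncd('this is a test', 'this text is completely different') . PHP_EOL);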

Python:

>>> from zlib import compress as c
>>> def ncd(x, y):
...     x, y = x.encode(), y.encode()  # zlib works on bytes in Python 3
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.7435897435897436
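The remark that some compression algorithms give better results than others can be checked with a small experiment. The following is only a sketch: the three compressors are simply the ones that ship with Python's standard library, and the sample texts are made up for illustration.

```python
import bz2
import lzma
import zlib

# Stand-in compressors from the standard library; any bytes -> bytes function works.
COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def ncd(x, y, compress=zlib.compress):
    """Normalized Compression Distance between two byte strings."""
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)


# Made-up sample texts (assumptions, not from the original answer).
essay_a = (
    b"Plagiarism detection compares a submitted essay against other essays "
    b"and looks for passages that are suspiciously similar. A simple approach "
    b"compresses each text and checks how much extra space the pair needs "
    b"when the two are compressed together."
)
# A lightly edited copy of essay_a.
essay_b = essay_a.replace(b"suspiciously similar", b"almost identical")
# An unrelated text.
essay_c = (
    b"The harbour was quiet that morning; fishing boats rocked gently while "
    b"gulls circled overhead, and nobody on the quay noticed the old "
    b"lighthouse keeper walking home with his dog."
)

for name, compress in COMPRESSORS.items():
    near = ncd(essay_a, essay_b, compress)
    far = ncd(essay_a, essay_c, compress)
    print(f"{name}: near-copy={near:.3f} unrelated={far:.3f}")
```

One thing to keep in mind when picking a compressor: DEFLATE, which zlib uses, only looks back 32 KiB, so it cannot find repeats that lie further apart than that. For long documents a compressor with a larger window, such as lzma, may separate copies from unrelated texts more clearly.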

Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
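For the scenario in the question (a single class of around 30 essays), one possible way to use the NCD is to score every pair and look at the most similar pairs first. This is a minimal sketch under stated assumptions: the helper name `rank_pairs`, the student names, and the essay texts are all hypothetical, and no cut-off value is suggested because a sensible one would have to be found by experiment.

```python
import zlib
from itertools import combinations


def ncd(x, y):
    """Normalized Compression Distance between two strings, using zlib."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = len(zlib.compress(bx)), len(zlib.compress(by))
    return (len(zlib.compress(bx + by)) - min(cx, cy)) / max(cx, cy)


def rank_pairs(essays):
    """Return (ncd, name_a, name_b) for every pair, most similar pair first.

    `essays` maps a student name to that student's essay text.
    """
    scored = [
        (ncd(essays[a], essays[b]), a, b)
        for a, b in combinations(sorted(essays), 2)
    ]
    return sorted(scored)


# Hypothetical submissions: "bob" is a lightly edited copy of "alice".
essays = {
    "alice": (
        "The industrial revolution changed how people lived and worked. "
        "Factories drew families from farms into crowded cities, and steam "
        "power made mass production possible for the first time."
    ),
    "carol": (
        "Photosynthesis lets plants turn sunlight, water and carbon dioxide "
        "into sugar and oxygen. Chlorophyll in the leaves absorbs light, "
        "which drives the chemical reactions inside each cell."
    ),
}
essays["bob"] = essays["alice"].replace("changed", "transformed").replace("crowded", "growing")

for score, a, b in rank_pairs(essays):
    print(f"{a} vs {b}: {score:.3f}")
```

With 30 essays this is 435 pairs, which is cheap to compute. The pairs at the top of the list are candidates for a human to review, not proof of copying.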
