Detecting copied or similar text blocks

Question

I have a collection of texts about programming, written in Markdown. A build process converts those texts to Word/HTML and also applies simple validation rules, such as spell checking and checking that each document has the required header structure. I would like to extend that build code to also check for copy-pasted or similar chunks across all the texts.

Is there an existing Java/Groovy library that can help with that analysis?

My first idea was to use PMD's CopyPasteDetector (CPD), but it is heavily oriented towards analysing real source code, and I don't see how to use it on ordinary text.

Recommended answer

I ended up using CPD and Groovy after all. Here is the code, in case anyone is interested:

import net.sourceforge.pmd.cpd.Tokens
import net.sourceforge.pmd.cpd.TokenEntry
import net.sourceforge.pmd.cpd.CPDNullListener
import net.sourceforge.pmd.cpd.MatchAlgorithm
import net.sourceforge.pmd.cpd.SourceCode
import net.sourceforge.pmd.cpd.SourceCode.StringCodeLoader

// Prepare empty token data.
TokenEntry.clearImages()
def tokens = new Tokens()

// List all source files with text.
def source = new TreeMap<String, SourceCode>()
new File('.').eachFile { file ->
  if (file.isFile() && file.name.endsWith('.txt')) {
    def analyzedText = file.text
    def sourceCode = new SourceCode(new StringCodeLoader(analyzedText, file.name))
    source.put(sourceCode.fileName, sourceCode)
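    // Split each line on runs of non-word characters (\W already covers whitespace)
    // and record every word with its 1-based line number.
    // String.eachLine counts lines from 0, hence the + 1.
    analyzedText.eachLine { line, lineNumber ->
      line.split('\\W+').each { token ->
        if (token) {
          tokens.add(new TokenEntry(token, sourceCode.fileName, lineNumber + 1))
        }
      }
    }
    // Mark the end of this file's token stream.
    tokens.add(TokenEntry.getEOF())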
  }
}

// Run matching algorithm.
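// The third constructor argument is the minimum number of consecutive
// tokens that must repeat before a match is reported.
def minTokenChain = 15
def matchAlgorithm = new MatchAlgorithm(source, tokens, minTokenChain, new CPDNullListener())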
matchAlgorithm.findMatches()

// Produce report.
matchAlgorithm.matches().each { match ->
  println "  ========================================"
  match.iterator().each { mark ->
    println "  DUPLICATION ERROR: <${mark.tokenSrcID}:${mark.beginLine}> [DUPLICATION] Found a ${match.lineCount} line (${match.tokenCount} tokens) duplication!"
  }
  def indentedTextSlice = ""
  match.sourceCodeSlice.eachLine { line ->
    indentedTextSlice += "  $line\n"
  }
  println "  ----------------------------------------"
  println indentedTextSlice
  println "  ========================================"
}
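
For readers who want to see the underlying idea without the PMD dependency, here is a minimal plain-Java sketch. It is not CPD's implementation, only an illustration of the same approach under simple assumptions: text is split into words the same way as in the script above, every run of `windowSize` consecutive words is used as a key, and any key seen at more than one position is reported. The class name `DuplicateChunks`, the helper names, and the sample texts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DuplicateChunks {

    /** A word together with the file and 1-based line it came from. */
    static final class Token {
        final String word;
        final String file;
        final int line;

        Token(String word, String file, int line) {
            this.word = word;
            this.file = file;
            this.line = line;
        }
    }

    /** Splits text into words on runs of non-word characters, line by line. */
    static List<Token> tokenize(String file, String text) {
        List<Token> tokens = new ArrayList<>();
        String[] lines = text.split("\\R", -1);
        for (int i = 0; i < lines.length; i++) {
            for (String word : lines[i].split("\\W+")) {
                if (!word.isEmpty()) {
                    tokens.add(new Token(word, file, i + 1));
                }
            }
        }
        return tokens;
    }

    /**
     * Maps every run of windowSize consecutive words that occurs more than once
     * to the "file:line" positions where each occurrence starts.
     * Files are tokenized separately, so a run never spans two files.
     */
    static Map<String, List<String>> findDuplicates(Map<String, String> texts, int windowSize) {
        Map<String, List<String>> seen = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : texts.entrySet()) {
            List<Token> tokens = tokenize(entry.getKey(), entry.getValue());
            for (int i = 0; i + windowSize <= tokens.size(); i++) {
                StringBuilder key = new StringBuilder();
                for (int j = i; j < i + windowSize; j++) {
                    if (j > i) {
                        key.append(' ');
                    }
                    key.append(tokens.get(j).word);
                }
                Token first = tokens.get(i);
                seen.computeIfAbsent(key.toString(), k -> new ArrayList<>())
                    .add(first.file + ":" + first.line);
            }
        }
        // Keep only the word runs that were seen at two or more positions.
        seen.values().removeIf(positions -> positions.size() < 2);
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: file name -> file content.
        Map<String, String> texts = new LinkedHashMap<>();
        texts.put("a.txt", "Intro line\nthe quick brown fox jumps over the lazy dog");
        texts.put("b.txt", "We saw the quick brown fox jumps again");
        findDuplicates(texts, 4)
            .forEach((chunk, where) -> System.out.println(chunk + " -> " + where));
    }
}
```

Unlike CPD, this sketch reports each overlapping window separately instead of merging them into one longest match, and it only finds exact word-for-word repeats. Catching merely similar chunks would need an extra step, for example normalizing case or comparing sets of word runs between paragraphs.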
