Detecting programming language from a snippet

Question

What would be the best way to detect which programming language a snippet of code is written in?

Recommended answer

I think the method used in spam filters would work very well. You split the snippet into words. Then you compare how often these words occur against snippets in known languages, and for every language you're interested in you compute the probability that the snippet is written in that language.

http://en.wikipedia.org/wiki/Bayesian_spam_filtering
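
As a rough sketch of the scoring this describes (the notation is mine, not the answer's): for a snippet with words w_1, ..., w_n and a candidate language L, the naive Bayes assumption treats the words as independent given the language, so the detector picks

```latex
\hat{L} = \arg\max_{L} \; P(L) \prod_{i=1}^{n} P(w_i \mid L)
        = \arg\max_{L} \left[ \log P(L) + \sum_{i=1}^{n} \log P(w_i \mid L) \right]
```

Here P(L) and P(w_i | L) are estimated from word counts in the training snippets. The second form, a sum of logarithms, is the one that is practical to compute, because multiplying many small probabilities underflows.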

Once you have the basic mechanism, it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.

I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases such as:

print "Hello"

Let me find the code.

I couldn't find the code, so I wrote a new version. It's a bit simplistic, but it works for my tests. Currently, if you feed it much more Python code than Ruby code, it's likely to say that this code:

def foo
   puts "hi"
end

is Python code (although it is really Ruby). This is because Python has a def keyword too. So if it has seen def 1000 times in Python and 100 times in Ruby, it may still say Python even though puts and end are Ruby-specific. You could fix this by keeping track of the total number of words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).

Hope this helps:

class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end

  def words(code)
    # Keep runs of letters and underscores, so mixed-case tokens like "System" stay intact
    code.scan(/[A-Za-z_]+/)
  end

  def train(code,lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end

  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs 
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      # ws.sum returns 0 for a snippet with no words, where reduce(:+) would return nil
      Math.log(@totals[lang]) +
        ws.sum { |w| Math.log(@data[lang][w]) }
    end
  end
end

# Example usage

c = Classifier.new

# Train from files (File.read closes the file after reading)
c.train(File.read("code.rb"), :ruby)
c.train(File.read("code.py"), :python)
c.train(File.read("code.cs"), :csharp)

# Test it on another file
c.classify(File.read("code2.py")) # => :python (hopefully)
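
The fix suggested earlier (dividing by the number of words seen per language) could look roughly like the sketch below. It turns raw counts into per-language word frequencies with add-one smoothing, so a language with more training data no longer wins just because its counts are larger. The class name NormalizedClassifier, the smoothing choice, the uniform prior over languages, and the inline training strings are all my assumptions for illustration, not part of the original code. It needs Ruby 2.4 or later for Array#sum.

```ruby
# Hypothetical variant of the classifier: a sketch, not the answer's original code.
class NormalizedClassifier
  def initialize
    @counts = Hash.new { |h, lang| h[lang] = Hash.new(0) } # lang => word => count
    @word_totals = Hash.new(0)                              # lang => total words seen
    @vocab = {}                                             # every distinct word seen
  end

  def words(code)
    code.scan(/[A-Za-z_]+/)
  end

  def train(code, lang)
    words(code).each do |w|
      @counts[lang][w] += 1
      @word_totals[lang] += 1
      @vocab[w] = true
    end
  end

  def classify(code)
    ws = words(code)
    @counts.keys.max_by do |lang|
      # Add-one smoothing: unseen words get a small nonzero frequency.
      denom = @word_totals[lang] + @vocab.size
      # Sum of log frequencies; assumes every language is equally likely a priori.
      ws.sum { |w| Math.log((@counts[lang][w] + 1).to_f / denom) }
    end
  end
end

c = NormalizedClassifier.new

# Deliberately unbalanced training data: ten Python snippets, one Ruby snippet.
10.times { c.train("def foo(x):\n    return x\n", :python) }
c.train("def foo\n  puts \"hi\"\nend\n", :ruby)

result = c.classify("def foo\n  puts \"hi\"\nend\n")
p result # prints :ruby with this training data
```

Because each word's count is divided by the total for its language, the ten-to-one imbalance in def no longer outweighs the Ruby-specific puts and end.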
