Code for identifying programming language in a text file

Problem description

I'm supposed to write code which, when given a text file (source code) as input, will output which programming language it is written in. This is the most basic definition of the problem. More constraints follow:

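  • I must write it in C++.
  • A wide variety of languages should be recognized: HTML, PHP, Perl, Ruby, C, C++, Java, C#...
  • The number of false positives (wrong recognitions) should be low; it is better to output "unknown" than a wrong result. (In the list of probabilities this would appear as, for example, unknown: 100%, see below.)
  • The output should be a list of probabilities for each language the code knows, so if it knows C, Java and Perl, the output should be, for example: C: 70%, Java: 50%, Perl: 30% (note that the probabilities do not need to sum to 100%).
  • It should have a good accuracy/speed ratio (speed is favored a bit more).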

It would be very nice if the code could be written so that adding new languages for recognition is fairly easy and involves just adding "settings/data" for that particular language. I can use anything available: a heuristic, a neural network, black magic. Anything. I am even allowed to use existing solutions, but the solution must be free, open source, and allow commercial usage. It must come in the form of easily integrable source code or as a static library, no DLL. However, I prefer writing my own code or just using fragments of another solution; I'm fed up with integrating other people's code. Last note: maybe some of you will suggest FANN (Fast Artificial Neural Network Library). This is the only thing I cannot use, since it is what we ALREADY use and want to replace.

Now the question is: how would you handle such a task, what would you do? Any suggestions on how to implement this or what to use?

Edit: based on the comments and answers, I must emphasize some things I forgot. Speed is crucial, since this will receive thousands of files and is supposed to answer fast, so looking at a thousand files should produce answers for all of them in a few seconds at most (the files will be small, of course, a few kB each). So trying to compile each one is out of the question. The thing is that I really want probabilities for each language, so I would rather know that the file is likely to be C or C++ but that the chance of it being a bash script is very low. Due to code obfuscation, comments, etc., I think that aiming for 100% accuracy is a bad idea, and in fact it is not the goal here.

Recommended answer

You have a document classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles on them link to libraries that implement these algorithms, and many of those libraries have C++ interfaces.
