Ruby Text Analysis
Question
Is there a Ruby gem or other tool for text analysis? I am looking for word frequency, pattern detection, and so forth (preferably with an understanding of French).
Recommended answer
Language models are the generalization of word frequencies: unigrams (single-word frequencies), bigrams (frequencies of word pairs), trigrams (frequencies of word triples), and, in general, n-grams.
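To make the idea concrete, here is a minimal sketch of n-gram counting in plain Ruby. It is only meant for small texts and is not a replacement for a real toolkit. The helper names `tokenize` and `ngram_counts` are made up for this illustration, and the tokenizer is a naive assumption: it lowercases the text and keeps runs of Unicode letters, so French elisions such as "l'eau" are split into two tokens and n-grams may span sentence boundaries.

```ruby
# Naive tokenizer (an assumption for this sketch): lowercase the text and
# keep only runs of Unicode letters, so accented French words stay intact.
def tokenize(text)
  text.downcase.scan(/\p{L}+/)
end

# Count every run of n consecutive tokens.
# Returns a Hash mapping an array of n words to its number of occurrences.
def ngram_counts(text, n)
  counts = Hash.new(0)
  tokenize(text).each_cons(n) { |gram| counts[gram] += 1 }
  counts
end

text = "Le chat dort. Le chat mange."

unigrams = ngram_counts(text, 1)
bigrams  = ngram_counts(text, 2)

puts unigrams[["chat"]]        # single-word frequency
puts bigrams[["le", "chat"]]   # word-pair frequency
```

Each key is an array of words, so the same method covers unigrams, bigrams, trigrams, and any higher order by changing `n`.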
You should look for an existing language-model toolkit; re-inventing the wheel here is not a good idea.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed, because you have to process huge corpora) and write their output as ARPA n-gram files, a standard format that is typically plain text.
Check the following thread, which contains more details and links:
Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or you will need to convert the ARPA format into your own format.
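If you go the conversion route, a rough sketch of reading an ARPA file into nested Ruby hashes could look like the following. It assumes the usual ARPA text layout: a `\data\` header, one `\N-grams:` section per order, and data lines of the form "log10 probability, N words, optional backoff weight". The method name `parse_arpa` and the toy model values are invented for this example and do not come from any toolkit.

```ruby
# Parse ARPA-style text into { order => { [words] => { logprob:, backoff: } } }.
# Accepts anything that responds to each_line (a String or an open File).
def parse_arpa(source)
  model = Hash.new { |hash, order| hash[order] = {} }
  order = nil # current n-gram order; nil while outside an \N-grams: section

  source.each_line do |raw|
    line = raw.strip
    next if line.empty?

    if (match = line.match(/\A\\(\d+)-grams:\z/))
      order = match[1].to_i          # entering a new \N-grams: section
    elsif line.start_with?("\\")
      order = nil                    # \data\ or \end\ marker
    elsif order
      fields  = line.split
      logprob = Float(fields[0])
      words   = fields[1, order]
      backoff = fields[order + 1] && Float(fields[order + 1])
      model[order][words] = { logprob: logprob, backoff: backoff }
    end
    # Lines such as "ngram 1=3" in the \data\ header are skipped (order is nil).
  end

  model
end

# Toy model with made-up numbers, for illustration only.
sample = <<~'ARPA'
  \data\
  ngram 1=3
  ngram 2=2

  \1-grams:
  -0.5 le -0.3
  -0.7 chat -0.2
  -1.0 dort

  \2-grams:
  -0.2 le chat
  -0.4 chat dort

  \end\
ARPA

model = parse_arpa(sample)
puts model[2][["le", "chat"]][:logprob]
puts model[1][["le"]][:backoff]
```

For a real file you would pass `File.open(path)` instead of a string. A model trained on a large corpus can be big, so you may want to write the parsed entries to a database or a more compact format rather than keep them all in memory.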
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info.
Last but not least, check Google's online N-gram tool. They built n-grams based on the books they digitized, and the data is also available in French and other languages!