N-grams: Explanation + 2 applications

Question

I want to implement some applications with n-grams (preferably in PHP).

Which type of n-gram is more adequate for most purposes, a word-level or a character-level n-gram? And how could you implement an n-gram tokenizer in PHP?

First, I would like to know what n-grams exactly are. Is this correct? This is how I understand n-grams:

Sentence: "I live in NY."

word-level bigrams (n = 2): "# I", "I live", "live in", "in NY", "NY #"

character-level bigrams (n = 2): "#I", "I#", "#l", "li", "iv", "ve", "e#", "#i", "in", "n#", "#N", "NY", "Y#"

When you have this array of n-gram parts, you drop the duplicates and add a counter for each part giving its frequency:

word-level bigrams: [1, 1, 1, 1, 1]

character-level bigrams: [2, 1, 1, ...]

Is this correct?
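
A minimal PHP sketch of a tokenizer that would produce the bigrams above (the helper names charNgrams/wordNgrams and the '#' padding convention are just illustrative choices, not an established API):

<?php
// Character-level n-grams: pad each word with '#' and slide a window of size $n.
// (For multibyte text you would use mb_substr/mb_strlen instead of substr/strlen.)
function charNgrams(string $text, int $n = 2): array {
    $grams = [];
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        $padded = '#' . $word . '#';
        for ($i = 0; $i <= strlen($padded) - $n; $i++) {
            $grams[] = substr($padded, $i, $n);
        }
    }
    return $grams;
}

// Word-level n-grams: pad the token list with '#' and slide a window of size $n.
function wordNgrams(string $text, int $n = 2): array {
    $tokens = array_merge(['#'], preg_split('/\s+/', trim($text)), ['#']);
    $grams = [];
    for ($i = 0; $i <= count($tokens) - $n; $i++) {
        $grams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $grams;
}

// Frequencies: count how often each n-gram occurs (period omitted to match the example).
print_r(array_count_values(wordNgrams('I live in NY')));
print_r(array_count_values(charNgrams('I live in NY')));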

Furthermore, I would like to learn more about what you can do with n-grams:

  • How can you identify the language of a text using n-grams?
  • Can you do machine translation with n-grams even if you don't have a bilingual corpus?
  • How do you build a spam filter (spam/ham)? Combine n-grams with a Bayesian filter?
  • How can you do topic spotting? For example: is a text about basketball or about dogs? My approach (do the following with the Wikipedia articles for "dogs" and "basketball"): build the n-gram vectors for both documents, normalize them, compute the Manhattan/Euclidean distance; the closer the result is to 1, the higher the similarity (see the sketch after this list).
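
A rough PHP sketch of that last approach, assuming each document has already been reduced to an array of n-gram counts (for example with the tokenizer sketched above); the normalize and manhattanDistance helpers are illustrative names, not an existing library:

<?php
// Normalize raw counts to relative frequencies so document length does not dominate.
function normalize(array $counts): array {
    $total = array_sum($counts);
    return $total > 0 ? array_map(fn($c) => $c / $total, $counts) : $counts;
}

// Manhattan distance over the union of n-grams seen in either document.
function manhattanDistance(array $a, array $b): float {
    $dist = 0.0;
    foreach (array_unique(array_merge(array_keys($a), array_keys($b))) as $gram) {
        $dist += abs(($a[$gram] ?? 0) - ($b[$gram] ?? 0));
    }
    return $dist;
}

// Toy frequency arrays keyed by n-gram (in practice built from the Wikipedia articles).
$docA = normalize(['the dog' => 2, 'dog barks' => 1, 'a bone' => 1]);
$docB = normalize(['the ball' => 2, 'a dunk' => 1, 'the dog' => 1]);
echo manhattanDistance($docA, $docB); // smaller distance = more similar documents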

What do you think of my application approaches, especially the last one?

I hope you can help me. Thanks in advance!

Answer

Word n-grams will generally be more useful for most of the text analysis applications you mention, with the possible exception of language detection, where something like character trigrams might give better results. Effectively, you would create an n-gram vector for a corpus of text in each language you are interested in detecting, and then compare the trigram frequencies in each corpus to the trigram frequencies in the document you are classifying. For example, the trigram "the" probably appears much more frequently in English than in German and would provide some level of statistical correlation. Once you have your documents in n-gram format, you have a choice of many algorithms for further analysis: Bayesian filters, N-nearest neighbor, support vector machines, etc.
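
A rough PHP sketch of that language-detection idea, with tiny sample sentences standing in for real per-language corpora (trigramProfile and detectLanguage are illustrative helpers, not a library API):

<?php
// Build a relative-frequency profile of character trigrams for one text.
function trigramProfile(string $text): array {
    $text = mb_strtolower($text);
    $counts = [];
    for ($i = 0; $i <= mb_strlen($text) - 3; $i++) {
        $gram = mb_substr($text, $i, 3);
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    $total = array_sum($counts);
    return $total > 0 ? array_map(fn($c) => $c / $total, $counts) : [];
}

// Pick the language whose trigram profile is closest to the document's profile.
function detectLanguage(array $docProfile, array $langProfiles): string {
    $best = '';
    $bestDist = INF;
    foreach ($langProfiles as $lang => $profile) {
        $dist = 0.0;
        foreach (array_unique(array_merge(array_keys($profile), array_keys($docProfile))) as $g) {
            $dist += abs(($profile[$g] ?? 0) - ($docProfile[$g] ?? 0));
        }
        if ($dist < $bestDist) {
            $bestDist = $dist;
            $best = $lang;
        }
    }
    return $best;
}

// Tiny sample sentences stand in for a real training corpus per language.
$profiles = [
    'en' => trigramProfile('the quick brown fox jumps over the lazy dog'),
    'de' => trigramProfile('der schnelle braune Fuchs springt über den faulen Hund'),
];
echo detectLanguage(trigramProfile('the dog sleeps in the sun'), $profiles); // likely "en"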

Of the applications you mention, machine translation is probably the most far-fetched, as n-grams alone will not bring you very far down that path. Converting an input file to an n-gram representation is just a way to put the data into a format for further feature analysis, but since you lose a lot of contextual information, it may not be very useful for translation.

One thing to watch out for is that it isn't enough to create a vector [1,1,1,2,1] for one document and a vector [2,1,2,4] for another document if the dimensions don't match. That is, the first entry in the vector cannot be "the" in one document and "is" in another, or the algorithms won't work. You will wind up with vectors like [0,0,0,0,1,1,0,0,2,0,0,1], since most documents will not contain most of the n-grams you are interested in. This 'lining up' of features is essential, and it requires you to decide 'in advance' which n-grams you will include in your analysis. Often this is implemented as a two-pass algorithm: first determine the statistical significance of the various n-grams, then decide which ones to keep. Google 'feature selection' for more information.
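
A minimal PHP sketch of this 'lining up' step, assuming for simplicity that the feature set is just the most frequent n-grams across the corpus, as a stand-in for real feature selection:

<?php
// First pass: decide the feature set 'in advance', here simply the N most
// frequent n-grams across the whole corpus.
function selectFeatures(array $documents, int $topN = 1000): array {
    $totals = [];
    foreach ($documents as $counts) {
        foreach ($counts as $gram => $c) {
            $totals[$gram] = ($totals[$gram] ?? 0) + $c;
        }
    }
    arsort($totals);
    return array_keys(array_slice($totals, 0, $topN, true));
}

// Second pass: every document becomes a vector with the same dimensions,
// zero-filled where an n-gram does not occur.
function toVector(array $counts, array $features): array {
    return array_map(fn($gram) => $counts[$gram] ?? 0, $features);
}

// Toy per-document n-gram counts.
$docs = [
    ['the' => 4, 'he ' => 3, 'ing' => 2],
    ['der' => 5, 'ein' => 2, 'the' => 1],
];
$features = selectFeatures($docs, 4);
print_r(toVector($docs[0], $features)); // same length and n-gram order for every document
print_r(toVector($docs[1], $features));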

Word-based n-grams plus support vector machines are an excellent way to perform topic spotting, but you need a large corpus of text pre-classified into 'on topic' and 'off topic' to train the classifier. You will find a large number of research papers explaining various approaches to this problem on a site like citeseerx. I would not recommend the Euclidean distance approach for this problem, as it does not weight individual n-grams based on statistical significance, so two documents that both include "the", "a", "is", and "of" would be considered a better match than two documents that both include "Bayesian". Removing stop-words from the n-grams you are interested in would improve this somewhat.
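
As a small illustration of that last point, a PHP sketch that drops word n-grams consisting only of stop-words before the vectors are built (the stop-word list here is just a toy example):

<?php
// Drop word n-grams made up entirely of stop-words before building feature vectors.
function dropStopwordGrams(array $grams, array $stopWords): array {
    return array_values(array_filter($grams, function ($gram) use ($stopWords) {
        foreach (explode(' ', $gram) as $word) {
            if (!in_array(strtolower($word), $stopWords, true)) {
                return true; // keep n-grams containing at least one content word
            }
        }
        return false;
    }));
}

$stopWords = ['the', 'a', 'is', 'of', 'in', 'and']; // tiny illustrative list
print_r(dropStopwordGrams(['the dog', 'is a', 'dog barks'], $stopWords));
// keeps "the dog" and "dog barks", drops "is a"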
