Statistical language model: comparing word sequences of different lengths

Problem description

I have an algorithm that extracts company names from text. It generally does a good job; however, it sometimes also extracts strings that look like company names but obviously aren't. For example, "Contact Us", "Colorado Springs CO", and "Cosmetic Dentist" are obviously not company names. There are too many such false positives to blacklist, so I want to introduce an algorithmic way of ranking the extracted strings, so that the lowest-ranking ones can be discarded.

Currently, I'm thinking of using a statistical language model to do this. This model can score each string based on the product of the probabilities of each individual word in the string (considering the simplest unigram model). My question is: can such a model be used to compare word sequences of different lengths? Since probabilities are by definition less than 1, the probabilities for longer sequences are usually going to be smaller than for shorter ones. This would bias the model against longer sequences, which isn't a good thing.
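As a minimal sketch of this length bias (the unigram probabilities below are made up purely for illustration):

```python
import math

# Toy unigram probabilities -- assumed values for illustration only.
unigram_p = {"nec": 0.001, "corporation": 0.002, "of": 0.03, "america": 0.001}

def unigram_score(words):
    """Product of per-word probabilities under a unigram model."""
    return math.prod(unigram_p[w] for w in words)

short = unigram_score(["nec"])
long_ = unigram_score(["nec", "corporation", "of", "america"])
# Every extra factor is < 1, so the longer (and more company-like)
# string always scores lower than its own prefix.
assert long_ < short
```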

Is there a way to compare word sequences of different lengths using such statistical language models? Alternatively, is there a better way to score the sequences?

For example, with a bigram model and some existing data, this is what I get:

python slm.py About NEC
        <s> about 6
        about nec 1
        nec </s> 1
4.26701019773e-17
python slm.py NEC
        <s> nec 6
        nec </s> 1
2.21887517189e-11
python slm.py NEC Corporation
        <s> nec 6
        nec corporation 3
        corporation </s> 3593
4.59941029214e-13
python slm.py NEC Corporation of
        <s> nec 6
        nec corporation 3
        corporation of 41
        of </s> 1
1.00929844083e-20
python slm.py NEC Corporation of America
        <s> nec 6
        nec corporation 3
        corporation of 41
        of america 224
        america </s> 275
1.19561436587e-21

The indented lines show the bigrams and their frequency in the model. <s> and </s> are start and end of sentence, respectively. The problem is, the longer the sentence, the less probable it is, regardless of how often its constituent bigrams occur in the database.
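The script itself isn't shown, but a hypothetical reconstruction of a bigram scorer like `slm.py` might use maximum-likelihood estimates over bigram counts (the counts below are illustrative, not the question's actual model):

```python
import math
from collections import Counter

# Illustrative counts only -- not the question's actual model.
bigram_counts = Counter({("<s>", "nec"): 6,
                         ("nec", "corporation"): 3,
                         ("corporation", "</s>"): 3593})
unigram_counts = Counter({"<s>": 100000, "nec": 10, "corporation": 4000})

def bigram_prob(w1, w2):
    # MLE: count(w1 w2) / count(w1); no smoothing, so unseen bigrams get 0.
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def score(sentence):
    # Wrap the sentence in start/end markers and multiply the
    # transition probabilities, as in the transcript above.
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return math.prod(bigram_prob(a, b) for a, b in zip(words, words[1:]))

print(score("NEC Corporation"))
```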

Answer

Can you normalize the scores based on sentence length, or use an EM algorithm over unigram, bigram and trigram models?
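One simple normalization, sketched here using the raw scores from the question's transcript, is to take the geometric mean of the per-transition probabilities (equivalently, compare average log-probability per bigram instead of the raw product):

```python
# Raw bigram-product scores and transition counts taken from the
# transcript in the question.
raw = {
    "NEC": (2.21887517189e-11, 2),
    "NEC Corporation": (4.59941029214e-13, 3),
    "NEC Corporation of America": (1.19561436587e-21, 5),
}

def normalized(name):
    p, n = raw[name]
    return p ** (1.0 / n)  # geometric mean of the per-transition probabilities

# After normalization, the full company name is no longer penalized
# merely for being long.
for name in raw:
    print(name, normalized(name))
assert normalized("NEC Corporation of America") > normalized("NEC")
```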

Update on 9/24:

There are probably a few alternatives you could try. One way is to make maximum-likelihood estimates on unigram, bigram and trigram models and take a linear interpolation (see: http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf). For each word at position i, you can determine whether position (i+1) is the end of the sentence, or which word is most likely to appear there. This method requires you to set up training and test data sets to evaluate performance (perplexity).
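A sketch of the interpolation step (the lambda weights below are assumed; in practice they are tuned on held-out data, e.g. with EM):

```python
from collections import Counter

# Tiny illustrative corpus; a real model needs far more data.
corpus = [
    "<s> nec corporation of america </s>".split(),
    "<s> nec corporation </s>".split(),
]

uni = Counter(w for s in corpus for w in s)
bi = Counter(p for s in corpus for p in zip(s, s[1:]))
tri = Counter(t for s in corpus for t in zip(s, s[1:], s[2:]))
total = sum(uni.values())

# Assumed interpolation weights (must sum to 1); tune on held-out data.
L1, L2, L3 = 0.1, 0.3, 0.6

def p_interp(w, u, v):
    """P(w | u, v) as a weighted mix of trigram, bigram and unigram MLEs."""
    p_uni = uni[w] / total
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    return L3 * p_tri + L2 * p_bi + L1 * p_uni

print(p_interp("corporation", "<s>", "nec"))
```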

I would avoid simply multiplying the probabilities of each individual word, since words are not independent; for example, P(NEC, Corporation) != P(NEC) * P(Corporation).
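With toy counts (assumed, chosen so that "Corporation" almost always follows "NEC"), it is easy to see how badly the independence assumption underestimates the collocation:

```python
# Assumed toy counts -- "corporation" follows "nec" in 9 of its 10 occurrences.
total_words = 1_000_000
count_nec, count_corporation = 10, 4000
count_nec_corporation = 9  # adjacent occurrences of "nec corporation"

p_independent = (count_nec / total_words) * (count_corporation / total_words)
p_bigram_joint = count_nec_corporation / total_words

# The unigram product misses the collocation by orders of magnitude.
assert p_bigram_joint > 100 * p_independent
print(p_independent, p_bigram_joint)
```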
