Identifying a person's name vs. a dictionary word

Question

Is there some way to recognize whether a word is likely, or unlikely, to be a person's name?

So for the word "understanding" I would get a probability of 0.01, whereas "Johnson" would return 0.99, a word like "Smith" would return 0.75, and a word like "Apple" 0.15.

Is there any way to do this?

The goal is that if someone searches for, say, Charles Darwin galapagos, the search engine guesses that it should search the author field for Charles and Darwin, and the title and abstract fields for galapagos.

Recommended answer

My quick hack would be:

Get the census bureau's list of names ordered by popularity; it's freely available. Give each name a normalized popularity score (1.0 = most popular, 0.0 = least popular).
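
One way to turn an ordered list into scores is a simple linear rank normalization. This is only a sketch: the function name `normalize_by_rank` and the five sample names are made up for illustration, and a real list from the census data might be better served by a count-based or logarithmic scale.

```python
def normalize_by_rank(names_by_popularity):
    """Map names ordered most-popular-first to scores in [0.0, 1.0].

    The first name gets 1.0, the last gets 0.0, and the rest are
    spaced evenly in between (an assumed, purely rank-based scale).
    """
    n = len(names_by_popularity)
    if n == 0:
        return {}
    if n == 1:
        return {names_by_popularity[0].lower(): 1.0}
    return {
        name.lower(): 1.0 - i / (n - 1)
        for i, name in enumerate(names_by_popularity)
    }


# Hypothetical sample; a real list would be loaded from the census file.
sample_names = ["Smith", "Johnson", "Williams", "Brown", "Jones"]
name_scores = normalize_by_rank(sample_names)
```

Keys are lowercased so that later lookups do not depend on how the user capitalized the query.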

Then get an open-source dictionary and do some research to pull together a frequency score for every word. You can find a frequency list at Wiktionary. Assign every word a popularity score from 1.0 to 0.0. The convenient thing is that if you can't find a word on the frequency list, you get to assume it's a pretty uncommon word.
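
A minimal sketch of that step, assuming the frequency list has already been loaded as a word-to-count mapping. The counts, the tiny dictionary, the helper names, and the fallback value `UNCOMMON_SCORE` are all invented for illustration.

```python
# Assumed fallback for dictionary words missing from the frequency list.
UNCOMMON_SCORE = 0.01


def frequency_scores(word_counts):
    """Scale raw counts so the most frequent word scores 1.0."""
    top = max(word_counts.values(), default=0)
    if top == 0:
        return {}
    return {word.lower(): count / top for word, count in word_counts.items()}


def word_popularity(word, dictionary, scores):
    """Return a popularity score, or None if the word is not a dictionary word."""
    key = word.lower()
    if key not in dictionary:
        return None
    # In the dictionary but not on the frequency list: treat as uncommon.
    return scores.get(key, UNCOMMON_SCORE)


# Hypothetical data standing in for a real dictionary and frequency list.
dictionary = {"the", "apple", "understanding", "zymurgy"}
scores = frequency_scores({"the": 1000, "understanding": 200, "apple": 50})

common = word_popularity("the", dictionary, scores)
rare = word_popularity("zymurgy", dictionary, scores)
missing = word_popularity("Johnson", dictionary, scores)
```

Returning `None` for words outside the dictionary keeps "not a dictionary word" distinct from "a rare dictionary word", which matters when the two lists are combined.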

Look the word up on both lists. If it's on just one or the other, you're done. If it's on both, use a formula to compute a weighted probability, something like (name popularity) / (name popularity + other popularity). If it's not on either list, it's probably a name.
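
The lookup rule can be sketched as follows. The scores in the two dictionaries are invented, and so are the value returned for words on neither list (`unknown=0.9`), the hard 1.0 and 0.0 for words on only one list, and the 0.5 routing threshold; treat all of them as tunable assumptions rather than part of the answer.

```python
def name_probability(word, name_scores, word_scores, unknown=0.9):
    """Estimate how likely a word is to be a person's name."""
    key = word.lower()
    name_pop = name_scores.get(key)
    word_pop = word_scores.get(key)
    if name_pop is None and word_pop is None:
        return unknown  # on neither list: probably a name
    if word_pop is None:
        return 1.0  # only on the name list
    if name_pop is None:
        return 0.0  # only on the dictionary list
    total = name_pop + word_pop
    if total == 0:
        return 0.5  # both scores are zero, so there is no evidence either way
    return name_pop / total


def route_query(query, name_scores, word_scores, threshold=0.5):
    """Split query terms into likely author terms and likely text terms."""
    author_terms, text_terms = [], []
    for term in query.split():
        if name_probability(term, name_scores, word_scores) >= threshold:
            author_terms.append(term)
        else:
            text_terms.append(term)
    return author_terms, text_terms


# Hypothetical scores; real ones would come from the two lists above.
name_scores = {"charles": 0.8, "darwin": 0.4, "smith": 0.9}
word_scores = {"smith": 0.3, "understanding": 0.6, "galapagos": 0.1}

smith_probability = name_probability("Smith", name_scores, word_scores)
author_terms, text_terms = route_query(
    "Charles Darwin galapagos", name_scores, word_scores
)
```

With these made-up scores, "Smith" comes out at roughly 0.75, and the query splits into Charles and Darwin for the author field and galapagos for the title and abstract fields. Note that the example only works because galapagos is assumed to be on the word list; a term on neither list would be routed to the author field by the "probably a name" rule.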
