Can you programmatically detect pluralizations of English words, and derive the singular form?

Question

Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.

Some examples:

Examples  -> Example    a simple 's' suffix
Glitches  -> Glitch     'es' suffix, as opposed to above
Countries -> Country    'ies' suffix.
Sheep     -> Sheep      no change: possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list.

Suggestions of libraries in language x are fine, as long as they are open-source (i.e., so that someone can examine them to determine how to do it in language y).

Solution

It really depends on what you mean by 'programmatically'. Part of English works on easy-to-understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism in this pursuit than that school of thought really allows for.

A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers which work off of statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give this a shot if you have a project that requires this type of simplification of strings which represent specific terms in English.
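To make the "rules plus exceptions" idea concrete, here is a minimal Python sketch. It is not how the Morpha Lemmatizer is implemented; the `EXCEPTIONS` table, the `RULES` list, and the `singularize` function are all hypothetical names chosen for illustration, and the rule set is deliberately tiny, so many real words will still be mishandled.

```python
import re

# Hypothetical, deliberately tiny exception table: irregular or invariant
# plurals that suffix rules cannot recover. A real system would derive
# this from frequency data rather than hand-pick it.
EXCEPTIONS = {
    "sheep": "sheep",
    "children": "child",
    "mice": "mouse",
    "series": "series",
}

# Ordered (pattern, replacement) suffix rules; the first match wins.
RULES = [
    (re.compile(r"([^aeiou])ies$"), r"\1y"),    # countries -> country
    (re.compile(r"(ch|sh|ss|x|z)es$"), r"\1"),  # glitches  -> glitch
    (re.compile(r"([^s])s$"), r"\1"),           # examples  -> example
]


def singularize(word: str) -> str:
    """Return a guessed singular form of `word`, lowercased."""
    lower = word.lower()
    # Exceptions are checked first so they override the general rules.
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    for pattern, replacement in RULES:
        if pattern.search(lower):
            return pattern.sub(replacement, lower)
    # Fallback for indeterminate values: return the word unchanged.
    return lower


for w in ["Examples", "Glitches", "Countries", "Sheep"]:
    print(w, "->", singularize(w))
```

The exception table is checked before the rules, which is the part that avoids a full dictionary: only the words the rules get wrong need to be listed.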

There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.
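As a rough illustration of how naive such an approach can be, the sketch below covers only step 1a of the Porter algorithm, the step that deals with plural suffixes. The function name `porter_step_1a` is made up for this example, and lowercasing the input is a convenience added here. The full stemmer applies several more steps after this one.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemming algorithm."""
    word = word.lower()
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that the result is a stem, not necessarily a dictionary word ("ponies" becomes "poni"). That is fine for clustering related terms, but it means a stemmer alone will not always give you the singular form.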
