Can you programmatically detect pluralizations of English words, and derive the singular form?


Problem description

Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.

Some examples:

Examples  -> Example    a simple 's' suffix
Glitches  -> Glitch     'es' suffix, as opposed to above
Countries -> Country    'ies' suffix
Sheep     -> Sheep      no change: possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list.

Suggestions of libraries in language x are fine, as long as they are open-source (i.e., so that someone can examine them to determine how to do it in language y).

Solution

It really depends on what you mean by 'programmatically'. Part of English follows easy-to-understand rules, and part doesn't. The difference has mainly to do with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism in this area than that school of thought really brings to the pursuit.

A lot of English can be statistically lemmatized. By the way, stemming and lemmatization are the terms you're looking for. One of the most effective lemmatizers that works from statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give it a shot if your project requires this kind of simplification of strings that represent specific terms in English.
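To make the "rules plus exceptions" idea concrete, here is a minimal Python sketch. It is not how the Morpha Lemmatizer is implemented; the names `IRREGULAR`, `RULES` and `singularize` are made up for illustration, and the tiny exception table stands in for the much larger frequency-based list a real tool would carry.

```python
import re

# Hypothetical, hand-picked exception table (an assumption for this sketch).
# A real lemmatizer would derive a far larger list from corpus frequencies.
IRREGULAR = {
    "sheep": "sheep",
    "series": "series",
    "children": "child",
    "mice": "mouse",
    "men": "man",
}

# Ordered suffix rules: the first pattern that matches wins.
RULES = [
    (re.compile(r"ies$"), "y"),                  # countries -> country
    (re.compile(r"(ch|sh|ss|x|z)es$"), r"\1"),   # glitches  -> glitch
    (re.compile(r"ss$"), "ss"),                  # glass stays glass
    (re.compile(r"s$"), ""),                     # examples  -> example
]


def singularize(word: str) -> str:
    """Return a best-guess singular form, lowercased."""
    lower = word.lower()
    # Exceptions are checked first, then the general suffix rules.
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    for pattern, replacement in RULES:
        if pattern.search(lower):
            return pattern.sub(replacement, lower)
    # No rule applied: fall back to the input unchanged.
    return lower


if __name__ == "__main__":
    for plural in ["Examples", "Glitches", "Countries", "Sheep"]:
        print(plural, "->", singularize(plural))
```

The rules alone get words like "ties" or "buses" wrong, which is exactly why the exception table cannot be dropped entirely for English.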

There are even more naive approaches that accomplish a lot when it comes to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most related terms in English.
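As a rough illustration, the plural-handling part of the Porter algorithm (its step 1a) fits in a few lines of Python. This sketch covers only that first step, not the full stemmer, and the function name `porter_step_1a` is chosen here for illustration.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemming algorithm (plural suffixes)."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]   # caresses -> caress
    if w.endswith("ies"):
        return w[:-2]   # ponies -> poni
    if w.endswith("ss"):
        return w        # caress -> caress
    if w.endswith("s"):
        return w[:-1]   # cats -> cat
    return w


if __name__ == "__main__":
    for term in ["caresses", "ponies", "cats", "countries", "glitches", "sheep"]:
        print(term, "->", porter_step_1a(term))
```

Note that the result is a stem rather than a dictionary word: "countries" becomes "countri" and "glitches" becomes "glitche" at this stage. That is fine for clustering related terms, but it is not the same as deriving the singular form.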
