Spelling correction for data normalization in Java

Question

I am looking for a Java library to do some initial spell checking / data normalization on user-generated text content, for example the interests entered in a Facebook profile.

This text will be tokenized at some point (before or after spell correction, whatever works better) and some of it used as keys to search for (exact match). It would be nice to cut down misspellings and the like to produce more matches. It would be even better if the correction would perform well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".

I found the following Java libraries for doing spelling correction:

  1. JAZZY does not seem to be under active development. Also, the dictionary-distance based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
  2. APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)

Any suggestions are welcome!

Recommended answer

What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay "How to Write a Spelling Corrector" is a good starting point for building a fuzzy search from candidates checked against a dictionary.
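
As a rough illustration of that candidate-generation idea, here is a minimal Java sketch. It is simplified to edit distance 1 and a lowercase a-z alphabet, and it is not a port of the essay's code. The class name `EditDistanceCorrector` and the word counts used with it are assumptions made up for this example; in practice the counts would come from your own data.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any library mentioned in this thread.
class EditDistanceCorrector {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // Known word -> frequency (assumed to be counted from your own corpus).
    private final Map<String, Integer> counts = new HashMap<>();

    void add(String word, int count) {
        counts.merge(word, count, Integer::sum);
    }

    // All strings one delete, transpose, replace, or insert away from the word.
    static Set<String> edits1(String word) {
        Set<String> out = new LinkedHashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty()) {
                out.add(left + right.substring(1)); // delete
            }
            if (right.length() > 1) {
                // transpose the two characters after the split point
                out.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                if (!right.isEmpty()) {
                    out.add(left + c + right.substring(1)); // replace
                }
                out.add(left + c + right); // insert
            }
        }
        return out;
    }

    // A known word is returned unchanged; otherwise the most frequent known
    // word one edit away; otherwise the input itself.
    String correct(String word) {
        if (counts.containsKey(word)) {
            return word;
        }
        String best = word;
        int bestCount = 0;
        for (String candidate : edits1(word)) {
            int count = counts.getOrDefault(candidate, 0);
            if (count > bestCount) {
                best = candidate;
                bestCount = count;
            }
        }
        return best;
    }
}
```

With invented counts of 40 for "drinking" and 60 for "thinking", `correct("trinking")` returns "thinking": both are one edit away and the more frequent word wins. That is exactly the single-word weakness described in the question.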

Or have a look at BK-trees.
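
For readers who have not met the structure, a BK-tree can be sketched in Java roughly as follows. This is a minimal illustration using Levenshtein distance as the metric; the class name `BkTree` and its methods are assumptions for this example, not an existing library API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal BK-tree over Levenshtein distance.
class BkTree {
    private static final class Node {
        final String word;
        // Child subtrees keyed by their distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;

    // Dynamic-programming Levenshtein distance using two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) {
                return; // word is already in the tree
            }
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    // Returns every stored word within maxDistance edits of the query.
    List<String> search(String query, int maxDistance) {
        List<String> results = new ArrayList<>();
        if (root != null) {
            search(root, query, maxDistance, results);
        }
        return results;
    }

    private void search(Node node, String query, int maxDistance, List<String> results) {
        int d = distance(query, node.word);
        if (d <= maxDistance) {
            results.add(node.word);
        }
        // Triangle inequality: matches can only sit in subtrees keyed d - max .. d + max.
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            Node child = node.children.get(k);
            if (child != null) {
                search(child, query, maxDistance, results);
            }
        }
    }
}
```

The pruning step is what makes the tree useful: most subtrees are skipped without computing any distances for the words inside them.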

An n-gram index (as used by Lucene) produces better results for longer words, though it will increase your index size. Generating candidates up to a given edit distance will probably work well enough for words found in normal text, but not for names, addresses, and scientific texts.
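
To make the n-gram idea concrete, here is a small self-contained Java sketch of a character n-gram index with Jaccard overlap as the similarity score. It does not use Lucene and makes no claim about how Lucene implements its spell checker; the class `NGramIndex`, the `$` padding character, and the scoring choice are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory n-gram index; not Lucene's implementation.
class NGramIndex {
    private final int n;
    // n-gram -> words containing it (this is the extra index space you pay for).
    private final Map<String, Set<String>> postings = new HashMap<>();

    NGramIndex(int n) {
        this.n = n;
    }

    // Distinct character n-grams, padded with '$' so word boundaries form grams too.
    Set<String> grams(String word) {
        String padded = "$" + word + "$";
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    void add(String word) {
        for (String gram : grams(word)) {
            postings.computeIfAbsent(gram, key -> new HashSet<>()).add(word);
        }
    }

    // Jaccard coefficient of the two gram sets: |A ∩ B| / |A ∪ B|.
    double similarity(String a, String b) {
        Set<String> gramsA = grams(a);
        Set<String> gramsB = grams(b);
        Set<String> shared = new HashSet<>(gramsA);
        shared.retainAll(gramsB);
        int union = gramsA.size() + gramsB.size() - shared.size();
        return union == 0 ? 0.0 : (double) shared.size() / union;
    }

    // Words sharing at least one gram with the query, filtered by similarity, best first.
    List<String> lookup(String query, double minSimilarity) {
        Set<String> candidates = new HashSet<>();
        for (String gram : grams(query)) {
            candidates.addAll(postings.getOrDefault(gram, Collections.<String>emptySet()));
        }
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (similarity(query, candidate) >= minSimilarity) {
                result.add(candidate);
            }
        }
        result.sort((x, y) -> Double.compare(similarity(query, y), similarity(query, x)));
        return result;
    }
}
```

With trigrams, "trinking" shares six of ten distinct grams with "drinking" but only five of eleven with "thinking", so `lookup("trinking", 0.3)` ranks "drinking" first. The longer the word, the more grams survive a single typo, which is why this approach tends to do better on long words.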

If you have the texts indexed you have your text corpus (your dictionary). Only what is in your data can be found anyway. You need not use an external dictionary.
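
Without an index, the same "your data is your dictionary" idea can be approximated by counting tokens directly. The sketch below assumes a simple tokenizer (lowercase, split on anything that is not a letter or digit); the class name `CorpusDictionary` is hypothetical and a real tokenizer would likely need more care.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that turns raw user entries into a word -> frequency map.
class CorpusDictionary {
    static Map<String, Integer> build(List<String> entries) {
        Map<String, Integer> counts = new HashMap<>();
        for (String entry : entries) {
            // Lowercase, then split on runs of characters that are neither letters nor digits.
            for (String token : entry.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The resulting frequencies can serve as the dictionary for whichever candidate lookup you choose, so frequent spellings in your own data win over rare ones.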

A good resource is Introduction to Information Retrieval, chapter "Dictionaries and tolerant retrieval", which includes a short description of context-sensitive spelling correction.
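
One simple way to approximate context sensitivity, sketched here under stated assumptions rather than taken from the book, is to generate candidates per token and keep the combination whose adjacent word pairs are most frequent in your data. The class `PhraseCorrector` and the pair counts used with it are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical phrase-level chooser; candidate lists come from any single-word lookup.
class PhraseCorrector {
    // Counts of adjacent word pairs keyed as "first second" (assumed to come from your own data).
    private final Map<String, Integer> pairCounts = new HashMap<>();

    void addPair(String first, String second, int count) {
        pairCounts.merge(first + " " + second, count, Integer::sum);
    }

    // Sum of pair counts over adjacent words: a crude plausibility score.
    int score(List<String> words) {
        int total = 0;
        for (int i = 0; i + 1 < words.size(); i++) {
            total += pairCounts.getOrDefault(words.get(i) + " " + words.get(i + 1), 0);
        }
        return total;
    }

    // Tries every combination of per-token candidates and keeps the best-scoring phrase.
    // Exhaustive search is only reasonable for short phrases with few candidates each.
    List<String> best(List<List<String>> candidatesPerToken) {
        List<String> best = new ArrayList<>();
        int[] bestScore = {-1};
        search(candidatesPerToken, 0, new ArrayList<>(), best, bestScore);
        return best;
    }

    private void search(List<List<String>> options, int index, List<String> current,
                        List<String> best, int[] bestScore) {
        if (index == options.size()) {
            int s = score(current);
            if (s > bestScore[0]) {
                bestScore[0] = s;
                best.clear();
                best.addAll(current);
            }
            return;
        }
        for (String option : options.get(index)) {
            current.add(option);
            search(options, index + 1, current, best, bestScore);
            current.remove(current.size() - 1);
        }
    }
}
```

Given the candidates "thinking" and "drinking" for the first token and "coffee" for the second, and made-up pair counts in which "drinking coffee" is far more common than "thinking coffee", `best` returns "drinking coffee", which is the behaviour the question asks for.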
