Spelling correction for data normalization in Java

Question

I am looking for a Java library to do some initial spell checking / data normalization on user-generated text content, for example the interests entered in a Facebook profile.

This text will be tokenized at some point (before or after spell correction, whichever works better) and some of it used as keys to search for (exact match). It would be nice to cut down on misspellings and the like to produce more matches. It would be even better if the correction performed well on tokens longer than just one word, e.g. "trinking coffee" would become "drinking coffee" and not "thinking coffee".

I found the following Java libraries for doing spelling correction:

  1. JAZZY does not seem to be under active development. Also, the dictionary-distance-based approach seems inadequate because of the use of non-standard language in social network profiles and multi-word tokens.
  2. APACHE LUCENE seems to have a statistical spell checker that should be much better suited. The question here would be how to create a good dictionary. (We are not using Lucene otherwise, so there is no existing index.)

Any suggestions are welcome!

Recommended answer

What you want to implement is not a spelling corrector but a fuzzy search. Peter Norvig's essay is a good starting point for building a fuzzy search from candidates checked against a dictionary.
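A minimal sketch of the candidate-generation idea from Norvig's essay, translated to Java. The class name `EditDistanceCandidates`, the lowercase a-z alphabet, and the `counts` map (word to frequency) are assumptions made for illustration; this only looks one edit away and is not a tuned implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: generate every string within one edit of the input
// and keep those that occur in a known-word dictionary.
class EditDistanceCandidates {
    // Assumption: plain lowercase ASCII input; extend for other alphabets.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    // All strings reachable from `word` by one deletion, replacement,
    // transposition, or insertion.
    static Set<String> edits1(String word) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i <= word.length(); i++) {
            String left = word.substring(0, i);
            String right = word.substring(i);
            if (!right.isEmpty()) {
                result.add(left + right.substring(1)); // deletion
                for (char c : ALPHABET.toCharArray()) {
                    result.add(left + c + right.substring(1)); // replacement
                }
            }
            if (right.length() > 1) {
                // transposition of the two characters after the split point
                result.add(left + right.charAt(1) + right.charAt(0) + right.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                result.add(left + c + right); // insertion
            }
        }
        return result;
    }

    // Returns the word itself if it is known, otherwise the most frequent
    // known word one edit away, otherwise empty.
    static Optional<String> correct(String word, Map<String, Integer> counts) {
        if (counts.containsKey(word)) {
            return Optional.of(word);
        }
        return edits1(word).stream()
                .filter(counts::containsKey)
                .max((a, b) -> Integer.compare(counts.get(a), counts.get(b)));
    }
}
```

Note that both "drinking" and "thinking" are one edit away from "trinking", so with word frequencies alone the more frequent of the two wins regardless of the following word.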

Alternatively, have a look at BK-trees.
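For reference, a BK-tree stores each word under the edit distance to its parent, so a lookup can skip whole subtrees using the triangle inequality. The sketch below assumes Levenshtein distance as the metric; the class `BkTree` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal BK-tree over Levenshtein distance.
class BkTree {
    private static final class Node {
        final String word;
        // Children keyed by their distance to this node's word.
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    void add(String word) {
        if (root == null) {
            root = new Node(word);
            return;
        }
        Node node = root;
        while (true) {
            int d = levenshtein(word, node.word);
            if (d == 0) return; // already present
            Node child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new Node(word));
                return;
            }
            node = child;
        }
    }

    // All stored words within maxDistance edits of the query.
    List<String> search(String query, int maxDistance) {
        List<String> matches = new ArrayList<>();
        if (root != null) collect(root, query, maxDistance, matches);
        return matches;
    }

    private static void collect(Node node, String query, int max, List<String> out) {
        int d = levenshtein(query, node.word);
        if (d <= max) out.add(node.word);
        // Triangle inequality: only children whose edge label lies in
        // [d - max, d + max] can contain a match.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) collect(child, query, max, out);
        }
    }

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Unlike generating all edits of the query, this approach does not depend on the alphabet size, which may matter for user-generated text with unusual characters.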

An n-gram index (used by Lucene) produces better results for longer words. Generating candidates up to a given edit distance will probably work well enough for words found in normal text, but not for names, addresses, and scientific texts. An n-gram index will increase your index size, though.
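To illustrate the n-gram idea without depending on Lucene, here is a small in-memory sketch. It is not Lucene's implementation: the class `NGramIndex`, the `^`/`$` padding characters, and the Jaccard-overlap ranking are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical character n-gram index: each word is stored under every
// n-gram it contains, which is also why the index grows.
class NGramIndex {
    private final int n;
    private final Map<String, Set<String>> postings = new HashMap<>();

    NGramIndex(int n) { this.n = n; }

    // Distinct n-grams of the word, padded so word boundaries count too.
    static Set<String> grams(String word, int n) {
        String padded = "^" + word + "$";
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    void add(String word) {
        for (String g : grams(word, n)) {
            postings.computeIfAbsent(g, k -> new HashSet<>()).add(word);
        }
    }

    // Indexed words whose n-gram sets overlap the query's with a Jaccard
    // coefficient of at least minJaccard, best first (ties alphabetical).
    List<String> candidates(String query, double minJaccard) {
        Set<String> queryGrams = grams(query, n);
        Map<String, Integer> shared = new HashMap<>();
        for (String g : queryGrams) {
            for (String w : postings.getOrDefault(g, Collections.emptySet())) {
                shared.merge(w, 1, Integer::sum);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            int union = queryGrams.size() + grams(e.getKey(), n).size() - e.getValue();
            double jaccard = (double) e.getValue() / union;
            if (jaccard >= minJaccard) scores.put(e.getKey(), jaccard);
        }
        List<String> result = new ArrayList<>(scores.keySet());
        result.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return result;
    }
}
```

The candidates returned this way would typically still be re-ranked, for example by edit distance or frequency.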

If you have the texts indexed, you have your text corpus (your dictionary). Only what is in your data can be found anyway, so you do not need an external dictionary.
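Building such a dictionary from your own data can be as simple as counting terms. A rough sketch, assuming lowercase folding and splitting on anything that is not a letter or digit (both choices, like the name `CorpusDictionary`, are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: derive a frequency dictionary from the texts themselves.
class CorpusDictionary {
    // Counts how often each lowercased token occurs across all documents.
    static Map<String, Integer> termCounts(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Assumption: tokens are runs of Unicode letters and decimal digits.
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The resulting counts can serve both as the set of known words and as a frequency signal when ranking candidates.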

A good resource is Introduction to Information Retrieval - Dictionaries and tolerant retrieval. It includes a short description of context-sensitive spelling correction.
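One simple way to bring context into the choice, sketched for the two-word case from the question: take the candidates for each word and keep the pair that occurs most often together in your own data. This is an illustration under assumptions, not necessarily the method the book describes; the class name, the `bigramCounts` map (phrase to count), and the restriction to two words are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical context-sensitive choice among per-word candidates.
class ContextSensitiveCorrection {
    // Returns the "first second" combination with the highest bigram count,
    // or empty if either candidate list is empty.
    static Optional<String> bestPhrase(List<String> firstCandidates,
                                       List<String> secondCandidates,
                                       Map<String, Integer> bigramCounts) {
        String best = null;
        int bestCount = -1;
        for (String first : firstCandidates) {
            for (String second : secondCandidates) {
                String phrase = first + " " + second;
                int count = bigramCounts.getOrDefault(phrase, 0);
                if (count > bestCount) { // first combination wins ties
                    best = phrase;
                    bestCount = count;
                }
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With candidates "thinking" and "drinking" for the first word and "coffee" for the second, a corpus in which "drinking coffee" occurs and "thinking coffee" does not would select "drinking coffee".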
