Multilingual spell checking with language detection


Question

I'm working on spell checking mixed-language web pages, and haven't been able to find any existing research on the subject.

The aim is to detect the language of each sentence in a mixed-language web page and spell check that sentence against the matching language, all automatically. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume a web page can't contain more than 2 or 3 languages.

Trivial example (Welsh + English): http://wales.gov.uk/

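I'm currently using a mix of:

  • Character distribution (e.g. 0600-06FF = Arabic, etc.)
  • n-grams to distinguish languages that share similar characters
  • Dictionary lookup to distinguish locales, i.e. en-US vs. en-GB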
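
To make the first and third items concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the asker's actual code: the names `SCRIPT_RANGES`, `LOCALE_WORDS`, `dominant_script` and `guess_locale` are hypothetical, and the tables are deliberately tiny. Only the Arabic range comes from the question; the Cyrillic and Greek ranges are the standard Unicode blocks.

```python
import re
from collections import Counter

# Hypothetical, deliberately small table of Unicode block ranges.
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Greek": (0x0370, 0x03FF),
}

def dominant_script(sentence):
    """Return the script most letters belong to, or None if there are no letters."""
    counts = Counter()
    for ch in sentence:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and whitespace
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
        else:
            # Not in any listed block: needs n-grams or dictionaries to resolve.
            counts["Latin-or-other"] += 1
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical toy word lists; a real system would load full dictionaries.
LOCALE_WORDS = {
    "en-US": {"color", "center", "organize"},
    "en-GB": {"colour", "centre", "organise"},
}

def guess_locale(sentence):
    """Pick the locale whose word list matches more words; None on a tie."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    hits = {loc: len(words & vocab) for loc, vocab in LOCALE_WORDS.items()}
    ranked = sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, runner_up_hits) = ranked[0], ranked[1]
    return best if best_hits > runner_up_hits else None
```

The script check is cheap and settles sentences written in a distinctive alphabet; anything that falls into the "Latin-or-other" bucket still needs the n-gram step, and the dictionary lookup only makes sense once the base language is already known.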

I have working code, but I'm concerned it may be naive or needlessly reinventing the wheel. Has anyone else done this before?

Recommended answer

You can use an API (Google and Yandex) for spell checking and language detection, but I don't think this option scales very well.

Another option is to use the free Lucene tools for spell checking (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index some corpora first; Wikipedia is a good choice. Language detection can be achieved with TextCat: http://textcat.sourceforge.net/
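
For context, n-gram language guessers in the TextCat family rank the most frequent character n-grams of a text and compare that ranking against per-language profiles using an "out-of-place" distance. The sketch below is a simplified Python illustration of that idea, not TextCat's own code or API: the function names, the `top_k` cutoff, and the two one-sentence training samples are all assumptions chosen to keep the example short.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Map the top_k most frequent character n-grams (1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by frequency, breaking ties alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def detect(sentence, profiles):
    """Return the language whose profile is closest to the sentence's profile."""
    doc = ngram_profile(sentence)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical toy training samples; real profiles need far more text per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "cy": "mae'r gath yn eistedd ar y mat ac mae'r ci yn cysgu yn yr ardd",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling `detect(sentence, PROFILES)` on each sentence gives the language whose dictionary or spell-check index should be used for it. With samples this small the result on unseen sentences is unreliable; profiles built from a large corpus such as Wikipedia, as suggested for the Lucene index, would be the realistic starting point.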
