Multilingual spell checking with language detection
Question
I'm working on spell checking of mixed language webpages, and haven't been able to find any existing research on the subject.
The aim is to automatically detect the language of each sentence within a mixed-language webpage and spell check each sentence against the appropriate language. Assume that we can ignore sentences which mix multiple languages together (e.g. "He has a certain je ne sais quoi"), and assume a webpage can't contain more than 2 or 3 languages.
Trivial example (Welsh + English): http://wales.gov.uk/
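To make the goal concrete, here is a minimal sketch of the sentence-level pipeline described above: split the page text into sentences, guess each sentence's language, then check its words against that language's word list. Everything here is an assumption for illustration: the names `check_page` and `detect_by_stopwords`, the tiny stopword sets, and the toy dictionaries are hypothetical, and the stopword detector is only a stand-in for a real language detector.

```python
import re

# Naive sentence splitter: break on whitespace that follows ., ! or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A word is a run of letters, optionally with one internal apostrophe (e.g. "mae'r")
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

# Toy stopword lists (assumption): a real system would use a trained detector
STOPWORDS = {
    "en": {"the", "is", "and", "to", "of"},
    "cy": {"y", "yn", "yr", "ac", "a", "i"},
}


def detect_by_stopwords(sentence):
    """Return the language whose stopwords overlap most with the sentence."""
    words = {w.lower() for w in WORD.findall(sentence)}
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


def check_page(text, detect, dictionaries):
    """Return (sentence, language, unknown_words) for each sentence in text."""
    results = []
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        lang = detect(sentence)
        known = dictionaries.get(lang)
        if known is None:
            # No dictionary for this language: report nothing rather than guess
            results.append((sentence, lang, []))
            continue
        unknown = [w for w in WORD.findall(sentence) if w.lower() not in known]
        results.append((sentence, lang, unknown))
    return results


# Toy dictionaries (assumption): real word lists would come from e.g. Hunspell files
DICTIONARIES = {
    "en": {"the", "weather", "is", "nice"},
    "cy": {"mae'r", "tywydd", "yn", "braf"},
}

page = "The wether is nice. Mae'r tywydd yn braf."
for sentence, lang, unknown in check_page(page, detect_by_stopwords, DICTIONARIES):
    print(lang, unknown, sentence)
```

Keeping the detector as a plain callable means the character-range, n-gram, or dictionary-based detectors can be swapped in without touching the checking loop.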
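I'm currently using a hybrid of:
- character distribution (e.g. 0600-06FF = Arabic, etc.)
- n-grams to discriminate between languages with similar characters
- dictionary lookup to discriminate between locales, i.e. en-US, en-GB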
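As a rough illustration of the first and third items, the sketch below counts letters per Unicode block to pick a dominant script, and uses a tiny spelling-variant list to tell en-US from en-GB. The block ranges are the standard Unicode ones; the function names (`dominant_script`, `guess_locale`) and the word lists are hypothetical and far too small for real use.

```python
from collections import Counter

# Unicode block ranges (inclusive) for a few scripts
SCRIPT_RANGES = {
    "Arabic": (0x0600, 0x06FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Greek": (0x0370, 0x03FF),
    "Hebrew": (0x0590, 0x05FF),
}


def dominant_script(text):
    """Return the script with the most letters in text, or None if no letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
        else:
            # Latin and everything not listed above; n-grams must separate these
            counts["Other"] += 1
    return counts.most_common(1)[0][0] if counts else None


# Toy locale-specific spellings (assumption): real lists come from full dictionaries
LOCALE_WORDS = {
    "en-US": {"color", "center", "organize"},
    "en-GB": {"colour", "centre", "organise"},
}


def guess_locale(words):
    """Return the locale with the most locale-specific hits, or None if there are none."""
    hits = {
        locale: sum(w.lower() in vocab for w in words)
        for locale, vocab in LOCALE_WORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(dominant_script("مرحبا بالعالم"))
print(guess_locale("The colour of the town centre".split()))
```

Script detection only narrows the candidates: Welsh and English both fall into the "Other" bucket here, which is where the n-gram step has to take over.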
I have working code but am concerned it may be naive or needlessly re-inventing a wheel. Has anyone else done this before?
Recommended answer
You can use an API (Google and Yandex) for spell checking and language detection, but I don't think this option scales very well.
Another option is to use the free Lucene tools for spell checking (http://wiki.apache.org/lucene-java/SpellChecker), but you have to index some corpora first; Wikipedia is a good choice. Language detection can be achieved with TextCat: http://textcat.sourceforge.net/
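For a sense of what a TextCat-style detector does, here is a small sketch of the n-gram rank-profile idea from Cavnar and Trenkle, which TextCat implements: build a ranked list of the most frequent character n-grams per language, then pick the language whose ranking is closest to the input's. This is not TextCat's code or API. The function names, the `MAX_PENALTY` value, and the one-sentence training samples are assumptions; real profiles are built from much larger corpora.

```python
from collections import Counter

PROFILE_SIZE = 300   # keep only the most frequent n-grams (assumed cutoff)
MAX_PENALTY = 300    # cost of an n-gram that is missing from a language profile


def ngram_profile(text, n_max=3):
    """Map each frequent character n-gram (n = 1..n_max) to its frequency rank."""
    padded = "_" + "_".join(text.lower().split()) + "_"  # mark word boundaries
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(PROFILE_SIZE)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost MAX_PENALTY."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else MAX_PENALTY
        for gram, rank in doc_profile.items()
    )


def classify(text, profiles):
    """Return the language whose profile is nearest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Toy training samples (assumption): far too small for real use
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is on the mat with the hat",
    "cy": "mae'r ci yn y tŷ ac mae'r gath yn yr ardd gyda'r plant yn chwarae",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

print(classify("the cat is on the mat", PROFILES))
print(classify("mae'r gath yn yr ardd", PROFILES))
```

Short sentences give noisy profiles, so for sentence-level detection it may help to restrict the candidates to the 2 or 3 languages already found on the page.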