如何检测文本单词的主导语言? [英] How to detect the dominant language of a text word?

查看:30
本文介绍了如何检测文本单词的主导语言?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

它对于 string 看起来不错,但对于 word 来说它对我不起作用.当用户输入任何 3 个字符同时检查用户输入的语言时,我正在根据我的要求进行搜索.如果我认为它不应该与 detec0t 词一起使用,但我希望它应该与 Islam 词一起使用.

let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)func确定语言(对于文本:字符串){tagger.string = 文本让语言 = tagger.dominantLanguage打印(语言是\(语言!)")}//测试用例确定语言(对于:我爱伊斯兰教")//en -pass确定语言(对于:আমি ইসলাম ভালোবাসি")//bn -pass确定语言(对于:أنا أحب الإسلام")//ar -pass确定语言(对于:伊斯兰教")//和 - 失败

结果:

<块引用>

语言是 en
语言是 bn
语言是 ar
语言是 und

我错过的未知语言"

解决方案

只是因为它属于太多语言,根据一个词来猜测语言是不现实的.上下文总是有帮助的.

例如:

import NaturalLanguage让识别器 = NLanguageRecognizer()识别器.processString(伊斯兰教")print(recognizer.dominantLanguage!.rawValue)//为简洁起见强制展开

打印tr,代表土耳其语.这是一个有根据的猜测.

如果您想要其他可能的语言,您可以使用 languageHypotheses(withMaximum:):

让假设=receiver.languageHypotheses(withMaximum: 10)for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {打印(lang.rawValue,信心)}

打印什么

<块引用>

tr 0.2332388460636139//土耳其语hr 0.1371040642261505//克罗地亚语en 0.12280254065990448//英文pt 0.08051242679357529德 0.06824589520692825NL 0.05405258387327194NB 0.050924140959978104它 0.037797268480062485pl 0.03097432479262352胡 0.0288708433508873

现在您可以定义一个可接受的置信度阈值以接受该结果.

<小时>

可以在此处找到语言代码>

It's looks good for string but it's not working for me for a word. I am working with search as per as my requirement when user typing any 3 character in the meantime looking to check which language user typing. if I think it should not work with detec0t word but i expect it should be working with Islam word.

let tagger = NSLinguisticTagger(tagSchemes:[.tokenType, .language, .lexicalClass, .nameType, .lemma], options: 0)

func determineLanguage(for text: String) {
    tagger.string = text
    let language = tagger.dominantLanguage
    print("The language is \(language!)")
}


//Test case
determineLanguage(for: "I love Islam") // en -pass
determineLanguage(for: "আমি ইসলাম ভালোবাসি") // bn -pass
determineLanguage(for: "أنا أحب الإسلام") // ar -pass
determineLanguage(for: "Islam") // und - failed

Result:

The language is en
The language is bn
The language is ar
The language is und

What I missed for "Unknown language"

解决方案

Simply because it belongs to too many languages and it would be unrealistic to guess the language based on one word. The context always helps.

For example :

import NaturalLanguage

let recognizer = NLLanguageRecognizer()
recognizer.processString("Islam")
print(recognizer.dominantLanguage!.rawValue)  //Force unwrapping for brevity

prints tr, which stands for Turkish. It's an educated guess.

If you want the other languages that were also possible, you could use languageHypotheses(withMaximum:):

let hypotheses = recognizer.languageHypotheses(withMaximum: 10)

for (lang, confidence) in hypotheses.sorted(by: { $0.value > $1.value }) {
    print(lang.rawValue, confidence)
}

Which prints

tr 0.2332388460636139   //Turkish
hr 0.1371040642261505   //Croatian
en 0.12280254065990448  //English
pt 0.08051242679357529
de 0.06824589520692825
nl 0.05405258387327194
nb 0.050924140959978104
it 0.037797268480062485
pl 0.03097432479262352
hu 0.0288708433508873

Now you could define an acceptable threshold of confidence in order to accept that result.


Language codes can be found here

这篇关于如何检测文本单词的主导语言?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆