How to detect text (string) language in iOS?


Question


For instance, given the following strings:

let textEN = "The quick brown fox jumps over the lazy dog"
let textES = "El zorro marrón rápido salta sobre el perro perezoso"
let textAR = "الثعلب البني السريع يقفز فوق الكلب الكسول"
let textDE = "Der schnelle braune Fuchs springt über den faulen Hund"

I want to detect the language used in each of the declared strings.

Let's assume the signature for the implemented function is:

func detectedLangauge<T: StringProtocol>(_ forString: T) -> String?

It returns an optional String, which is nil when no language is detected.

Thus, the appropriate results would be:

let englishDetectedLangauge = detectedLangauge(textEN) // => English
let spanishDetectedLangauge = detectedLangauge(textES) // => Spanish
let arabicDetectedLangauge = detectedLangauge(textAR) // => Arabic
let germanDetectedLangauge = detectedLangauge(textDE) // => German

Is there an easy approach to achieve it?

Solution

Quick Answer:

On iOS 11 and later, you can achieve this with NSLinguisticTagger. The desired function can be implemented like this:

func detectedLangauge<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLangauge
}

This should achieve what you are asking for.
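As a side note, on iOS 12 and later the NaturalLanguage framework offers a similar entry point, NLLanguageRecognizer.dominantLanguage(for:). The following is only a minimal sketch of the same idea under that assumption; the helper name detectedLanguageName(for:displayLocale:) and its displayLocale parameter are hypothetical and are not part of the original answer.

```swift
import Foundation
import NaturalLanguage

// Hypothetical helper (the name and the displayLocale parameter are assumptions).
// Requires iOS 12 / macOS 10.14 or later.
func detectedLanguageName(for text: String, displayLocale: Locale = .current) -> String? {
    // dominantLanguage(for:) returns an optional NLLanguage whose rawValue
    // is a language tag such as "en".
    guard let language = NLLanguageRecognizer.dominantLanguage(for: text),
          language != .undetermined else {
        return nil
    }
    // Convert the language tag into a human-readable name for the given locale.
    return displayLocale.localizedString(forIdentifier: language.rawValue)
}
```

Passing an explicit displayLocale keeps the returned name predictable, since Locale.current depends on the user's settings.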



Detailed Answer:

First of all, you should be aware that what you are asking about mainly belongs to the world of Natural Language Processing (NLP).

Since NLP covers far more than text language detection, the rest of this answer does not go into NLP specifics.

Obviously, implementing such functionality is not easy, especially once you start caring about the details of the process, such as splitting the text into sentences and then into words, and after that recognizing names, punctuation, and so on. You might well think, "What a painful process! It does not even make sense to do it myself." Fortunately, iOS supports NLP (in fact, the NLP APIs are available on all Apple platforms, not only iOS), which makes what you are aiming for easy to implement. The core component that you would work with is NSLinguisticTagger:

Analyze natural language text to tag part of speech and lexical class, identify names, perform lemmatization, and determine the language and script.

NSLinguisticTagger provides a uniform interface to a variety of natural language processing functionality with support for many different languages and scripts. You can use this class to segment natural language text into paragraphs, sentences, or words, and tag information about those segments, such as part of speech, lexical class, lemma, script, and language.

As mentioned in the class documentation, the method that you are looking for, listed under the "Determining the Dominant Language and Orthography" section, is dominantLanguage(for:):

Returns the dominant language for the specified string.

.

.

Return Value

The BCP-47 tag identifying the dominant language of the string, or the tag "und" if a specific language cannot be determined.

You might notice that NSLinguisticTagger has existed since iOS 5. However, the dominantLanguage(for:) method is only supported on iOS 11 and above, because it has been developed on top of the Core ML framework:

. . .

Core ML is the foundation for domain-specific frameworks and functionality. Core ML supports Vision for image analysis, Foundation for natural language processing (for example, the NSLinguisticTagger class), and GameplayKit for evaluating learned decision trees. Core ML itself builds on top of low-level primitives like Accelerate and BNNS, as well as Metal Performance Shaders.

The value returned by calling dominantLanguage(for:) with "The quick brown fox jumps over the lazy dog":

NSLinguisticTagger.dominantLanguage(for: "The quick brown fox jumps over the lazy dog")

would be the optional string "en". However, that is not yet the desired output; the expectation is to get "English" instead. That is exactly what you get by calling the localizedString(forIdentifier:) method of the Locale structure and passing it the language code (Locale also offers a localizedString(forLanguageCode:) method):

Locale.current.localizedString(forIdentifier: "en") // English
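Note that the name returned by Locale.current depends on the user's current locale, so "English" is what a user with an English locale would see. The short sketch below illustrates this with explicitly constructed locales; the locale identifiers "en_US" and "es_ES" are chosen only for illustration.

```swift
import Foundation

let code = "en"

// Explicit locales, used here instead of Locale.current so that the
// display language does not depend on the device settings.
let englishLocale = Locale(identifier: "en_US")
let spanishLocale = Locale(identifier: "es_ES")

// The same language code, named in the display language of each locale.
let nameInEnglish = englishLocale.localizedString(forIdentifier: code)
let nameInSpanish = spanishLocale.localizedString(forIdentifier: code)

print(nameInEnglish ?? "nil", nameInSpanish ?? "nil")
```

If the language name must always be in English regardless of the device settings, using a fixed Locale instead of Locale.current is one option.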

Putting it all together:

As mentioned in the "Quick Answer" code snippet, the function would be:

func detectedLangauge<T: StringProtocol>(_ forString: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(forString)) else {
        return nil
    }

    let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)

    return detectedLangauge
}

Output:

It would be as expected:

let englishDetectedLangauge = detectedLangauge(textEN) // => English
let spanishDetectedLangauge = detectedLangauge(textES) // => Spanish
let arabicDetectedLangauge = detectedLangauge(textAR) // => Arabic
let germanDetectedLangauge = detectedLangauge(textDE) // => German

Note that:

There are still cases where no language name is obtained for a given string, like:

let textUND = "SdsOE"
let undefinedDetectedLanguage = detectedLangauge(textUND) // => Unknown language

Or it could even be nil:

let rabish = "000747322"
let rabishDetectedLanguage = detectedLangauge(rabish) // => nil

Still, I find this a reasonable result for providing useful output.
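If the caller would rather receive nil than "Unknown language", the "und" tag described in the documentation quoted earlier can be checked explicitly. This is a minimal sketch of that variant; the function name detectedLanguageOrNil is hypothetical.

```swift
import Foundation

// Hypothetical variant of the function above (the name is an assumption).
// It returns nil both when no tag is returned and when the tag is "und".
func detectedLanguageOrNil<T: StringProtocol>(_ text: T) -> String? {
    guard let languageCode = NSLinguisticTagger.dominantLanguage(for: String(text)),
          languageCode != "und" else {
        return nil
    }
    // Convert the language tag into a localized, human-readable name.
    return Locale.current.localizedString(forIdentifier: languageCode)
}
```

With this variant, a caller only has to handle a single "not detected" case.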


Furthermore:

About NSLinguisticTagger:

Although I am not going to dive deep into NSLinguisticTagger usage, I would like to note that it has a couple of really cool features beyond simply detecting the language of a given text. As a pretty simple example: using the lemma when enumerating tags is very helpful when working on information retrieval, since you would be able to match the word "driving" when given the word "drive".
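A rough sketch of that lemma idea, assuming iOS 11 or later, could look like the following. The helper name lemmas(in:) is hypothetical, and the lemmas actually returned depend on the linguistic data available on the device.

```swift
import Foundation

// Hypothetical helper (the name is an assumption): returns one entry per word,
// using the lemma when the tagger provides one and the original word otherwise.
func lemmas(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text

    // NSLinguisticTagger works with NSRange, which is measured in UTF-16 units.
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]

    var result: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        if let lemma = tag?.rawValue {
            result.append(lemma)
        } else {
            // No lemma available for this token, so keep the word as written.
            result.append((text as NSString).substring(with: tokenRange))
        }
    }
    return result
}

print(lemmas(in: "She was driving home"))
```

Comparing the lemmas of a search query with the lemmas of a document is one simple way to match different forms of the same word.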

Official Resources

Apple Video Sessions:

Also, for getting familiar with Core ML:
