How to detect the language of a document


Question



Hi,
I am using the Google API, but it does not work on a whole document. I need to scan an entire PDF or DOC file and detect whether the language of its text is English.
Please give me sample code or a link.
It's urgent.

I am working on a desktop application.

Recommended answer

Documents are not classified into English and non-English. What if a document is written in three languages and one of those is English? How would you classify such a document? You should have explained what you want.

Now, many English writers quote Latin expressions in their texts. (Other languages, too, but especially Latin.) Most such Latin phrases can be expressed in ASCII. Moreover, English text uses many symbols outside ASCII, such as the typographically correct quotation marks “ ” and the em dash (U+2014), and a lot more. Even with letters it's not so easy. Believe it or not, the "ligature ash" (http://en.wikipedia.org/wiki/%C3%86) is English! See http://en.wikipedia.org/wiki/English_alphabet.

How would you want to classify such a document? And you won't be able to analyze such a citation based on the classification of code points, as I've demonstrated above.
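To make the point about code points concrete, here is a minimal Python sketch. The function names and the three sample strings are made up for illustration; the "ASCII only means English" rule is a deliberately naive assumption, not something anyone should rely on. It gets all three samples wrong.

```python
def non_ascii_chars(text):
    """Return the set of characters in `text` that lie outside the ASCII range."""
    return {ch for ch in text if ord(ch) > 127}


def looks_english_by_code_points(text):
    """Naive rule (illustrative only): call text English if every character is ASCII."""
    return not non_ascii_chars(text)


# English prose with curly quotes (U+201C, U+201D), the ash ligature (U+00E6)
# and an em dash (U+2014).
english_typographic = "She said, \u201cSee the encyclop\u00e6dia\u201d \u2014 and left."
# A Latin quotation of the kind English writers cite; it is pure ASCII.
latin_quote = "Carpe diem, quam minimum credula postero."
# A Polish sentence ("This is a bridge.") that happens to be pure ASCII.
polish_ascii = "To jest most."

for sample in (english_typographic, latin_quote, polish_ascii):
    print(looks_english_by_code_points(sample), repr(sample))
```

The English sentence is rejected because of its typography, while the Latin and Polish sentences are accepted, so code point ranges say something about the script and nothing reliable about the language.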

There are other cases. For example, many Polish words use the same code point subset as English. There are exceptions like "Ł" or "ę". So one can find words that mean one thing in Polish and something else, maybe completely different, in English. The very same word can be Polish or English, depending on context.

Frankly, such a problem cannot be solved by purely technical means. Unlike code point ranges, the language is not marked anywhere. Language is something completely different from a writing system or a script. In HTML there is a "lang" attribute, but nobody is obliged to use it. Potentially, such a problem can only be solved by creating a powerful expert system that uses comprehensive dictionaries of many languages and sets of grammar rules. The results of the analysis can only be expressed in terms of fuzzy set theory or fuzzy logic (http://en.wikipedia.org/wiki/Fuzzy_set, http://en.wikipedia.org/wiki/Fuzzy_logic): "this text is English with 96.4% certainty". Something like that.
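As a rough illustration of what a graded, dictionary-based answer might look like, here is a small Python sketch. Everything in it is an assumption made for the example: the word list is a tiny hand-picked set of common English function words, `english_score` is a hypothetical helper, and the number it returns is a raw ratio of matching words, not a calibrated certainty. It also assumes the text has already been extracted from the PDF or DOC file, which is a separate problem.

```python
import re

# Tiny illustrative list of common English function words (an assumption;
# a serious system would need far larger dictionaries, for many languages).
ENGLISH_FUNCTION_WORDS = frozenset(
    "the of and to a in is it that was for on with as be at by this have from "
    "or an but not are they which you were her all she there would their we".split()
)


def english_score(text):
    """Return the fraction of word tokens found in ENGLISH_FUNCTION_WORDS (0.0 to 1.0)."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as "ę".
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in ENGLISH_FUNCTION_WORDS)
    return hits / len(tokens)


print(english_score("The cat is on the mat and it was happy."))  # 0.7
print(english_score("To jest most nad rzek\u0105."))              # 0.2
```

Note that the Polish sentence does not score zero: "to" is a word in both languages, which is exactly the ambiguity described above. A document mixing several languages would land somewhere in between, so any cutoff for "English" is a judgment call rather than a fact about the text.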

Interesting? Care to delve into that? Good luck then.

—SA

