Language detection with data in PostgreSQL
Problem description
I have a table in PostgreSQL with a text column. I need a library or tool that can identify the language of each text, for testing purposes.
It does not have to be PostgreSQL code, because I'm having problems installing languages; any language that can connect to the database, retrieve the texts, and identify their language is welcome.
I used Lingua::Identify, as suggested in the answers, directly in a Perl script. It worked, but the results are not precise.
The texts I want to identify come from the web and most are in Portuguese, but Lingua::Identify classifies many of them as French, Italian, or Spanish, which are similar languages.
I need something more precise.
I added the java and r tags because those are the languages I'm using in the system, so a solution using them would be easy to implement, but solutions in any language are welcome.
Try these:
- http://code.google.com/p/language-detection/ (Java)
- http://code.google.com/p/chromium-compact-language-detector/ (C++/Python)
This blog post shares some tests comparing the two libraries, along with a third: the Language Identification module of Apache Tika, which is a complete toolkit for text analysis.
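Whichever library you pick, the glue code is much the same: connect, fetch the texts, run each one through the detector, and tally the results. Here is a minimal Python sketch of that loop, not tied to any particular detector. The names `detect_rows` and `summarize`, the `docs` table, and its `id` and `body` columns are all made up for illustration, and `detect` stands for any function you write that wraps your chosen library and returns a language code. It assumes a standard Python DB-API connection, such as the one psycopg2 gives you for PostgreSQL.

```python
from collections import Counter
from typing import Callable, List, Tuple


def detect_rows(conn, query: str, detect: Callable[[str], str]) -> List[Tuple[object, str]]:
    """Run `query` (must return two columns: id, text) and label each text.

    `conn` is any DB-API 2.0 connection; `detect` is a hypothetical callable
    that wraps the language-detection library of your choice.
    """
    cur = conn.cursor()
    try:
        cur.execute(query)
        rows = cur.fetchall()
    finally:
        cur.close()
    # Skip NULL and whitespace-only texts: a detector has nothing to work with.
    return [(row_id, detect(text)) for row_id, text in rows if text and text.strip()]


def summarize(results: List[Tuple[object, str]]) -> Counter:
    """Count how many texts were assigned to each language code."""
    return Counter(lang for _, lang in results)


# Example use against PostgreSQL (connection details and table are assumptions):
#
#   import psycopg2
#   conn = psycopg2.connect(dbname="mydb", user="me")
#   results = detect_rows(conn, "SELECT id, body FROM docs", my_detector)
#   print(summarize(results).most_common())
```

Since most of your texts are expected to be Portuguese, the tally from `summarize` gives a quick way to compare libraries on the same data: the one that labels the fewest texts as French, Italian, or Spanish is probably the more precise one for your case.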