How can I detect Farsi web pages with Tika?

Problem description

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika jar files and added them to the classpath. This code does not work for Farsi, although it works for English. How can I add Farsi to Tika's LanguageIdentifier?

Solution

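Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box (see http://svn.apache.org/viewvc/tika/trunk/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties?view=markup):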

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

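In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41, which is above the certainty threshold of 0.022, so the result should not be treated as a confident match. See the source code (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.tika/tika-core/1.0/org/apache/tika/language/LanguageIdentifier.java#LanguageIdentifier.0distance) for more information on the inner workings of LanguageIdentifier.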
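To illustrate why text in an unsupported language still receives some label, here is a simplified sketch. It is not Tika's actual code: it assumes that a profile is the relative frequency of character trigrams and that profiles are compared by Euclidean distance. The class and method names are hypothetical, and the toy profiles are built from single sentences rather than a real corpus.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NgramDistanceSketch {

    // Threshold value quoted in the answer for Tika 1.0; used here only for illustration.
    static final double CERTAINTY_LIMIT = 0.022;

    /** Relative frequency of each character trigram in the text. */
    public static Map<String, Double> profile(String text) {
        Map<String, Double> freq = new HashMap<>();
        int total = Math.max(text.length() - 2, 0);
        for (int i = 0; i + 3 <= text.length(); i++) {
            freq.merge(text.substring(i, i + 3), 1.0, Double::sum);
        }
        // If the text is shorter than three characters the map is empty,
        // so no division by zero happens here.
        for (Map.Entry<String, Double> e : freq.entrySet()) {
            e.setValue(e.getValue() / total);
        }
        return freq;
    }

    /** Euclidean distance between two trigram profiles. */
    public static double distance(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double sum = 0.0;
        for (String k : keys) {
            double d = a.getOrDefault(k, 0.0) - b.getOrDefault(k, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Code of the nearest known profile, however far away that profile is. */
    public static String nearest(Map<String, Map<String, Double>> known,
                                 Map<String, Double> input) {
        String best = "unknown";
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Double>> e : known.entrySet()) {
            double d = distance(e.getValue(), input);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> known = new HashMap<>();
        known.put("en", profile("the quick brown fox jumps over the lazy dog"));
        known.put("lt", profile("labas rytas, kaip tau sekasi"));

        // The word "Farsi" in Persian script, as in the question.
        Map<String, Double> input = profile("\u0641\u0627\u0631\u0633\u06CC");

        String guess = nearest(known, input);
        double d = distance(known.get(guess), input);
        System.out.println("nearest profile: " + guess);
        System.out.println("distance: " + d);
        System.out.println("reasonably certain: " + (d < CERTAINTY_LIMIT));
    }
}
```

Because no Persian profile is among the known ones, the sketch still returns one of the Latin-script languages; only the comparison against the threshold shows that the match is poor.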

The Farsi language (Persian, ISO 639-1 2-letter code fa) is not recognized by default. If you want Tika to recognize another language, you have to create a language profile first.

For this the following steps are necessary:

  1. Find a text corpus for your language. I found the Hamshahri Collection. This should be sufficient. Download the corpus or parts of it and create a plain text file out of the XML.

  2. Create an ngram file for the language identifier. This can be done using TikaCLI:

    java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

    This will create a file called fa.ngp which contains the n-grams.

  3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles() or put a property file with the name tika.language.override.properties into the classpath. Make sure the ngram file is in the classpath as well.

If you now run Tika, it should correctly detect your language.
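Step 1 asks for a plain-text file built from the XML corpus. The following is a minimal sketch of that conversion using only the Java standard library. It assumes the corpus file is well-formed XML encoded in UTF-8 and that all text content is wanted; the real Hamshahri layout may require selecting specific elements instead. The class name CorpusToText and the default input file name corpus.xml are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CorpusToText {

    /** Returns the concatenated text content of an XML document, with the markup removed. */
    public static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML input", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default input name; the output name matches the command in step 2.
        String input = args.length > 0 ? args[0] : "corpus.xml";
        String output = args.length > 1 ? args[1] : "fa-corpus.txt";

        String xml = new String(Files.readAllBytes(Paths.get(input)), StandardCharsets.UTF_8);
        Files.write(Paths.get(output), extractText(xml).getBytes(StandardCharsets.UTF_8));
    }
}
```

This loads the whole document into memory, which is fine for a part of the corpus; for very large files a streaming parser would probably be a better fit.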

Update: Detailed the steps necessary to create a language profile.
