How can I detect Farsi web pages with Tika?

Question

I need sample code to help me detect Farsi-language web pages with the Apache Tika toolkit.

    LanguageIdentifier identifier = new LanguageIdentifier("فارسی");
    String language = identifier.getLanguage();

I have downloaded the Apache Tika jar files and added them to the classpath. This code works for English, but it gives a wrong result for Farsi. How can I add Farsi to Tika's LanguageIdentifier?

Recommended answer

Tika doesn't ship with a language profile for Farsi yet. As of version 1.0, 27 languages are supported out of the box:

languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk

In your example the input is misdetected as lt (Lithuanian) with a distance of 0.41. That is well above the certainty threshold of 0.022, so the result is not reliable. See the LanguageIdentifier source code for more information on its inner workings.
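To make the idea of a "distance" more concrete, here is a minimal sketch of comparing character trigram profiles. It is a simplified illustration under my own assumptions, not Tika's actual implementation; the class and method names (`TrigramDistanceSketch`, `profile`, `distance`) are hypothetical. It shows why an input in a language without a profile still ends up closest to some known profile, just with a large distance.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not Tika's actual implementation.
class TrigramDistanceSketch {

    // Count overlapping character trigrams in the text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            counts.merge(text.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Total number of trigrams in a profile (at least 1, to avoid dividing by zero).
    static double total(Map<String, Integer> p) {
        int sum = 0;
        for (int c : p.values()) {
            sum += c;
        }
        return Math.max(sum, 1);
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = total(a);
        double totalB = total(b);
        Set<String> grams = new HashSet<>(a.keySet());
        grams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String g : grams) {
            double diff = a.getOrDefault(g, 0) / totalA - b.getOrDefault(g, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> known = profile("the quick brown fox jumps over the lazy dog");
        // The word "Farsi" in Persian script, written with Unicode escapes.
        Map<String, Integer> input = profile("\u0641\u0627\u0631\u0633\u06CC");
        System.out.println("same text:      " + distance(known, known));
        System.out.println("unrelated text: " + distance(input, known));
    }
}
```

A detector built this way always returns the nearest profile, so a threshold on the distance (0.022 in the answer above) is what separates a confident match from a guess.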

Farsi (Persian, ISO 639-1 two-letter code fa) is not recognized by default. If you want Tika to recognize an additional language, you have to create a language profile for it first.

The following steps are necessary:

  1. Find a text corpus for your language. I found the Hamshahri Collection, which should be sufficient. Download the corpus, or part of it, and create a plain text file from the XML.

  2. Create an n-gram file for the language identifier. This can be done using TikaCLI:

         java -jar tika-app-1.0.jar --create-profile=fa -eUTF-8 fa-corpus.txt

     This creates a file called fa.ngp, which contains the n-grams.

  3. Configure Tika so that it recognizes the new language. Either do this programmatically using LanguageIdentifier.initProfiles(), or put a properties file named tika.language.override.properties on the classpath. Make sure the n-gram file is on the classpath as well.
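For step 1, turning an XML corpus into a plain text file can be sketched with the JDK's StAX parser. This is a rough sketch under assumptions: it simply keeps all character data and drops every tag, and the element names in the usage note are hypothetical, since the actual layout of the Hamshahri files is not described here. The class name `CorpusTextExtractor` is likewise made up for illustration.

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch: strips all markup and keeps only the text content.
class CorpusTextExtractor {

    // Returns the text content of the XML, one non-empty text node per line.
    static String extractText(Reader xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text chunks so words are not split across events.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(xml);
            StringBuilder out = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    String chunk = reader.getText().trim();
                    if (!chunk.isEmpty()) {
                        out.append(chunk).append('\n');
                    }
                }
            }
            reader.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse corpus XML", e);
        }
    }

    // Usage: java CorpusTextExtractor corpus.xml fa-corpus.txt
    public static void main(String[] args) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String text = extractText(in);
            Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Given a hypothetical input such as `<DOC><TITLE>a</TITLE><TEXT>b c</TEXT></DOC>`, the extractor writes the two text nodes on separate lines. The output is written as UTF-8 so that it matches the `-eUTF-8` option passed to TikaCLI in step 2.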

If you now run Tika, it should detect your language correctly.
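As a rough idea of what the override file from the last step might contain, here is a sketch of tika.language.override.properties. It assumes the file uses the same `languages=` key shown in the list above and that the override is read in place of the built-in list, which is why the existing codes are repeated. The `name.fa` entry is also an assumption about the key format, so check both against the properties file bundled with your Tika version.

```properties
# Sketch of tika.language.override.properties (key format assumed, verify against your Tika version)
languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk,fa
name.fa=Farsi
```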

Update: Added details on the steps necessary to create a language profile.
