Adding a language profile to Apache Tika


Question

Could anybody who has managed to do this please explain how it is done? :-)

Do I need to get n-gram files for the language I want to add?

Is it a matter of creating tika.language.override.properties, adding the extra language codes to it, and putting a lang-code.ngp n-gram file on the classpath? If so, where do I get that file, and why doesn't Tika support more languages if that is all it takes?

These languages are currently supported for language detection:

da,de,et,el,en,es,fi,fr,hu,is,it,lt,nl,no,pl,pt,ru,sv,th

and Tika uses the traditional n-gram notation, one n-gram and its count per line:

er_ 132232
_de 103517
en_ 82666
et_ 80661
for 65286
_fo 57945
de_ 51382
der 44049
at_ 41915
det 41381
_og 40344
_at 39482
ing 38707
den 36795
og_ 36577
_me 34924
nde 34528
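
To make the notation above concrete, here is a minimal Python sketch of how lines in that shape could be produced from plain text. It assumes that `_` marks a word boundary and that only three-character grams are counted, which is what the excerpt suggests. The tokenization rule and the function names `trigram_counts` and `format_profile` are illustrative assumptions, not the algorithm Tika or Nutch actually uses.

```python
from collections import Counter
import re


def trigram_counts(text):
    """Count character trigrams, marking word boundaries with '_'."""
    counts = Counter()
    # Assumption: lowercase the text and treat any run of letters as a word.
    # The real profiling tools may tokenize differently.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def format_profile(counts):
    """Render counts as 'ngram count' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return "\n".join(f"{gram} {n}" for gram, n in ordered)


if __name__ == "__main__":
    sample = "det er en test og det er for den"
    print(format_profile(trigram_counts(sample)))
```

A real profile would be built from a large corpus in the target language; the one-sentence sample here only shows the shape of the output.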

This language detection application currently supports the following languages, but its n-gram files are in a somewhat different format:

af bg cs de en fa fr he hr id ja ko ml ne no pl ro sk sq sw te tl uk vi zh-tw ar bn da el es fi gu hi hu it kn mk mr nl pa pt ru so sv ta th tr ur zh-cn

They use JSON notation:

{"freq":{"D":9246,"E":2445,"F":2510,"G":3299,"A":6930,"B":3706,"C":2451,"L":2519,"M":3951,"N":3334,"O":2514,"H" ....
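
For comparison, the following Python sketch reads a JSON profile of this shape and prints its `freq` map as "ngram count" lines. The sample string holds only the first few values from the excerpt above, and the helper names `load_freq` and `as_lines` are made up for illustration. It shows the difference in layout only: the two tools may count and normalize n-grams differently, so the output should not be assumed to be a valid Tika profile.

```python
import json

# Cut-down sample built from the first few values in the excerpt above.
# A real profile file has many more entries and may contain other fields.
raw = '{"freq": {"D": 9246, "E": 2445, "F": 2510}}'


def load_freq(text):
    """Return the n-gram -> count mapping stored under the 'freq' key."""
    return json.loads(text)["freq"]


def as_lines(freq):
    """Render the mapping as 'ngram count' lines, most frequent first."""
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{gram} {count}" for gram, count in ordered]


if __name__ == "__main__":
    for line in as_lines(load_freq(raw)):
        print(line)
```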

Recommended answer

It looks like, as of TIKA-490, it should be possible to add new language profiles. TIKA-546 seems to indicate that it isn't yet as easy as it could be, so in the meantime you'll need to start with Nutch's NGramProfile tool and tweak its output.

I'd suggest using the Nutch tool to generate the files, then reading the comments on TIKA-490 for details on how to use them.
