Slovenian stemmer for Sphinx


Problem description




I am looking for a stemming algorithm for the Slovenian language that I can use with Sphinx search.

What I'm trying to achieve: when searching for 'jabolka', for example, I also want results for documents containing 'jabolko', 'jabolki', 'jabolk', etc.

I found some references to a Slovenian stemmer, but I can't find anywhere to download it; it isn't even for sale anywhere.

Another option I've come across is the wordforms option in the Sphinx configuration (http://sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms), but building my own dictionary would be too difficult, so I'm wondering: are there any publicly accessible dictionaries available already?
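
For reference, a Sphinx wordforms file is plain text with one `inflected > base` mapping per line. A minimal shell sketch, assuming a hand-picked base form and list of inflected forms (the file name wordforms.txt and the word list are illustrative, not an existing dictionary), could generate such a file:

```shell
#!/bin/sh
# Sketch: write a Sphinx wordforms file mapping inflected forms to one base form.
# The lemma and forms below are illustrative examples taken from the question.
lemma="jabolko"
forms="jabolka jabolki jabolk"
out="wordforms.txt"   # hypothetical output path

: > "$out"            # create or truncate the file
for form in $forms; do
    # Each line has the shape "inflected > base".
    printf '%s > %s\n' "$form" "$lemma" >> "$out"
done

cat "$out"
```

The resulting file would then be referenced from the index definition with a `wordforms = /path/to/wordforms.txt` line. This only scales if a word list for the language is available, which is exactly the difficulty raised above.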


If no Slovenian stemmer is available, can somebody suggest another approach to achieving similar search results?

Solution

I managed to compile a Slovenian stemmer with the following steps:

1. Download http://snowball.tartarus.org/dist/snowball_code.tgz (the Snowball source code) and unpack it.
2. Download the Slovenian algorithm from http://snowball.tartarus.org/archives/snowball-discuss/0725.html and save it in the /algorithms/slovene folder of the project unpacked in step 1. The file must be named stem_ISO_8859_2.sbl.
3. The algorithm is in ISO encoding, so I converted it to UTF-8 and saved it as stem_Unicode.sbl (you have to find the UTF character codes for Slovenian special characters such as ČŠŽĆ).
4. Edit both .txt files in the /libstemmer folder and add an entry for Slovenian:

      slovene         UTF_8,ISO_8859_2        slovene,sl,slv
      

5. Edit /GNUmakefile and add slovene (once to the list of languages for UTF and once to ISO_8859_2_algorithms).
6. Go to the /libstemmer folder and run:

      ./mkmodules.pl modules.h src_c modules.txt ../mkinc.mak
      ./mkmodules.pl modules_utf8.h src_c  modules_utf8.txt ../mkinc_utf8.mak
      

   This generates the files needed for compiling later.

7. Run make (from the root of the unpacked files).
8. If there were no errors during compilation, you should have a /src_c folder containing the code for the Slovenian stemmer (next to the others):

      stem_UTF_8_slovene.c
      stem_ISO_8859_2_slovene.c
      ...
      

9. Unpack the latest Sphinx and copy all files from your Snowball project to the Sphinx /libstemmer_c folder (excluding libstemmer.o and GNUmakefile).

10. Compile Sphinx:

      touch NEWS README AUTHORS ChangeLog
      autoreconf --force --install
      ./configure --with-libstemmer
      make
      make install
      

11. If all went fine, you should have the Slovene stemmer working in Sphinx. You just have to enable it in your Sphinx index configuration (on my Debian it is in /usr/local/etc/sphinx.conf):

      charset_type = utf-8
      morphology = libstemmer_slovene
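
Step 3 mentions looking up character codes for the Slovenian special characters. One way to do that, sketched here under the assumption that `iconv` and `od` are installed and that the script itself is saved as UTF-8, is to print each character's ISO-8859-2 byte next to its Unicode code point (the helper name `char_codes` is hypothetical):

```shell
#!/bin/sh
# Sketch: show the ISO-8859-2 byte and the Unicode code point of a character.
# char_codes is a hypothetical helper; it expects one UTF-8 character.
char_codes() {
    ch="$1"
    # Single byte of the character in ISO-8859-2, as hex.
    iso=$(printf '%s' "$ch" | iconv -f UTF-8 -t ISO-8859-2 | od -An -tx1 | tr -d ' \n')
    # UTF-16BE bytes; for these characters they equal the code point.
    uni=$(printf '%s' "$ch" | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')
    printf '%s ISO-8859-2=0x%s Unicode=U+%s\n' "$ch" "$iso" "$uni"
}

for ch in Č Š Ž Ć č š ž ć; do
    char_codes "$ch"
done
```

The two columns give the values to use in stem_ISO_8859_2.sbl and stem_Unicode.sbl respectively, assuming the algorithm file refers to these characters by numeric code.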
      

Hope this helps someone. I had no prior experience with autoconf, so it took me a while to figure this out.

This Slovene stemmer is not officially released on http://snowball.tartarus.org, but from my tests it works well enough for my project.
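
Step 4 above can also be scripted. The sketch below appends the entry from step 4 to a module list only when no slovene line is present yet, so running it twice is harmless. The function name `add_slovene_entry` is hypothetical, the danish line is only sample content for the scratch file, and the real targets are assumed to be the two .txt files named in step 6 (modules.txt and modules_utf8.txt):

```shell
#!/bin/sh
# Sketch: add the slovene entry to a libstemmer module list if it is missing.
add_slovene_entry() {
    file="$1"
    entry='slovene         UTF_8,ISO_8859_2        slovene,sl,slv'
    # Skip the append when a line starting with "slovene" already exists.
    if ! grep -q '^slovene[[:space:]]' "$file"; then
        printf '%s\n' "$entry" >> "$file"
    fi
}

# Example run against a scratch file, so nothing real is modified.
scratch=$(mktemp)
printf 'danish          UTF_8,ISO_8859_1        danish,da,dan\n' > "$scratch"
add_slovene_entry "$scratch"
add_slovene_entry "$scratch"   # second call is a no-op
cat "$scratch"
```

Against a real checkout this would be called as `add_slovene_entry libstemmer/modules.txt` and `add_slovene_entry libstemmer/modules_utf8.txt` from the root of the unpacked Snowball project.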
