PostgreSQL full-text search for the Czech language (no default language configuration)


Question

I am trying to set up full-text search for the Czech language. I am a little confused, because I see cs_cz.affix and cs_cz.dict files inside the tsearch_data folder, but there is no Czech text search configuration (it's probably not shipped with Postgres).

So should I create one? Which dictionaries do I have to create and configure? Is there any support for the Czech language at all? Should I use all possible dictionaries (Synonym Dictionary, Thesaurus Dictionary, Ispell Dictionary, Snowball Dictionary)?

I am able to create a Czech configuration based on the ispell dictionary and it works fine, but I am not sure whether an ispell configuration alone is enough.
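For context, an ispell-only setup of the kind described here might look roughly like the sketch below. It only writes the SQL to a file; applying it (for example with psql -f) is left to the reader. The file names cs_cz.dict and cs_cz.affix come from the question; the helper name write_czech_fts_sql, the dictionary name czech_ispell, and the configuration name czech are assumptions made for illustration. No StopWords parameter is set, because no Czech stopword file is assumed to exist.

```shell
#!/bin/sh
# Sketch: write SQL for an ispell-based Czech text search configuration.
# write_czech_fts_sql, czech_ispell and czech are hypothetical names.
write_czech_fts_sql() {
    cat > "$1" <<'SQL'
-- Ispell dictionary backed by tsearch_data/cs_cz.dict and cs_cz.affix.
CREATE TEXT SEARCH DICTIONARY czech_ispell (
    TEMPLATE = ispell,
    DictFile = cs_cz,
    AffFile = cs_cz
);

-- Start from the built-in "simple" configuration.
CREATE TEXT SEARCH CONFIGURATION czech (COPY = simple);

-- Try ispell first; fall back to "simple" for words ispell does not know.
ALTER TEXT SEARCH CONFIGURATION czech
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH czech_ispell, simple;
SQL
}
```

Calling write_czech_fts_sql czech_fts.sql produces a file that could then be applied to a database. The fallback to the simple dictionary matters because an ispell dictionary returns nothing for words it does not recognize.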

Thanks a lot. I tried to read https://www.postgresql.org/docs/9.5/static/textsearch.html but I am still a little confused.

Recommended answer

I have never tried it, but you should be able to create a Czech Snowball stemmer, provided you are prepared to compile PostgreSQL from source. The README that accompanies the Snowball code in the PostgreSQL source tree explains:

The files under src/backend/snowball/libstemmer/ and src/include/snowball/libstemmer/ are taken directly from their libstemmer_c distribution, with only some minor adjustments of file inclusions. Note that most of these files are in fact derived files, not master source. The master sources are in the Snowball language, and are available along with the Snowball-to-C compiler from the Snowball project. We choose to include the derived files in the PostgreSQL distribution because most installations will not have the Snowball compiler available.

To update the PostgreSQL sources from a new Snowball libstemmer_c distribution:

  1. Copy the *.c files in libstemmer_c/src_c/ to src/backend/snowball/libstemmer with replacement of "../runtime/header.h" by "header.h", for example:

for f in libstemmer_c/src_c/*.c
do
    sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f`
done

(Alternatively, if you rebuild the stemmer files from the master Snowball sources, just omit "-r ../runtime" from the Snowball compiler switches.)

  2. Copy the *.c files in libstemmer_c/runtime/ to src/backend/snowball/libstemmer, and edit them to remove direct inclusions of system headers such as <stdio.h>; they should only include "header.h". (This removal avoids portability problems on some platforms where <stdio.h> is sensitive to largefile compilation options.)

  3. Copy the *.h files in libstemmer_c/src_c/ and libstemmer_c/runtime/ to src/include/snowball/libstemmer. At this writing the header files do not require any changes.

  4. Check whether any stemmer modules have been added or removed. If so, edit the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the stemmer_modules[] table in dict_snowball.c.

  5. The various stopword files in stopwords/ must be downloaded individually from pages on the snowball.tartarus.org website. Be careful that these files must be stored in UTF-8 encoding.
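The copy steps above could be scripted roughly as follows. This is only a sketch: sync_libstemmer and list_unregistered are hypothetical helper names, the directory arguments are whatever paths apply to your checkout, and deleting every line that starts with #include < is a blunt approximation of "remove direct inclusions of system headers" that should be reviewed by hand afterwards.

```shell
#!/bin/sh
# Sketch of the copy steps; helper names are hypothetical.
sync_libstemmer() {
    src=$1      # unpacked libstemmer_c directory
    dest_c=$2   # e.g. src/backend/snowball/libstemmer
    dest_h=$3   # e.g. src/include/snowball/libstemmer

    # Stemmer sources, with the runtime include path rewritten.
    for f in "$src"/src_c/*.c; do
        sed 's|\.\./runtime/header\.h|header.h|' "$f" \
            > "$dest_c/$(basename "$f")"
    done

    # Runtime sources, dropping direct system-header includes.
    for f in "$src"/runtime/*.c; do
        sed '/^#include[[:space:]]*</d' "$f" \
            > "$dest_c/$(basename "$f")"
    done

    # Headers are copied unchanged.
    cp "$src"/src_c/*.h "$src"/runtime/*.h "$dest_h"/
}

# Report stemmer modules whose object file is not mentioned in the
# Makefile, as a reminder to update OBJS and dict_snowball.c.
list_unregistered() {
    makefile=$1
    dest_c=$2
    for f in "$dest_c"/stem_*.c; do
        obj=$(basename "$f" .c).o
        grep -qF "$obj" "$makefile" || echo "not in OBJS: $obj"
    done
}
```

The second helper only checks the Makefile; the #include list and the stemmer_modules[] table in dict_snowball.c still have to be edited manually.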

There is now a Czech Snowball stemmer available here; it was contributed to the project. There is no stop word dictionary available, but I am sure you can either find one or create one yourself.
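If you do assemble your own stop word list, it has to be valid UTF-8, as the README notes. A minimal sketch of a check-and-copy step might look like this; install_stopwords is a hypothetical helper, the target file name czech.stop is an assumption that mirrors how the other languages' files are named, and the list itself (one word per line) is something you would supply.

```shell
#!/bin/sh
# Sketch: validate a stop word list as UTF-8, then copy it into place.
# install_stopwords and the name czech.stop are assumptions.
install_stopwords() {
    list=$1       # your own stop word list, one word per line
    dest_dir=$2   # e.g. src/backend/snowball/stopwords

    # iconv fails if the input is not valid UTF-8.
    if ! iconv -f UTF-8 -t UTF-8 "$list" > /dev/null; then
        echo "not valid UTF-8: $list" >&2
        return 1
    fi
    cp "$list" "$dest_dir/czech.stop"
}
```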

The real work would be to install Snowball and use the Snowball-to-C compiler to create the C and header files to add to the PostgreSQL source. These files should then remain stable, so it shouldn't be difficult to upgrade to a new PostgreSQL version.
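As a rough sketch of that compile step: the helper below invokes the Snowball compiler on one .sbl file and omits "-r ../runtime", as the README suggests. The switches -o, -eprefix and -u are written as I recall them from libstemmer's own build rules, so treat them as an assumption and check them against the help output of your Snowball build. compile_stemmer, the SNOWBALL variable, and the input name czech.sbl are hypothetical.

```shell
#!/bin/sh
# Sketch: generate stem_UTF_8_<lang>.c/.h from a Snowball source file.
# SNOWBALL and compile_stemmer are hypothetical names; the compiler
# switches are assumptions to verify against your Snowball build.
SNOWBALL=${SNOWBALL:-./snowball}

compile_stemmer() {
    sbl=$1       # Snowball source, e.g. czech.sbl
    lang=$2      # language name used in file and symbol names
    out_dir=$3   # where the generated .c and .h files should go

    "$SNOWBALL" "$sbl" \
        -o "$out_dir/stem_UTF_8_$lang" \
        -eprefix "${lang}_UTF_8_" \
        -u
}
```

A call such as compile_stemmer czech.sbl czech out would then be followed by copying the generated files into the PostgreSQL tree and registering the new module in the Makefile and dict_snowball.c.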

If you are willing to do the work, but don't want to patch PostgreSQL and build it from source every time, you could also consider submitting a patch to PostgreSQL. As long as the stemmer works fine, I don't expect that you will meet much resistance there (but the patch submission process is still tedious).
