Solr for Arabic

Problem description

I'm using Solr to index documents in three languages (Arabic, French, and English), and I have used this fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Everything works well, except for Arabic. When I search for a word like حقل, Solr doesn't find it, but when I enter the word reversed, لقح (its letters typed in left-to-right order), Solr finds the word and returns results.

Can I get results for Arabic words?

Recommended answer

I'm going to turn Daniel's clever analysis here into an answer for the record. Don't vote for this; just go find something of his to vote for :-)

There are two ways to get a directionality mismatch with RTL text: you can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from a PDF using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text, so the index was full of backwards Arabic. To fix this, he will have to find a library that extracts text from PDFs correctly.
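
To make the mismatch concrete, here is a minimal Python sketch. It is not part of the original answer and does not touch Solr or any PDF library. The helper names `to_visual_order` and `matches` are hypothetical, and the sketch assumes the simplest case: a single, purely right-to-left word whose characters a faulty extractor emits in reversed order.

```python
def to_visual_order(logical: str) -> str:
    """Simulate a faulty extractor (assumption: a pure-RTL word) that
    emits the characters in reversed order instead of reading order."""
    return logical[::-1]


def matches(query: str, indexed_term: str) -> bool:
    """Exact term comparison, standing in for an index lookup."""
    return query == indexed_term


# The word from the question in logical (reading) order: HAH, QAF, LAM.
logical_word = "\u062d\u0642\u0644"

# What ends up in the index if the extractor produced visual-order text.
indexed_term = to_visual_order(logical_word)

# A normal query arrives in logical order and misses the reversed term.
print(matches(logical_word, indexed_term))

# Typing the word backwards reproduces the indexed form, so it hits.
print(matches(to_visual_order(logical_word), indexed_term))
```

The first comparison fails and the second succeeds, which is the behaviour described in the question. It points at the indexed text, not the analyzer chain, as the thing to fix.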

Forcing Apache Tika to use the latest Apache PDFBox might help. Otherwise, his PDF may be so quirky that even the latest PDFBox can't handle it, in which case he has a hard problem.
