Solr for Arabic

Question

I'm using Solr to index documents in three languages (Arabic, French, and English), and I have used this fieldType:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Everything works, except in Arabic: when I search for a word such as حقل, Solr doesn't find it, but when I enter the word with its letters reversed, as لقح, Solr finds it and returns results.

Can I get results for Arabic words?

Recommended answer

I'm going to turn Daniel's clever analysis here into an answer for the record. Don't vote for this, just go find something of his to vote for :-)

There are two ways to get a directionality mismatch with RTL text: you can be indexing it backwards, or you can be querying it backwards. A simple HTML form querying Solr will never mess up directionality. In this case, khaled was extracting text from PDFs using a library that falls victim to the tendency of PDFs to contain 'visual-order' text rather than 'logical-order' text, so the index was full of backwards Arabic. To fix this, he will have to find a library that extracts text from PDFs in logical order.
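
To make the mismatch concrete, here is a minimal Python sketch. It is not how Solr or any PDF library works internally: the helper `to_visual_order` and the toy `index` set are hypothetical names used only for illustration, and reversing code points is a simplification that holds only for a single Arabic word with no digits or Latin text mixed in.

```python
# Logical order is the order in which the letters are typed and read.
# Escapes are used so the example does not depend on how an editor renders RTL text.
logical = "\u062d\u0642\u0644"  # the word from the question: ha, qaf, lam


def to_visual_order(word: str) -> str:
    """Simulate a visual-order extraction of one RTL word by reversing its code points.

    Hypothetical helper: a real extractor produces this effect by emitting
    glyphs left to right as they sit on the page.
    """
    return word[::-1]


# What ends up in the index when the extractor emits visual order.
extracted = to_visual_order(logical)  # lam, qaf, ha
index = {extracted}  # toy stand-in for the indexed terms


def matches(query: str) -> bool:
    """Exact term lookup, standing in for a search against the index."""
    return query in index


print(matches(logical))        # the normally typed word is not found
print(matches(logical[::-1]))  # the reversed word is found
```

The same check can be used as a diagnostic: if a word copied from the extracted text only equals the expected word after reversing it, the extraction step is producing visual-order text and the analyzer chain is not the cause.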

Forcing Apache Tika to use the latest Apache PDFBox might help. If his PDFs are so quirky that even the latest PDFBox can't handle them, he has a hard problem.
