Index PDF file content using Apache Solr

Question

I'm using Solr's PHP extension to interact with Apache Solr, and I'm indexing data from a database. I also want to index the contents of external files (such as PDF and PPTX).

The indexing logic is as follows. Suppose schema.xml defines these fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type="text_general" indexed="false" stored="true"/>

A single database entry may or may not have a file stored.

Here is my indexing code:

// $post is a stdClass object holding the database content
$doc = new SolrInputDocument();
$doc->addField('id', $post->id);
$doc->addField('name', $post->name);
....
....
$res = $client->addDocument($doc);
$client->commit();

Next, I want to add the contents of the PDF file to the same Solr document as above.

This is the curl code:

$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);

But I think I'm missing something. I read the documentation, but I can't figure out how to retrieve the contents of the file and add them to the existing Solr document in the field filecontent.

EDIT #1: If I set literal.id=xyz in the curl request, it creates a new Solr document with id=xyz. I don't want a new Solr document. I want the contents of the PDF to be indexed and stored as a field in the previously created Solr document.

$doc = new SolrInputDocument();//Solr document is created
$doc->addField('id', 98765);//The solr document created above is assigned an id=`98765`
....
....
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);

I want the above Solr document (id = 98765) to have a field in which the contents of the PDF are indexed and stored.

But the cURL request above creates another new document (with id = 1). I don't want that.

Recommended answer

Solr, together with Apache Tika, extracts the contents of rich documents and adds them to the Solr document.

From the documentation:

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

Default schema.xml:

<!-- Main body of document extracted by SolrCell.
    NOTE: This field is not indexed by default, since it is also copied to "text"
    using copyField below. This is to save space. Use this field for returning and
    highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

If you use a different field to hold the file contents, override the default mapping with fmap.content=filecontent in solrconfig.xml itself.
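As a rough illustration, the override could be placed in the defaults of the extract handler in solrconfig.xml. The handler name and class below follow the stock example configuration and may differ in your installation; filecontent is the field from the question's schema, and any other defaults you already have would stay alongside it.

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Send the body text extracted by Tika to the "filecontent" field
         instead of the default target field. -->
    <str name="fmap.content">filecontent</str>
  </lst>
</requestHandler>
```

The same parameter can also be passed per request in the query string, which avoids touching the configuration.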

The fmap.content=attr_content param overrides the default fmap.content=text causing the content to be added to the attr_content field instead.

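If you want everything indexed in a single document, pass the other fields with the literal. prefix, e.g. literal.id=1&literal.name=Name, on the extract request itself. Use the id of the document the file belongs to: assuming id is the uniqueKey, Solr replaces any existing document with the same id, so send all of that document's fields in this one request.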

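// Send the file and the document's fields in one request; literal.id must be
// the id of the document the file content belongs to.
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt($ch, CURLOPT_POST, 1);
// CURLFile replaces the '@' upload prefix, which PHP 5.5 deprecated and PHP 7 removed.
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => new CURLFile($post->filepath)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
$result = curl_exec($ch);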
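To tie this back to the question, here is a minimal sketch of how the database fields and the file might be sent together, so that only one document (id 98765 in the question) is written. The helper names buildExtractUrl and indexFileWithFields are hypothetical, as is the sample name value; the URL, the myfile upload key and the filecontent field come from the question, and the sketch assumes PHP 7 or later with the cURL extension.

```php
<?php
// Hypothetical helper: builds the /update/extract URL so that the database
// fields travel as literal.* parameters alongside the uploaded file.
function buildExtractUrl(string $baseUrl, array $fields, string $contentField): string
{
    $params = [];
    foreach ($fields as $name => $value) {
        $params['literal.' . $name] = $value;   // becomes literal.<field>=<value>
    }
    $params['fmap.content'] = $contentField;    // map Tika's "content" to our field
    $params['commit'] = 'true';
    return $baseUrl . '?' . http_build_query($params, '', '&');
}

// Hypothetical helper: posts the file to Solr and returns the raw response
// body, or false if the request fails.
function indexFileWithFields(string $url, string $filePath)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['myfile' => new CURLFile($filePath)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

// Example using the id from the question; the name value is made up.
$url = buildExtractUrl(
    'http://localhost:8010/solr/update/extract',
    ['id' => 98765, 'name' => 'Annual report'],
    'filecontent'
);
echo $url, "\n";
```

With this approach the separate addDocument() call is only needed for database entries that have no file; entries with a file would go through indexFileWithFields($url, $post->filepath) instead.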
