Index PDF file content using Apache Solr

Question

I'm using Solr's PHP extension to interact with Apache Solr. I'm indexing data from a database, and I also want to index the contents of external files (such as PDF and PPTX).

The logic for indexing is: Suppose the schema.xml has the following fields defined:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="created" type="tlong" indexed="true" stored="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="filepath" type="text_general" indexed="false" stored="true"/>
<field name="filecontent" type="text_general" indexed="false" stored="true"/>

A single database entry may or may not have a file stored.

Hence, the following is my code for indexing:

// $post is a stdClass object holding the database content
$doc = new SolrInputDocument();
$doc->addField('id', $post->id);
$doc->addField('name', $post->name);
// ... remaining fields ...
$res = $client->addDocument($doc);
$client->commit();

Next, I want to add the contents of the PDF file in the same solr document as above.

Here is the cURL code:

$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);

But, I guess I'm missing something. I read the documentation, but I cannot figure out a way of retrieving the contents of the file and then adding it to the existing solr document in the field: filecontent

EDIT #1: If I try to set literal.id=xyz in the curl request, it creates a new solr document having id=xyz. I don't want a new solr document created. I want the contents of the pdf to be indexed and stored as a field in the previously created solr document.

$doc = new SolrInputDocument();//Solr document is created
$doc->addField('id', 98765);//The solr document created above is assigned an id=`98765`
....
....
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);

I want the above solr document (id = 98765) to have a field in which the contents of the pdf get indexed & stored.

But the cURL request (as above) creates another new document (with id = 1). I don't want that.

Recommended answer

Solr, together with Apache Tika, handles extracting the contents of rich documents and adding them to the Solr document.

From the documentation:

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default map rule in the /update/extract handler in solrconfig.xml and can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:

Default schema.xml :-

<!-- Main body of document extracted by SolrCell.
    NOTE: This field is not indexed by default, since it is also copied to "text"
    using copyField below. This is to save space. Use this field for returning and
    highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

If you define a different field for holding the file contents, override the default with fmap.content=filecontent in solrconfig.xml itself.
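
A minimal sketch of that override, assuming the stock /update/extract handler definition from the example solrconfig.xml. Only the fmap.content line is the point here; the handler attributes shown are taken from the example configuration and may differ in your installation, and any other defaults you already have stay as they are:

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler"
                startup="lazy">
  <lst name="defaults">
    <!-- Send Tika's extracted "content" to the schema field "filecontent"
         (the field name comes from the question's schema.xml) -->
    <str name="fmap.content">filecontent</str>
  </lst>
</requestHandler>
```

Note that the question's schema declares filecontent with indexed="false", so the text would be stored and returned but not searchable unless the field is made indexed or copied to an indexed field.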

The fmap.content=attr_content param overrides the default fmap.content=text causing the content to be added to the attr_content field instead.

If you want to index it in a single document, use the literal prefix to pass the other attributes, e.g. literal.id=1&literal.name=Name:

$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath));
$result= curl_exec ($ch);
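
The request above can be generalised so that the database fields and the file content land in one document. The following is only a sketch under stated assumptions: buildExtractUrl and extractFileIntoDocument are hypothetical helper names, the field names come from the question's schema, and the id 98765 is the one from the question. Because Solr replaces any existing document that has the same unique key, every database field has to travel as a literal.* parameter in this single request, rather than being added through a separate addDocument() call.

```php
<?php
// Hypothetical helper: builds the /update/extract URL so that the database
// fields and the extracted file content are indexed as one Solr document.
function buildExtractUrl(string $baseUrl, array $fields, string $contentField): string
{
    $params = [];
    foreach ($fields as $name => $value) {
        // Each database field is passed as a literal.<fieldname> parameter.
        $params['literal.' . $name] = $value;
    }
    // Route Tika's extracted "content" into the schema field for file content.
    $params['fmap.content'] = $contentField;
    $params['commit'] = 'true';

    return rtrim($baseUrl, '?') . '?' . http_build_query($params, '', '&');
}

// Hypothetical helper: posts the file to Solr and returns the raw response
// body, or false on failure.
function extractFileIntoDocument(string $url, string $filePath)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // CURLFile replaces the '@path' upload syntax, which PHP 7 no longer supports.
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['myfile' => new CURLFile($filePath)]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

// Example values taken from the question; adjust host, port, and fields as needed.
$url = buildExtractUrl(
    'http://localhost:8010/solr/update/extract',
    ['id' => 98765, 'name' => 'Name'],
    'filecontent'
);
echo $url, PHP_EOL;
```

With this in place, a call such as extractFileIntoDocument($url, $post->filepath) would take the place of the separate addDocument() step for entries that have a file, while entries without a file can still be indexed through SolrInputDocument as before.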
