Tika Parser: Exclude PDF Attachments

Question

A PDF document has attachments (here: joboptions files) that should not be extracted by Tika, and their contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

Recommended answer

Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that embedded document should be parsed.

Example document selector:

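import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  // Return false to skip an embedded document, true to parse it.
  @Override
  public boolean select(Metadata metadata) {
    // File name of the embedded document; null if none was recorded.
    // (Tika 1.x constant; Tika 2.x exposes it as TikaCoreProperties.RESOURCE_NAME_KEY.)
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}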

Register it on the ParseContext:

parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
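
The question also asks about excluding "certain (or all)" attachments. One possible way to generalize the selector above is to pull the name check into a small plain-Java helper that takes a set of excluded suffixes. The sketch below is only an illustration: AttachmentNameFilter and shouldParse are hypothetical names, not part of Tika, and the case-insensitive comparison is an assumption that goes beyond the original answer.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Tika): decides from the file name
 * whether an embedded document should be parsed.
 */
final class AttachmentNameFilter {

  private final Set<String> excludedSuffixes;

  /** @param excludedSuffixes lower-case suffixes such as ".joboptions" */
  AttachmentNameFilter(Set<String> excludedSuffixes) {
    this.excludedSuffixes = Set.copyOf(excludedSuffixes);
  }

  /** Returns true when the embedded document should be parsed. */
  boolean shouldParse(String resourceName) {
    // Embedded documents without a name are kept, as in the selector above.
    if (resourceName == null) {
      return true;
    }
    // Assumption: compare case-insensitively so ".JOBOPTIONS" is also skipped.
    String lower = resourceName.toLowerCase(Locale.ROOT);
    return excludedSuffixes.stream().noneMatch(lower::endsWith);
  }
}
```

Inside CustomDocumentSelector.select, the return statement could then delegate to such a helper, passing the resource name read from the metadata. To skip every embedded document rather than only some, select could presumably just return false unconditionally.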
