Apache Tika Server - Request Header Parameters?


Question

The Apache Tika Server provides a REST API for extracting text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, for example:

$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"

Across many different documents about Tika, I found these additional header parameters documented:

X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only  |  ocr_and_text_extraction
X-Tika-OCRoutputType: hocr

However, there seems to be no documentation on how to use the X-Tika-... header parameters, or on which parameters are supported and which are not.

For example, I wonder whether it is possible to override the image type or the DPI with something like:

X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100

My question is: which header parameters are supported, and what naming convention do they follow?

Recommended answer

The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.

The header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.
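To make the idea of suffix-to-setter mapping concrete, here is a toy model in Python. It is not Tika's code: ToyPdfConfig and apply_headers are hypothetical stand-ins, and the real implementation is Java and also has to convert values to the setter's parameter type. The sketch only assumes that the text after the prefix is turned into a setter name and that the setter is looked up by name.

```python
class ToyPdfConfig:
    """Hypothetical stand-in for a config object; NOT the real PDFParserConfig."""

    def __init__(self):
        self.extract_inline_images = False
        self.ocr_strategy = None

    def setExtractInlineImages(self, value):
        # Header values arrive as strings, so parse the boolean by hand.
        self.extract_inline_images = value.lower() == "true"

    def setOcrStrategy(self, value):
        self.ocr_strategy = value


def apply_headers(headers, prefix, config):
    """Call config.set<Suffix>(value) for each header whose name starts with prefix.

    Returns the names of the setters that were called, in header order.
    """
    applied = []
    for name, value in headers.items():
        if not name.startswith(prefix):
            continue  # unrelated header such as Accept
        suffix = name[len(prefix):]
        if not suffix:
            continue  # the bare prefix names no option
        # Build the setter name from the suffix: "ocrStrategy" -> "setOcrStrategy".
        setter_name = "set" + suffix[0].upper() + suffix[1:]
        setter = getattr(config, setter_name, None)  # lookup by name ("reflection")
        if not callable(setter):
            raise ValueError(f"no setter {setter_name} for header {name}")
        setter(value.strip())
        applied.append(setter_name)
    return applied


if __name__ == "__main__":
    cfg = ToyPdfConfig()
    request_headers = {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFOcrStrategy": "ocr_only",
        "Accept": "text/plain",
    }
    print(apply_headers(request_headers, "X-Tika-PDF", cfg))
    print(cfg.extract_inline_images, cfg.ocr_strategy)
```

In this toy model a header whose suffix matches no setter raises an error; how the real server reacts to an unknown option is something to check in the source of TikaResource.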

So, to see which X-Tika headers you can set, look up the options on the config class you want to adjust (TesseractOCRConfig or PDFParserConfig), build the header name from the option, and then set the header. If you are not sure what an option does or which values it takes, check the JavaDocs for the underlying setter method that will be called.

For example, setExtractInlineImages on the PDF config maps to X-Tika-PDFextractInlineImages.
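The name-building step can be sketched as a small Python helper. The helper name header_for is hypothetical, and the rule it applies (drop "set", lower-case the first letter, prepend the prefix) is inferred only from the single setExtractInlineImages example. The headers quoted in the question use both upper and lower case for the first letter after the prefix, so treat the casing here as one observed form and not as a requirement.

```python
def header_for(prefix, setter_name):
    """Build an X-Tika header name from a config prefix and a setter name.

    prefix is "PDF" or "OCR"; setter_name is a setter found in the JavaDocs
    of the matching config class, e.g. "setExtractInlineImages".
    """
    if not setter_name.startswith("set") or len(setter_name) <= 3:
        raise ValueError(f"not a setter name: {setter_name!r}")
    option = setter_name[3:]                    # "ExtractInlineImages"
    option = option[0].lower() + option[1:]     # "extractInlineImages"
    return f"X-Tika-{prefix}{option}"


if __name__ == "__main__":
    name = header_for("PDF", "setExtractInlineImages")
    print(name)
    # Print a matching curl call in the style used in the question.
    print(f'curl -T test/Dokument01.pdf http://localhost:9998/tika --header "{name}: true"')
```

Before using a header built this way, confirm in the JavaDocs that the setter actually exists on the config class and which values it accepts.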
