Is it possible to extract text by page for Word/PDF files using Apache Tika?
Question
All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?
Recommended answer
Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after it ends. You can easily set up a page count in your handler like this (counting pages using only <p>):
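import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public abstract class MyContentHandler implements ContentHandler {
    // Element whose start/end events are treated as page boundaries
    private String pageTag = "p";
    protected int pageNumber = 0;
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (pageTag.equals(qName)) {
            startPage();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (pageTag.equals(qName)) {
            endPage();
        }
    }

    // Called when a page starts; subclasses can override to reset per-page state
    protected void startPage() throws SAXException {
        pageNumber++;
    }

    // Called when a page ends; a no-op hook for subclasses
    protected void endPage() throws SAXException {
    }
    ...
}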
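To go from counting pages to collecting each page's text, the same start/end hooks can be combined with a characters() buffer. The following is a minimal sketch, not the answer's original code. It assumes each page is wrapped in <div class="page">, which is what more recent Tika PDF output appears to use; if your version marks pages differently (for example with the <p> convention described above), the element test would need to change. The class name PageTextHandler, the helper pagesOf, and the sample markup are all hypothetical, and the sample is fed through the JDK SAX parser only so that the sketch runs without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of each page into a list.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    // <div> nesting depth inside the current page; 0 means "not inside a page"
    private int divDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (divDepth > 0) {
            divDepth++;                       // nested <div> inside a page
        } else if ("page".equals(atts.getValue("class"))) {
            divDepth = 1;                     // assumed page marker: <div class="page">
            current.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (divDepth == 0) {
            return;
        }
        if ("p".equals(qName)) {
            current.append('\n');             // keep paragraphs on separate lines
        } else if ("div".equals(qName) && --divDepth == 0) {
            pages.add(current.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (divDepth > 0) {
            current.append(ch, start, length);
        }
    }

    public List<String> getPages() {
        return pages;
    }

    // Drives the handler with the JDK SAX parser over an XHTML string (stand-in for Tika).
    static List<String> pagesOf(String xhtml) throws Exception {
        PageTextHandler handler = new PageTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getPages();
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample imitating the assumed page markup
        String sample = "<html><body>"
                + "<div class=\"page\"><p>First page text</p></div>"
                + "<div class=\"page\"><p>Second page</p><p>more text</p></div>"
                + "</body></html>";
        List<String> pages = pagesOf(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With Tika available, the same handler instance could presumably be passed as the ContentHandler argument of AutoDetectParser.parse(stream, handler, metadata, context) in place of the JDK parser used here, and getPages() read afterwards.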
When doing this with PDF, you may run into a problem where the parser doesn't send text lines in the proper order. See "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" for how to handle this.