How to remove headers and footers from a PDF file using iText in Java

Question

I am using the iText PDF library to convert a PDF to text.

Below is my Java code for converting a PDF to a text file.

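import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {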

/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";

/**
 * Parses a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));

    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
        System.out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
    reader.close();
}

/**
 * Main method.
 * @param    args    no arguments needed
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}

The code above works for extracting the text of the PDF. However, my requirement is to ignore the headers and footers and extract only the body content from the PDF file.

Recommended answer

Since your PDF has headers and footers, they may be marked as artifacts. If they are not, they are just ordinary text or content placed at the position of a header or footer. If they are marked as artifacts, you can use the ParseTaggedPdf example to extract the content. If ParseTaggedPdf doesn't work, you can use ExtractPageContentArea instead, which limits extraction to a region of the page. Have a look at the examples related to these two.
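To make the region idea concrete, here is a minimal sketch, assuming the headers and footers sit in fixed-height bands at the top and bottom of every page and that the page's lower-left corner is at (0, 0). The class ContentArea, the method bodyRegion, and the band heights are hypothetical names and values chosen for illustration, not part of iText; measure the real bands in your own file. The code only does the arithmetic in PDF user space (units in points, y growing upward) and returns the corners in the order lower-left x, lower-left y, upper-right x, upper-right y.

```java
import java.util.Arrays;

/** Hypothetical helper: computes the page region between a header band and a footer band. */
final class ContentArea {

    private ContentArea() { }

    /**
     * Returns {llx, lly, urx, ury} in PDF points for the body of a page
     * whose lower-left corner is assumed to be at (0, 0).
     */
    static float[] bodyRegion(float pageWidth, float pageHeight,
                              float headerHeight, float footerHeight) {
        if (pageWidth <= 0 || pageHeight <= 0 || headerHeight < 0 || footerHeight < 0) {
            throw new IllegalArgumentException("Page size must be positive, bands non-negative");
        }
        if (headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException("Header and footer leave no room for content");
        }
        // PDF y coordinates grow upward, so the footer band is at the bottom (low y)
        // and the header band is at the top (high y).
        return new float[] {0f, footerHeight, pageWidth, pageHeight - headerHeight};
    }

    public static void main(String[] args) {
        // A4 page (595 x 842 pt) with an assumed 60 pt header and 50 pt footer.
        float[] area = bodyRegion(595f, 842f, 60f, 50f);
        System.out.println(Arrays.toString(area)); // prints [0.0, 50.0, 595.0, 782.0]
    }
}
```

In iText 5 the four values could be passed to the com.itextpdf.text.Rectangle constructor and wrapped in a RegionTextRenderFilter, so that only text inside the rectangle reaches the extraction strategy. If pages differ in size or the bands vary in height, the region has to be computed per page.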

The solution above is general and depends on the file. If you really need an alternative, you can use Apache APIs such as PDFBox or Tika, or other libraries such as PDFTextStream. Note that the following suggestion won't help if you have to stick with iText and can't move to another library. In PDFBox you can use PDFTextStripperByArea or PDFTextStripper. Look at the JavaDoc or some examples if you need to know how to use them.
