PDF text extraction using iText


Question

We are doing research in information extraction, and we would like to use iText.

We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.

Recommended answer

As Theodore said, you can extract text from a PDF, and as Chris pointed out, this works "as long as it is actually text (not outlines or bitmaps)".

The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.

But you can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279

You can also parse the PDF to create a plain text file. Here is a code example:

/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to: http://itextpdf.com/examples/
 * This example only works with the AGPL version of iText.
 */

package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            out.println(strategy.getResultantText());
        }
        reader.close();
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}

Note the license:

This example only works with the AGPL version of iText.

If you look at the other examples, they show how to leave out parts of the text or how to extract only parts of the PDF.
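Since the question asks about getting the text per line, here is a minimal sketch of one way to post-process the String returned by getResultantText() in the example above. It assumes the extracted text contains line breaks wherever the extraction strategy detected a new line, which depends on the PDF and on the strategy used. The class PageLines and its method split are hypothetical helpers written for illustration; they are not part of iText and use only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of iText): splits extracted page text into lines. */
final class PageLines {

    private PageLines() {
    }

    /**
     * Splits text, such as the String returned by getResultantText(),
     * into trimmed, non-empty lines.
     */
    static List<String> split(String pageText) {
        List<String> lines = new ArrayList<>();
        if (pageText == null) {
            return lines;
        }
        // "\\R" matches any line break sequence (\n, \r\n, \r, ...); requires Java 8+.
        for (String line : pageText.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }
}
```

Inside the page loop of parsePdf, one could call PageLines.split(strategy.getResultantText()) and handle each returned line separately instead of printing the whole page at once. Whether those lines match the visual lines on the page is not guaranteed; it depends on how the PDF stores its text.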

Hope that helps.
