Get the exact string position in PDF

Question

I tried to read a stream, hoping to get the exact position (coordinates) of each string:

    int size = reader.getXrefSize();

    for (int i = 0; i < size; ++i)
    {
        PdfObject pdfObject = reader.getPdfObject(i);
        if ((pdfObject == null) || !pdfObject.isStream())
            continue;

        PdfStream stream = (PdfStream) pdfObject;
        PdfObject obj = stream.get(PdfName.FILTER);

        if ((obj != null) && obj.toString().equals(PdfName.FLATEDECODE.toString()))
        {
            byte[] codedText = PdfReader.getStreamBytesRaw((PRStream) stream);
            byte[] text = PdfReader.FlateDecode(codedText);
            FileOutputStream o = new FileOutputStream(new File("/home..../Text" + i + ".txt"));
            o.write(text);
            o.flush();
            o.close();
        }

    }

This did give me positions, like this:

......
BT                  
70.9 800.9 Td /F1 14 Tf <01> Tj 
10.1 0 Td <02> Tj               
9.3 0 Td <03> Tj
3.9 0 Td <01> Tj
10.1 0 Td <0405> Tj
18.7 0 Td <060607> Tj
21 0 Td <08090A07> Tj
24.9 0 Td <05> Tj
10.1 0 Td <0B0C0D> Tj
28.8 0 Td <0E> Tj
3.8 0 Td <0F> Tj
8.6 0 Td <090B1007> Tj
29.5 0 Td <0B11> Tj
16.4 0 Td <12> Tj
7.8 0 Td <1307> Tj
12.4 0 Td <14> Tj
7.8 0 Td <07> Tj
3.9 0 Td <15> Tj
7.8 0 Td <16> Tj
7.8 0 Td <07> Tj
3.9 0 Td <17> Tj
10.8 0 Td <0D> Tj
7.8 0 Td <18> Tj
10.9 0 Td <19> Tj
ET
.....

But I don't know which string belongs to which position. On the other hand, in iText I could just get the plain text with

PdfReader reader = new PdfReader(new FileInputStream("/home/....xxx.pdf"));
PdfTextExtractor extract = new PdfTextExtractor(reader);

but of course without any positions at all.

So how can I get the exact position of each piece of text (string, char, ...)?

Recommended answer

As plinth and David van Driessche already pointed out in their answers, text extraction from PDF files is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but this class is essentially a convenience utility for using the parser functionality of iText if you're only interested in the plain text of a page. In your case, you have to look at the classes in that package more closely.
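To get a feeling for what the parser has to do, consider the Td and Tj operators in the stream dump from the question. Each Td moves the start of the text line by an offset relative to the previous line start, so the offsets have to be accumulated. The following is only a minimal sketch of that bookkeeping in plain Java, without iText. The class name TdPositions and its Placed record are made up for illustration, and the sketch assumes a single BT ... ET block that uses nothing but Td and hex-string Tj, with an identity text matrix and no page transformation. Real content streams use many more operators, and the hex codes are glyph codes that still need the font's encoding to become characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: accumulates Td offsets to find where each Tj string starts. */
final class TdPositions {

    /** One shown string: its hex glyph codes and the start position in text space. */
    record Placed(String hexCodes, double x, double y) {}

    // Matches either "<tx> <ty> Td" (groups 1 and 2) or "<hex> Tj" (group 3).
    private static final Pattern OP = Pattern.compile(
            "(-?\\d+(?:\\.\\d+)?)\\s+(-?\\d+(?:\\.\\d+)?)\\s+Td|<([0-9A-Fa-f]+)>\\s*Tj");

    static List<Placed> parse(String content) {
        List<Placed> result = new ArrayList<>();
        // Line start; assumes one text block, so it begins at the origin.
        double x = 0;
        double y = 0;
        Matcher m = OP.matcher(content);
        while (m.find()) {
            if (m.group(3) == null) {
                // Td: the operands are offsets relative to the previous line start.
                x += Double.parseDouble(m.group(1));
                y += Double.parseDouble(m.group(2));
            } else {
                // Tj: the string is shown starting at the current line start.
                result.add(new Placed(m.group(3), x, y));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The first three operations of the stream shown in the question.
        String content = "BT\n70.9 800.9 Td /F1 14 Tf <01> Tj\n"
                + "10.1 0 Td <02> Tj\n9.3 0 Td <03> Tj\nET";
        for (Placed p : parse(content)) {
            System.out.printf(Locale.ROOT, "<%s> at (%.1f, %.1f)%n",
                    p.hexCodes(), p.x(), p.y());
        }
    }
}
```

Under these assumptions the second string of the question's stream starts at x = 70.9 + 10.1 = 81.0 and the third at x = 81.0 + 9.3 = 90.3, both at y = 800.9. Everything the sketch ignores (other positioning operators, transformation matrices, font encodings) is what the iText parser classes take care of.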

A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action (2nd Edition), especially the method extractText of the sample ParsingHelloWorld.java:

public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}

with the RenderListener implementation MyTextRenderListener.java:

public class MyTextRenderListener implements RenderListener
{
    [...]

    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}

While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers much more information:

public LineSegment getBaseline();    // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine();  // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise();              // the rise, which represents how far above the nominal baseline the text should be rendered

public String getText();             // the text to render
public int getTextRenderMode();      // the text render mode
public DocumentFont getFont();       // the font
public float getSingleSpaceWidth();  // the width, in user space units, of a single space character in the current font

public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation

Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.
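As an illustration of how those lines turn into coordinates: each of them is a line segment with a start point and an end point, and in iText the points can be read with getStartPoint() and getEndPoint(), with the coordinates available as get(Vector.I1) and get(Vector.I2). The sketch below shows one possible way to combine the descent line and the ascent line into an axis-aligned box. To stay runnable without iText on the classpath it uses the made-up stand-in types Point and Segment in place of iText's Vector and LineSegment. The class name TextBoxSketch, the Box record, and the sample coordinates are assumptions for illustration as well.

```java
/** Hypothetical sketch: an axis-aligned box around a text chunk from its descent and ascent lines. */
final class TextBoxSketch {

    /** Stand-in for iText's Vector: a point in user space. */
    record Point(float x, float y) {}

    /** Stand-in for iText's LineSegment: a line from start to end. */
    record Segment(Point start, Point end) {}

    /** Axis-aligned box given by its lower-left and upper-right corners. */
    record Box(float llx, float lly, float urx, float ury) {}

    static Box boundingBox(Segment descent, Segment ascent) {
        // The four end points are the corners of the (possibly rotated) text area;
        // taking minima and maxima over all of them also covers rotated text.
        Point[] corners = {descent.start(), descent.end(), ascent.start(), ascent.end()};
        float llx = Float.MAX_VALUE;
        float lly = Float.MAX_VALUE;
        float urx = -Float.MAX_VALUE;
        float ury = -Float.MAX_VALUE;
        for (Point p : corners) {
            llx = Math.min(llx, p.x());
            lly = Math.min(lly, p.y());
            urx = Math.max(urx, p.x());
            ury = Math.max(ury, p.y());
        }
        return new Box(llx, lly, urx, ury);
    }

    public static void main(String[] args) {
        // Made-up sample values for one short horizontal chunk.
        Segment descent = new Segment(new Point(70.9f, 797.0f), new Point(81.0f, 797.0f));
        Segment ascent = new Segment(new Point(70.9f, 811.0f), new Point(81.0f, 811.0f));
        System.out.println(boundingBox(descent, ascent));
    }
}
```

Inside a real renderText(TextRenderInfo) implementation, the two segments would come from renderInfo.getDescentLine() and renderInfo.getAscentLine(). For per-glyph boxes, the same computation could be applied to each element of getCharacterRenderInfos().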

PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:

PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);
