Get the exact string position in PDF


Question

I tried to read a stream, hoping to get the exact position (coordinates) of each string:

    int size = reader.getXrefSize();

    for (int i = 0; i < size; ++i)
    {
        PdfObject pdfObject = reader.getPdfObject(i);
        if ((pdfObject == null) || !pdfObject.isStream())
            continue;

        PdfStream stream = (PdfStream) pdfObject;
        PdfObject obj = stream.get(PdfName.FILTER);

        if ((obj != null) && obj.toString().equals(PdfName.FLATEDECODE.toString()))
        {
            byte[] codedText = PdfReader.getStreamBytesRaw((PRStream) stream);
            byte[] text = PdfReader.FlateDecode(codedText);
            FileOutputStream o = new FileOutputStream(new File("/home..../Text" + i + ".txt"));
            o.write(text);
            o.flush();
            o.close();
        }

    }

What I actually get are positions like this:

......
BT                  
70.9 800.9 Td /F1 14 Tf <01> Tj 
10.1 0 Td <02> Tj               
9.3 0 Td <03> Tj
3.9 0 Td <01> Tj
10.1 0 Td <0405> Tj
18.7 0 Td <060607> Tj
21 0 Td <08090A07> Tj
24.9 0 Td <05> Tj
10.1 0 Td <0B0C0D> Tj
28.8 0 Td <0E> Tj
3.8 0 Td <0F> Tj
8.6 0 Td <090B1007> Tj
29.5 0 Td <0B11> Tj
16.4 0 Td <12> Tj
7.8 0 Td <1307> Tj
12.4 0 Td <14> Tj
7.8 0 Td <07> Tj
3.9 0 Td <15> Tj
7.8 0 Td <16> Tj
7.8 0 Td <07> Tj
3.9 0 Td <17> Tj
10.8 0 Td <0D> Tj
7.8 0 Td <18> Tj
10.9 0 Td <19> Tj
ET
.....

But I don't know which string belongs to which position. On the other hand, in iText I can just get the plain text with

PdfReader reader = new PdfReader(new FileInputStream("/home/....xxx.pdf"));
PdfTextExtractor extract = new PdfTextExtractor(reader);

but of course without any position at all.

So how can I get the exact position of each piece of text (string, char, ...)?

Recommended answer

As plinth and David van Driessche already pointed out in their answers, text extraction from PDF files is non-trivial. Fortunately, the classes in the parser package of iText do most of the heavy lifting for you. You have already found at least one class from that package, PdfTextExtractor, but this class is essentially a convenience utility for using the parser functionality of iText if you're only interested in the plain text of the page. In your case you have to look at the classes in that package more closely.
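To illustrate why the raw operators in the question are awkward to use directly: the operands of each Td are an offset from the start of the previous line, not an absolute position, so coordinates only emerge by accumulating them. The following is a minimal sketch that does not use iText. The class name TdPositionSketch and the method positions are hypothetical, and it assumes a whitespace-separated fragment containing only BT, Td, Tf, Tj and ET with hex strings, as in the question. It ignores Tm, the current transformation matrix, literal strings containing spaces, and glyph widths, and the results are in text space rather than page coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of iText: accumulates Td offsets per Tj.
public class TdPositionSketch {

    /**
     * Returns one entry per Tj operator, e.g. "<01> @ (70.9, 800.9)".
     * Assumption: tokens are separated by whitespace and only the
     * operators BT, Td, Tf, Tj and ET occur.
     */
    public static List<String> positions(String content) {
        List<String> result = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        double x = 0; // start of the current line, text space
        double y = 0;
        for (String token : content.trim().split("\\s+")) {
            switch (token) {
                case "BT": // a new text object resets the line start
                    x = 0;
                    y = 0;
                    operands.clear();
                    break;
                case "Td": { // tx ty Td: move relative to the current line start
                    int n = operands.size();
                    x += Double.parseDouble(operands.get(n - 2));
                    y += Double.parseDouble(operands.get(n - 1));
                    operands.clear();
                    break;
                }
                case "Tj": // string Tj: show the string at the current position
                    result.add(String.format(Locale.ROOT, "%s @ (%.1f, %.1f)",
                            operands.get(operands.size() - 1), x, y));
                    operands.clear();
                    break;
                case "Tf": // font selection does not move the position
                case "ET":
                    operands.clear();
                    break;
                default: // anything else is collected as an operand
                    operands.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String fragment = "BT 70.9 800.9 Td /F1 14 Tf <01> Tj "
                + "10.1 0 Td <02> Tj 9.3 0 Td <03> Tj ET";
        for (String entry : positions(fragment)) {
            System.out.println(entry);
        }
    }
}
```

Even with the positions accumulated, strings such as <01> are font-specific character codes that still have to be mapped to text through the font's encoding, which is one more reason to let a parser do this work.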

A starting point for information on text extraction with iText is section 15.3, Parsing PDFs, of iText in Action, 2nd Edition, especially the method extractText of the sample ParsingHelloWorld.java:

public void extractText(String src, String dest) throws IOException
{
    PrintWriter out = new PrintWriter(new FileOutputStream(dest));
    PdfReader reader = new PdfReader(src);
    RenderListener listener = new MyTextRenderListener(out);
    PdfContentStreamProcessor processor = new PdfContentStreamProcessor(listener);
    PdfDictionary pageDic = reader.getPageN(1);
    PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
    processor.processContent(ContentByteUtils.getContentBytesForPage(reader, 1), resourcesDic);
    out.flush();
    out.close();
}

It uses the RenderListener implementation MyTextRenderListener.java:

public class MyTextRenderListener implements RenderListener
{
    [...]

    /**
     * @see RenderListener#renderText(TextRenderInfo)
     */
    public void renderText(TextRenderInfo renderInfo) {
        out.print("<");
        out.print(renderInfo.getText());
        out.print(">");
    }
}

While this RenderListener implementation merely outputs the text, the TextRenderInfo object it inspects offers far more information:

public LineSegment getBaseline();    // the baseline for the text (i.e. the line that the text 'sits' on)
public LineSegment getAscentLine();  // the ascentline for the text (i.e. the line that represents the topmost extent that a string of the current font could have)
public LineSegment getDescentLine(); // the descentline for the text (i.e. the line that represents the bottom most extent that a string of the current font could have)
public float getRise();              // the rise, which represents how far above the nominal baseline the text should be rendered

public String getText();             // the text to render
public int getTextRenderMode();      // the text render mode
public DocumentFont getFont();       // the font
public float getSingleSpaceWidth();  // the width, in user space units, of a single space character in the current font

public List<TextRenderInfo> getCharacterRenderInfos(); // details useful if a listener needs access to the position of each individual glyph in the text render operation

Thus, if your RenderListener, in addition to inspecting the text with getText(), also considers getBaseline() or even getAscentLine() and getDescentLine(), you have all the coordinates you will likely need.

PS: There is a wrapper class for the code in ParsingHelloWorld.extractText(), PdfReaderContentParser, which allows you to simply write the following, given a PdfReader reader, an int page, and a RenderListener renderListener:

PdfReaderContentParser parser = new PdfReaderContentParser(reader);
parser.processContent(page, renderListener);
