IText 像 pdftotext -layout 一样阅读 PDF? [英] IText reading PDF like pdftotext -layout?

查看:30
本文介绍了IText 像 pdftotext -layout 一样阅读 PDF?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找最简单的方法来实现一个类似于

Im looking for the easiest way to implement a java solution which is quiet similar to the output of

pdftotext -layout FILE

在Linux机器上.(当然它也应该便宜)

on linux machines. (And of course it should be cheap as well)

我刚刚尝试了一些 IText、PDFBox 和 PDFTextStream 的代码片段.迄今为止最准确的解决方案是 PDFTextStream,它使用 VisualOutputTarget 来获得我的文件的完美表示.

I just tried some code snippets of IText, PDFBox and PDFTextStream. The most accurate solution so far is PDFTextStream which uses the VisualOutputTarget to get a great representation of my file.

所以我的列布局被认为是正确的,我可以使用它.但是IText应该也有解决方案,还是?

So my column layout is recognized correct and I'm able to work with it. But there should be also a solution for IText, or?

我发现的每一个简单的片段都会产生一团糟的纯有序字符串(弄乱行/列/行).是否有任何可能更容易且可能不涉及自己的策略的解决方案?或者是否有我可以使用的开源策略?

Every easy snippet I found produces plain ordered strings which are a mess (mess up row/column/lines). Is there any solution which might be easier and may not involve a own Strategy? Or is there a open Source strategy which i can use?

//我按照 mkl 的指示编写并拥有如下策略对象:

// I followed the instructions of mkl and have written and own strategy object as follows:

package com.test.pdfextractiontest.itext;

import ...


public class MyLocationTextExtractionStrategy implements TextExtractionStrategy {

    /** set to true for debugging */
    static boolean DUMP_STATE = false;

    /** a summary of all found text */
    private final List<TextChunk> locationalResult = new ArrayList<TextChunk>();


    public MyLocationTextExtractionStrategy() {
    }


    @Override
    public void beginTextBlock() {
    }


    @Override
    public void endTextBlock() {
    }

    private boolean startsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(0) == ' ';
    }


    private boolean endsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(str.length() - 1) == ' ';
    }

    private List<TextChunk> filterTextChunks(final List<TextChunk> textChunks, final TextChunkFilter filter) {
        if (filter == null) {
            return textChunks;
        }

        final List<TextChunk> filtered = new ArrayList<TextChunk>();
        for (final TextChunk textChunk : textChunks) {
            if (filter.accept(textChunk)) {
                filtered.add(textChunk);
            }
        }
        return filtered;
    }


    protected boolean isChunkAtWordBoundary(final TextChunk chunk, final TextChunk previousChunk) {
        final float dist = chunk.distanceFromEndOf(previousChunk);

        if (dist < -chunk.getCharSpaceWidth() || dist > chunk.getCharSpaceWidth() / 2.0f) {
            return true;
        }

        return false;
    }

    public String getResultantText(final TextChunkFilter chunkFilter) {
        if (DUMP_STATE) {
            dumpState();
        }

        final List<TextChunk> filteredTextChunks = filterTextChunks(this.locationalResult, chunkFilter);
        Collections.sort(filteredTextChunks);

        final StringBuffer sb = new StringBuffer();
        TextChunk lastChunk = null;
        for (final TextChunk chunk : filteredTextChunks) {

            if (lastChunk == null) {
                sb.append(chunk.text);
            } else {
                if (chunk.sameLine(lastChunk)) {

                    if (isChunkAtWordBoundary(chunk, lastChunk) && !startsWithSpace(chunk.text)
                            && !endsWithSpace(lastChunk.text)) {
                        sb.append(' ');
                    }
                    final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                    for(int i = 0; i<Math.round(dist); i++) {
                        sb.append(' ');
                    }
                    sb.append(chunk.text);
                } else {
                    sb.append('
');
                    sb.append(chunk.text);
                }
            }
            lastChunk = chunk;
        }

        return sb.toString();
    }

返回带有结果文本的字符串.*/@覆盖public String getResultantText() {

eturn a String with the resulting text. */ @Override public String getResultantText() {

        return getResultantText(null);

    }

    private void dumpState() {
        for (final TextChunk location : this.locationalResult) {
            location.printDiagnostics();

            System.out.println();
        }

    }


    @Override
    public void renderText(final TextRenderInfo renderInfo) {
        LineSegment segment = renderInfo.getBaseline();
        if (renderInfo.getRise() != 0) { 

            final Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        final TextChunk location =
                new TextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(),
                        renderInfo.getSingleSpaceWidth(),renderInfo);
        this.locationalResult.add(location);
    }

    public static class TextChunk implements Comparable<TextChunk> {
        /** the text of the chunk */
        private final String text;
        /** the starting location of the chunk */
        private final Vector startLocation;
        /** the ending location of the chunk */
        private final Vector endLocation;
        /** unit vector in the orientation of the chunk */
        private final Vector orientationVector;
        /** the orientation as a scalar for quick sorting */
        private final int orientationMagnitude;

        private final TextRenderInfo info;

        private final int distPerpendicular;

        private final float distParallelStart;

        private final float distParallelEnd;
        /** the width of a single space character in the font of the chunk */
        private final float charSpaceWidth;

        public TextChunk(final String string, final Vector startLocation, final Vector endLocation,
                final float charSpaceWidth,final TextRenderInfo ri) {
            this.text = string;
            this.startLocation = startLocation;
            this.endLocation = endLocation;
            this.charSpaceWidth = charSpaceWidth;

            this.info = ri;

            Vector oVector = endLocation.subtract(startLocation);
            if (oVector.length() == 0) {
                oVector = new Vector(1, 0, 0);
            }
            this.orientationVector = oVector.normalize();
            this.orientationMagnitude =
                    (int) (Math.atan2(this.orientationVector.get(Vector.I2), this.orientationVector.get(Vector.I1)) * 1000);

            final Vector origin = new Vector(0, 0, 1);
            this.distPerpendicular = (int) startLocation.subtract(origin).cross(this.orientationVector).get(Vector.I3);

            this.distParallelStart = this.orientationVector.dot(startLocation);
            this.distParallelEnd = this.orientationVector.dot(endLocation);
        }

        public Vector getStartLocation() {
            return this.startLocation;
        }


        public Vector getEndLocation() {
            return this.endLocation;
        }


        public String getText() {
            return this.text;
        }

        public float getCharSpaceWidth() {
            return this.charSpaceWidth;
        }

        private void printDiagnostics() {
            System.out.println("Text (@" + this.startLocation + " -> " + this.endLocation + "): " + this.text);
            System.out.println("orientationMagnitude: " + this.orientationMagnitude);
            System.out.println("distPerpendicular: " + this.distPerpendicular);
            System.out.println("distParallel: " + this.distParallelStart);
        }


        public boolean sameLine(final TextChunk as) {
            if (this.orientationMagnitude != as.orientationMagnitude) {
                return false;
            }
            if (this.distPerpendicular != as.distPerpendicular) {
                return false;
            }
            return true;
        }


        public float distanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }

        public float myDistanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }


        @Override
        public int compareTo(final TextChunk rhs) {
            if (this == rhs) {
                return 0; // not really needed, but just in case
            }

            int rslt;
            rslt = compareInts(this.orientationMagnitude, rhs.orientationMagnitude);
            if (rslt != 0) {
                return rslt;
            }

            rslt = compareInts(this.distPerpendicular, rhs.distPerpendicular);
            if (rslt != 0) {
                return rslt;
            }

            return Float.compare(this.distParallelStart, rhs.distParallelStart);
        }

        private static int compareInts(final int int1, final int int2) {
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }


        public TextRenderInfo getInfo() {
            return this.info;
        }

    }


    @Override
    public void renderImage(final ImageRenderInfo renderInfo) {
        // do nothing
    }


    public static interface TextChunkFilter {

        public boolean accept(TextChunk textChunk);
    }


}

如您所见,大多数与原始类相同.我刚刚添加了这个:

As you can see most is the same as the original class. i just added this :

                final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                for(int i = 0; i<Math.round(dist); i++) {
                    sb.append(' ');
                }

到 getResultantText 方法以用空格扩展间隙.但问题来了:

to the getResultantText Method to extend the gaps with spaces. But here is the problem:

距离似乎不准确或不准确.结果看起来像

the distance seems to be inaccurate or inexact. the result looks like

这个:

有谁知道如何计算距离的更好值或值?我认为这是因为原始字体类型是 ArialMT 并且我的编辑器在 courier 中,但是要使用此表,建议我可以将表格拆分到正确的位置以获取我的数据.由于值 usw 的浮动开始和结束,这很困难.

does anyone have an idea how to calculate a better or value for the distance? i think its because the original font type is ArialMT and my editor is in courier, but to work with this sheet its recommended that i can split the table on the correct place to get my data. thats difficult due the floating start and end of an value usw.

:-/

推荐答案

像这样插入空格的方法的问题

The problem with your approach inserting spaces like this

            final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
            for(int i = 0; i<Math.round(dist); i++) {
                sb.append(' ');
            }

是它假设 StringBuffer 中的当前位置正好对应于 lastChunk 的末尾,假设字符宽度宽度为 3 个用户空间单位.不必如此,通常每添加一个字符都会破坏以前的这种对应关系.例如.使用比例字体时,这两行的宽度不同:

is that it assumes that the current position in the StringBuffer exactly corresponds to the end of lastChunk assuming a character width width of 3 user space units. This needs not be the case, generally each addition of characters destroys such a former correspondence. E.g. these two lines have way different widths when using a proportional font:

伊莉莉

MWMWMWM

虽然在 StringBuffer 中,它们占据相同的长度.

while in a StringBuffer they occupy the same length.

因此,您必须查看chunk 开始的位置相对于左页面边框,并相应地向缓冲区添加空格.

Thus, you have to look where chunk starts in relation to the left page border and add spaces to the buffer accordingly.

此外,您的代码完全忽略了行首的可用空间.

Furthermore your code completely ignores free space at the start of lines.

如果将原始方法 getResultantText(TextChunkFilter 替换为以下代码),结果应该会有所改善:

Your results should improve if you replace the original method getResultantText(TextChunkFilter by this code instead:

public String getResultantText(TextChunkFilter chunkFilter){
    if (DUMP_STATE) dumpState();
    
    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    Collections.sort(filteredTextChunks);

    int startOfLinePosition = 0;
    StringBuffer sb = new StringBuffer();
    TextChunk lastChunk = null;
    for (TextChunk chunk : filteredTextChunks) {

        if (lastChunk == null){
            insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
            sb.append(chunk.text);
        } else {
            if (chunk.sameLine(lastChunk))
            {
                if (isChunkAtWordBoundary(chunk, lastChunk))
                {
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                }
                
                sb.append(chunk.text);
            } else {
                sb.append('
');
                startOfLinePosition = sb.length();
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            }
        }
        lastChunk = chunk;
    }

    return sb.toString();       
}

void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
{
    int indexNow = sb.length() - startOfLinePosition;
    int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
    int spacesToInsert = indexToBe - indexNow;
    if (spacesToInsert < 1 && spaceRequired)
        spacesToInsert = 1;
    for (; spacesToInsert > 0; spacesToInsert--)
    {
        sb.append(' ');
    }
}

public float pageLeft = 0;
public float fixedCharWidth = 6;

pageLeft 是左页边框的坐标.策略不知道,因此必须明确告知;但在许多情况下,0 是正确的值.

pageLeft is the coordinate of the left page border. The strategy does not know it and, therefore, must be told explicitly; in many cases, though, 0 is the correct value.

或者,可以使用所有块的最小 distParallelStart 值.这会切断左边距,但不需要您注入准确的左页边框值.

Alternatively one could use the minimum distParallelStart value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.

fixedCharWidth 是假定的字符宽度.根据相关 PDF 中的文字,不同的值可能更合适.在你的情况下,3 的值似乎比我的 6 好.

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question a different value might be more apropos. In your case a value of 3 seems to be better than my 6.

这段代码还有很大的改进空间.例如

There still is a lot of room for improvement in this code. E.g.

  • 它假设没有跨越多个表列的文本块.这个假设通常是正确的,但我看到过奇怪的 PDF,其中正常的字间距是在某个偏移处使用单独的文本块实现的,但列间距由单个块中的单个空格字符表示(跨越一列的结束和下一列的开始)!该空格字符的宽度已被 PDF 图形状态的字间距设置所控制.

  • It assumes that there are no text chunks spanning multiple table columns. This assumption very often is correct, but I have seen weird PDFs in which the normal inter-word spacing has been implemented using separate text chunks at some offset but the inter-column spacing was represented by a single space character in a single chunk (spanning the end of one column and the start of the next)! The width of that space character has been manipulated by the word-spacing setting of the PDF graphics state.

它忽略了不同数量的垂直空间.

It ignores different amounts of vertical space.

这篇关于IText 像 pdftotext -layout 一样阅读 PDF?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆