iText reading PDF like pdftotext -layout?


Question

I'm looking for the easiest way to implement a Java solution whose output is quite similar to that of

pdftotext -layout FILE

on Linux machines. (And of course it should be cheap as well.)

I just tried some code snippets for iText, PDFBox and PDFTextStream. The most accurate solution so far is PDFTextStream, which uses the VisualOutputTarget to get a great representation of my file.

So my column layout is recognized correctly and I'm able to work with it. But there should also be a solution for iText, shouldn't there?

Every simple snippet I found produces plainly ordered strings that are a mess (they mix up rows, columns and lines). Is there a solution that is easier and does not involve writing my own strategy? Or is there an open-source strategy I can use?

// I followed mkl's instructions and have written my own strategy object as follows:

package com.test.pdfextractiontest.itext;

import ...


public class MyLocationTextExtractionStrategy implements TextExtractionStrategy {

    /** set to true for debugging */
    static boolean DUMP_STATE = false;

    /** a summary of all found text */
    private final List<TextChunk> locationalResult = new ArrayList<TextChunk>();


    public MyLocationTextExtractionStrategy() {
    }


    @Override
    public void beginTextBlock() {
    }


    @Override
    public void endTextBlock() {
    }

    private boolean startsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(0) == ' ';
    }


    private boolean endsWithSpace(final String str) {
        if (str.length() == 0) {
            return false;
        }
        return str.charAt(str.length() - 1) == ' ';
    }

    private List<TextChunk> filterTextChunks(final List<TextChunk> textChunks, final TextChunkFilter filter) {
        if (filter == null) {
            return textChunks;
        }

        final List<TextChunk> filtered = new ArrayList<TextChunk>();
        for (final TextChunk textChunk : textChunks) {
            if (filter.accept(textChunk)) {
                filtered.add(textChunk);
            }
        }
        return filtered;
    }


    protected boolean isChunkAtWordBoundary(final TextChunk chunk, final TextChunk previousChunk) {
        final float dist = chunk.distanceFromEndOf(previousChunk);

        if (dist < -chunk.getCharSpaceWidth() || dist > chunk.getCharSpaceWidth() / 2.0f) {
            return true;
        }

        return false;
    }

    public String getResultantText(final TextChunkFilter chunkFilter) {
        if (DUMP_STATE) {
            dumpState();
        }

        final List<TextChunk> filteredTextChunks = filterTextChunks(this.locationalResult, chunkFilter);
        Collections.sort(filteredTextChunks);

        final StringBuffer sb = new StringBuffer();
        TextChunk lastChunk = null;
        for (final TextChunk chunk : filteredTextChunks) {

            if (lastChunk == null) {
                sb.append(chunk.text);
            } else {
                if (chunk.sameLine(lastChunk)) {

                    if (isChunkAtWordBoundary(chunk, lastChunk) && !startsWithSpace(chunk.text)
                            && !endsWithSpace(lastChunk.text)) {
                        sb.append(' ');
                    }
                    final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                    for(int i = 0; i<Math.round(dist); i++) {
                        sb.append(' ');
                    }
                    sb.append(chunk.text);
                } else {
                    sb.append('\n');
                    sb.append(chunk.text);
                }
            }
            lastChunk = chunk;
        }

        return sb.toString();
    }

    /** @return a String with the resulting text. */
    @Override
    public String getResultantText() {
        return getResultantText(null);
    }

    private void dumpState() {
        for (final TextChunk location : this.locationalResult) {
            location.printDiagnostics();

            System.out.println();
        }

    }


    @Override
    public void renderText(final TextRenderInfo renderInfo) {
        LineSegment segment = renderInfo.getBaseline();
        if (renderInfo.getRise() != 0) { 

            final Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        final TextChunk location =
                new TextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(),
                        renderInfo.getSingleSpaceWidth(),renderInfo);
        this.locationalResult.add(location);
    }

    public static class TextChunk implements Comparable<TextChunk> {
        /** the text of the chunk */
        private final String text;
        /** the starting location of the chunk */
        private final Vector startLocation;
        /** the ending location of the chunk */
        private final Vector endLocation;
        /** unit vector in the orientation of the chunk */
        private final Vector orientationVector;
        /** the orientation as a scalar for quick sorting */
        private final int orientationMagnitude;

        private final TextRenderInfo info;

        private final int distPerpendicular;

        private final float distParallelStart;

        private final float distParallelEnd;
        /** the width of a single space character in the font of the chunk */
        private final float charSpaceWidth;

        public TextChunk(final String string, final Vector startLocation, final Vector endLocation,
                final float charSpaceWidth,final TextRenderInfo ri) {
            this.text = string;
            this.startLocation = startLocation;
            this.endLocation = endLocation;
            this.charSpaceWidth = charSpaceWidth;

            this.info = ri;

            Vector oVector = endLocation.subtract(startLocation);
            if (oVector.length() == 0) {
                oVector = new Vector(1, 0, 0);
            }
            this.orientationVector = oVector.normalize();
            this.orientationMagnitude =
                    (int) (Math.atan2(this.orientationVector.get(Vector.I2), this.orientationVector.get(Vector.I1)) * 1000);

            final Vector origin = new Vector(0, 0, 1);
            this.distPerpendicular = (int) startLocation.subtract(origin).cross(this.orientationVector).get(Vector.I3);

            this.distParallelStart = this.orientationVector.dot(startLocation);
            this.distParallelEnd = this.orientationVector.dot(endLocation);
        }

        public Vector getStartLocation() {
            return this.startLocation;
        }


        public Vector getEndLocation() {
            return this.endLocation;
        }


        public String getText() {
            return this.text;
        }

        public float getCharSpaceWidth() {
            return this.charSpaceWidth;
        }

        private void printDiagnostics() {
            System.out.println("Text (@" + this.startLocation + " -> " + this.endLocation + "): " + this.text);
            System.out.println("orientationMagnitude: " + this.orientationMagnitude);
            System.out.println("distPerpendicular: " + this.distPerpendicular);
            System.out.println("distParallel: " + this.distParallelStart);
        }


        public boolean sameLine(final TextChunk as) {
            if (this.orientationMagnitude != as.orientationMagnitude) {
                return false;
            }
            if (this.distPerpendicular != as.distPerpendicular) {
                return false;
            }
            return true;
        }


        public float distanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }

        public float myDistanceFromEndOf(final TextChunk other) {
            final float distance = this.distParallelStart - other.distParallelEnd;
            return distance;
        }


        @Override
        public int compareTo(final TextChunk rhs) {
            if (this == rhs) {
                return 0; // not really needed, but just in case
            }

            int rslt;
            rslt = compareInts(this.orientationMagnitude, rhs.orientationMagnitude);
            if (rslt != 0) {
                return rslt;
            }

            rslt = compareInts(this.distPerpendicular, rhs.distPerpendicular);
            if (rslt != 0) {
                return rslt;
            }

            return Float.compare(this.distParallelStart, rhs.distParallelStart);
        }

        private static int compareInts(final int int1, final int int2) {
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }


        public TextRenderInfo getInfo() {
            return this.info;
        }

    }


    @Override
    public void renderImage(final ImageRenderInfo renderInfo) {
        // do nothing
    }


    public static interface TextChunkFilter {

        public boolean accept(TextChunk textChunk);
    }


}

As you can see, most of it is the same as the original class. I just added this:

                final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
                for(int i = 0; i<Math.round(dist); i++) {
                    sb.append(' ');
                }

to the getResultantText method to extend the gaps with spaces. But here is the problem:

The distance seems to be inaccurate or inexact. The result looks like this:

[image of the extracted output]

Does anyone have an idea how to calculate a better value for the distance? I think it's because the original font is ArialMT and my editor uses Courier, but to work with this sheet I need to be able to split the table at the correct places to get my data. That is difficult due to the floating start and end of the values, etc.

: - /

Recommended answer

The problem with your approach of inserting spaces like this

            final Float dist = chunk.distanceFromEndOf(lastChunk)/3;
            for(int i = 0; i<Math.round(dist); i++) {
                sb.append(' ');
            }

is that it assumes that the current position in the StringBuffer exactly corresponds to the end of lastChunk, assuming a character width of 3 user space units. This need not be the case; in general, each addition of characters destroys such a former correspondence. E.g. these two lines have very different widths when using a proportional font:


ililili

MWMWMWM

while in a StringBuffer they occupy the same length.
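To make the mismatch concrete, here is a minimal sketch that compares string length with rendered width. The glyph widths are assumed, illustrative values (in thousandths of the font size, roughly those of a Helvetica-like font) and are not read from any PDF; the class and method names are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class ProportionalWidthDemo {
    // Assumed glyph widths in 1/1000 of the font size; illustrative values only.
    static final Map<Character, Integer> GLYPH_WIDTHS = new HashMap<>();
    static {
        GLYPH_WIDTHS.put('i', 222);
        GLYPH_WIDTHS.put('l', 222);
        GLYPH_WIDTHS.put('M', 833);
        GLYPH_WIDTHS.put('W', 944);
    }

    /** Rendered width of the text in user space units at the given font size. */
    static float renderedWidth(String text, float fontSize) {
        int thousandths = 0;
        for (char c : text.toCharArray()) {
            thousandths += GLYPH_WIDTHS.get(c);
        }
        return thousandths * fontSize / 1000f;
    }

    public static void main(String[] args) {
        // Both strings have the same length but very different rendered widths.
        for (String s : new String[] {"ililili", "MWMWMWM"}) {
            System.out.println(s + ": length=" + s.length()
                    + ", width=" + renderedWidth(s, 10f));
        }
    }
}
```

With these assumed widths the second string comes out about four times as wide as the first, although both add seven characters to the buffer.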

Thus, you have to look at where chunk starts in relation to the left page border and add spaces to the buffer accordingly.

Furthermore, your code completely ignores free space at the start of lines.

Your results should improve if you replace the original method getResultantText(TextChunkFilter) with this code instead:

public String getResultantText(TextChunkFilter chunkFilter){
    if (DUMP_STATE) dumpState();

    List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
    Collections.sort(filteredTextChunks);

    int startOfLinePosition = 0;
    StringBuffer sb = new StringBuffer();
    TextChunk lastChunk = null;
    for (TextChunk chunk : filteredTextChunks) {

        if (lastChunk == null){
            insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
            sb.append(chunk.text);
        } else {
            if (chunk.sameLine(lastChunk))
            {
                if (isChunkAtWordBoundary(chunk, lastChunk))
                {
                    insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, !startsWithSpace(chunk.text) && !endsWithSpace(lastChunk.text));
                }

                sb.append(chunk.text);
            } else {
                sb.append('\n');
                startOfLinePosition = sb.length();
                insertSpaces(sb, startOfLinePosition, chunk.distParallelStart, false);
                sb.append(chunk.text);
            }
        }
        lastChunk = chunk;
    }

    return sb.toString();       
}

void insertSpaces(StringBuffer sb, int startOfLinePosition, float chunkStart, boolean spaceRequired)
{
    int indexNow = sb.length() - startOfLinePosition;
    int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
    int spacesToInsert = indexToBe - indexNow;
    if (spacesToInsert < 1 && spaceRequired)
        spacesToInsert = 1;
    for (; spacesToInsert > 0; spacesToInsert--)
    {
        sb.append(' ');
    }
}

public float pageLeft = 0;
public float fixedCharWidth = 6;

pageLeft is the coordinate of the left page border. The strategy does not know it and, therefore, must be told explicitly; in many cases, though, 0 is the correct value.

Alternatively one could use the minimum distParallelStart value of all chunks. This would cut off the left margin but would not require you to inject the exact left page border value.
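A minimal sketch of that alternative, using plain float start coordinates in place of the distParallelStart values of real chunks. The method names detectPageLeft and column are hypothetical, and the sample coordinates are invented.

```java
import java.util.Arrays;
import java.util.List;

class LeftMarginDemo {
    /** Returns the smallest start coordinate, or 0 if there are no chunks. */
    static float detectPageLeft(List<Float> chunkStarts) {
        if (chunkStarts.isEmpty()) {
            return 0f;
        }
        float min = Float.MAX_VALUE;
        for (float start : chunkStarts) {
            min = Math.min(min, start);
        }
        return min;
    }

    /** Column index of a chunk on a fixed-width character grid. */
    static int column(float chunkStart, float pageLeft, float fixedCharWidth) {
        return (int) ((chunkStart - pageLeft) / fixedCharWidth);
    }

    public static void main(String[] args) {
        // Invented start coordinates of four chunks on a page with a 72 unit margin.
        List<Float> starts = Arrays.asList(72f, 132f, 72f, 192f);
        float pageLeft = detectPageLeft(starts);
        for (float start : starts) {
            System.out.println(start + " -> column " + column(start, pageLeft, 6f));
        }
    }
}
```

In the strategy itself this would mean one extra pass over the filtered chunks before the main loop, since the minimum is only known once all chunks have been seen.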

fixedCharWidth is the assumed character width. Depending on the writing in the PDF in question, a different value might be more appropriate. In your case a value of 3 seems to be better than my 6.
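To try out the effect of pageLeft and fixedCharWidth without a PDF at hand, the following sketch applies the same column mapping to hand-made chunks. The Chunk class is a hypothetical stand-in for the strategy's TextChunk, the coordinates are invented, and as a simplification a separating space is always forced between chunks that would otherwise touch.

```java
import java.util.Arrays;
import java.util.List;

class GridLayoutDemo {
    /** Hypothetical stand-in for a text chunk: start x coordinate plus text. */
    static class Chunk {
        final float x;
        final String text;

        Chunk(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /** Lays out one line of chunks (sorted by x) on a fixed-width character grid. */
    static String layoutLine(List<Chunk> line, float pageLeft, float fixedCharWidth) {
        StringBuilder sb = new StringBuilder();
        for (Chunk chunk : line) {
            int indexToBe = (int) ((chunk.x - pageLeft) / fixedCharWidth);
            int spaces = indexToBe - sb.length();
            // Simplification: keep at least one space between neighbouring chunks.
            if (spaces < 1 && sb.length() > 0) {
                spaces = 1;
            }
            for (; spaces > 0; spaces--) {
                sb.append(' ');
            }
            sb.append(chunk.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> header = Arrays.asList(
                new Chunk(0f, "Name"), new Chunk(60f, "Qty"), new Chunk(120f, "Price"));
        List<Chunk> row = Arrays.asList(
                new Chunk(0f, "Widget"), new Chunk(60f, "12"), new Chunk(120f, "3.50"));
        // A smaller assumed character width spreads the columns further apart.
        for (float width : new float[] {6f, 3f}) {
            System.out.println(layoutLine(header, 0f, width));
            System.out.println(layoutLine(row, 0f, width));
        }
    }
}
```

Because every chunk is placed relative to the line start rather than relative to the previous chunk, columns with the same x coordinate end up at the same character index regardless of how long the preceding text is.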

There is still a lot of room for improvement in this code. E.g.


  • It assumes that there are no text chunks spanning multiple table columns. This assumption very often is correct, but I have seen weird PDFs in which the normal inter-word spacing has been implemented using separate text chunks at some offset but the inter-column spacing was represented by a single space character in a single chunk (spanning the end of one column and the start of the next)! The width of that space character has been manipulated by the word-spacing setting of the PDF graphics state.

  • It ignores different amounts of vertical space.
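One conceivable way to address the vertical spacing is to emit empty lines in proportion to the gap between consecutive baselines. The sketch below is not part of the strategy above: the baseline coordinates and the lineHeight parameter are assumptions that would have to be supplied by the caller. For horizontal text, the gap could presumably be derived from the difference of the distPerpendicular values of two consecutive chunks.

```java
class VerticalGapDemo {
    /** Number of empty lines to emit between two consecutive text lines. */
    static int blankLinesBetween(float previousBaselineY, float currentBaselineY,
            float lineHeight) {
        // In PDF user space y grows upwards, so the earlier line has the larger y.
        float gap = previousBaselineY - currentBaselineY;
        int lineSteps = Math.round(gap / lineHeight);
        return Math.max(0, lineSteps - 1);
    }

    /** Joins the lines, inserting empty lines where the vertical gap is larger. */
    static String layout(float[] baselines, String[] lines, float lineHeight) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                sb.append('\n');
                int blanks = blankLinesBetween(baselines[i - 1], baselines[i], lineHeight);
                for (int b = 0; b < blanks; b++) {
                    sb.append('\n');
                }
            }
            sb.append(lines[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented baselines: the third line sits three line heights below the second.
        float[] baselines = {700f, 688f, 652f};
        String[] lines = {"A", "B", "C"};
        System.out.println(layout(baselines, lines, 12f));
    }
}
```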
