How can I extract subscript / superscript properly from a PDF using iTextSharp?


Question

iTextSharp works well extracting plain text from PDF documents, but I'm having trouble with subscript/superscript text, common in technical documents.

TextChunk.SameLine() requires two chunks to have identical vertical positioning to be "on" the same line, which isn't the case for superscript or subscript text. For example, on page 11 of this document, under "COMBUSTION EFFICIENCY":

http://www.mass.gov/courts/docs/lawlib/300-399cmr/310cmr7.pdf

Expected text:

monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO2 /(CO + CO2)]

Resulting text:

monoxide (CO) in flue gas in accordance with the following formula: C.E. = [CO /(CO + CO )] 
2 2 

I moved SameLine() to LocationTextExtractionStrategy and made public getters for the private TextChunk properties it reads. This allowed me to adjust the tolerance on the fly in my own subclass, shown here:

public class SubSuperStrategy : LocationTextExtractionStrategy {
  public int SameLineOrientationTolerance { get; set; }
  public int SameLineDistanceTolerance { get; set; }

  public override bool SameLine(TextChunk chunk1, TextChunk chunk2) {
    var orientationDelta = Math.Abs(chunk1.OrientationMagnitude
       - chunk2.OrientationMagnitude);
    if(orientationDelta > SameLineOrientationTolerance) return false;
    var distDelta = Math.Abs(chunk1.DistPerpendicular
       - chunk2.DistPerpendicular);
    return (distDelta <= SameLineDistanceTolerance);
  }
}



Using a SameLineDistanceTolerance of 3, this corrects which line the sub/super chunks are assigned to, but the relative position of the text is way off:

monoxide (CO) in flue gas in accordance with the following formula:   C.E. = [CO /(CO + CO )] 2 2

Sometimes the chunks get inserted somewhere in the middle of the text, and sometimes (as with this example) at the end. Either way, they don't end up in the right place. I suspect this might have something to do with font sizes, but I'm at my limits of understanding the bowels of this code.

Has anyone found another way to deal with this?

(I'm happy to submit a pull request with my changes if that helps.)

Recommended answer

To properly extract these subscripts and superscripts in line, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.
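Why relaxing the tolerance alone is not enough can be sketched without iText. The model below assumes that the stock strategy first sorts chunks top to bottom by their truncated vertical position and then left to right, and only afterwards asks whether neighbouring chunks share a line. `Chunk`, `extract`, and the coordinates are hypothetical stand-ins for illustration, not iText API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Standalone model (no iText): a same-line tolerance does not change the sort order. */
class ToleranceOrderSketch {
    /** Hypothetical stand-in for a text chunk: its text and baseline start point. */
    static final class Chunk {
        final String text;
        final float x, y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    static String extract(List<Chunk> chunks, int tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Assumed ordering: higher y (top of page) first, then increasing x.
        sorted.sort((a, b) -> {
            int byY = Integer.compare(-(int) a.y, -(int) b.y);
            return byY != 0 ? byY : Float.compare(a.x, b.x);
        });
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                // The tolerance only decides between "same line" and "new line".
                boolean sameLine = Math.abs((int) prev.y - (int) c.y) <= tolerance;
                sb.append(sameLine ? ' ' : '\n');
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
            new Chunk("[CO", 100f, 702f),
            new Chunk("2", 118f, 699f),       // subscript, 3 units lower
            new Chunk("/(CO + CO", 124f, 702f),
            new Chunk("2", 170f, 699f),       // subscript, 3 units lower
            new Chunk(")]", 176f, 702f));
        System.out.println(extract(chunks, 0)); // subscripts end up on their own line
        System.out.println(extract(chunks, 3)); // same line, but still after ")]"
    }
}
```

In this model the subscripts are already sorted behind the rest of the line before the tolerance is consulted, which matches the "2 2" trailing the formula in the question. The line assignment therefore has to feed into the ordering itself.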

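I'm more at home with Java / iText; thus, I implemented this approach in Java first and only afterwards translated it to C# / iTextSharp.

I'm using the current development branch, iText 5.5.8-SNAPSHOT.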

Assuming text lines to be horizontal and the vertical extents of the bounding boxes of glyphs on different lines not to overlap, one can try to identify lines using a RenderListener like this:

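import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class TextLineFinder implements RenderListener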
{
    @Override
    public void beginTextBlock() { }
    @Override
    public void endTextBlock() { }
    @Override
    public void renderImage(ImageRenderInfo renderInfo) { }

    /*
     * @see RenderListener#renderText(TextRenderInfo)
     */
    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.getAscentLine();
        LineSegment descentLine = renderInfo.getDescentLine();
        float[] yCoords = new float[]{
                ascentLine.getStartPoint().get(Vector.I2),
                ascentLine.getEndPoint().get(Vector.I2),
                descentLine.getStartPoint().get(Vector.I2),
                descentLine.getEndPoint().get(Vector.I2)
        };
        Arrays.sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    /**
     * This method marks the given interval as used.
     */
    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.size(); i++)
        {
            float flip = verticalFlips.get(i);
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.size(); j++)
            {
                flip = verticalFlips.get(j);
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        boolean fromOutsideInterval = i%2==0;
        boolean toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.remove(j);
        if (toOutsideInterval)
            verticalFlips.add(i, to);
        if (fromOutsideInterval)
            verticalFlips.add(i, from);
    }

    final List<Float> verticalFlips = new ArrayList<Float>();
}

TextLineFinder.java

This RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even in case of subscripts and superscripts.

This class is essentially the PageVerticalAnalyzer used in a related answer.
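The interval bookkeeping can be tried in isolation. The following is a minimal standalone sketch of the same merging idea, without iText; the class name `VerticalBandsSketch`, the method `addUsed`, and the sample coordinates are assumptions made for illustration. It shows a subscript whose vertical extent overlaps its parent line being absorbed into that line's band, while a separate line keeps its own band.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch (no iText): merges vertical glyph extents into line bands. */
class VerticalBandsSketch {
    /** Sorted y values at which used and unused space alternate: [from0, to0, from1, to1, ...]. */
    final List<Float> flips = new ArrayList<>();

    /** Marks [from, to] as used, merging it with any bands it overlaps. */
    void addUsed(float from, float to) {
        if (to < from) { float t = from; from = to; to = t; }
        // i = index of the first flip >= from, j = index of the first flip >= to
        int i = 0;
        while (i < flips.size() && flips.get(i) < from) i++;
        int j = i;
        while (j < flips.size() && flips.get(j) < to) j++;
        boolean fromOutside = i % 2 == 0; // even index: 'from' lies in unused space
        boolean toOutside = j % 2 == 0;   // even index: 'to' lies in unused space
        // Drop every flip swallowed by the new interval, then add the new borders.
        for (int k = j - 1; k >= i; k--) flips.remove(k);
        if (toOutside) flips.add(i, to);
        if (fromOutside) flips.add(i, from);
    }

    public static void main(String[] args) {
        VerticalBandsSketch bands = new VerticalBandsSketch();
        bands.addUsed(700f, 712f); // regular glyphs of the first line (descent to ascent)
        bands.addUsed(697f, 705f); // a subscript hanging slightly below that line
        bands.addUsed(680f, 692f); // regular glyphs of the second line
        System.out.println(bands.flips); // [680.0, 692.0, 697.0, 712.0]
    }
}
```

The result holds two bands, 680 to 692 and 697 to 712. If the subscript reached down to 692 or below, the two lines would merge into one band, which is exactly the non-overlap assumption stated above.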

Sorting text chunks

Having identified the lines as above, one can tweak iText's LocationTextExtractionStrategy to sort along those lines like this:

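import java.lang.reflect.Field;
import java.util.List;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy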
{
    public class HorizontalTextChunk extends TextChunk
    {
        public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth)
        {
            super(string, startLocation, endLocation, charSpaceWidth);
        }

        @Override
        public int compareTo(TextChunk rhs)
        {
            if (rhs instanceof HorizontalTextChunk)
            {
                HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                if (rslt != 0) return rslt;
                return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
            }
            else
                return super.compareTo(rhs);
        }

        @Override
        public boolean sameLine(TextChunk as)
        {
            if (as instanceof HorizontalTextChunk)
            {
                HorizontalTextChunk horAs = (HorizontalTextChunk) as;
                return getLineNumber() == horAs.getLineNumber();
            }
            else
                return super.sameLine(as);
        }

        public int getLineNumber()
        {
            Vector startLocation = getStartLocation();
            float y = startLocation.get(Vector.I2);
            List<Float> flips = textLineFinder.verticalFlips;
            if (flips == null || flips.isEmpty())
                return 0;
            if (y < flips.get(0))
                return flips.size() / 2 + 1;
            for (int i = 1; i < flips.size(); i+=2)
            {
                if (y < flips.get(i))
                {
                    return (1 + flips.size() - i) / 2;
                }
            }
            return 0;
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        textLineFinder.renderText(renderInfo);

        LineSegment segment = renderInfo.getBaseline();
        if (renderInfo.getRise() != 0){ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
            Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
            segment = segment.transformBy(riseOffsetTransform);
        }
        TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth());
        getLocationalResult().add(location);        
    }

    public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException
    {
        locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
        locationalResultField.setAccessible(true);

        textLineFinder = new TextLineFinder();
    }

    @SuppressWarnings("unchecked")
    List<TextChunk> getLocationalResult()
    {
        try
        {
            return (List<TextChunk>) locationalResultField.get(this);
        }
        catch (IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
            throw new RuntimeException(e);
        }
    }

    final Field locationalResultField;
    final TextLineFinder textLineFinder;
}

HorizontalTextExtractionStrategy.java

This TextExtractionStrategy uses a TextLineFinder to identify horizontal text lines and then uses this information to sort the text chunks.
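The ordering rule can be illustrated on its own: look up each chunk's line band, sort by line number, and only then by horizontal position. This is a standalone sketch without iText; `Chunk`, `join`, and the coordinates are hypothetical, and the band borders are hard-coded rather than collected from a page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Standalone sketch (no iText): order chunks by line band first, then by x. */
class LineSortSketch {
    /** Hypothetical stand-in for a text chunk: its text and baseline start point. */
    static final class Chunk {
        final String text;
        final float x, y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    /** Band lookup in the style of getLineNumber(): lines are counted from the top of the page. */
    static int lineNumber(List<Float> flips, float y) {
        if (flips.isEmpty()) return 0;
        if (y < flips.get(0)) return flips.size() / 2 + 1;
        for (int i = 1; i < flips.size(); i += 2)
            if (y < flips.get(i)) return (1 + flips.size() - i) / 2;
        return 0;
    }

    /** Sorts by (line number, x) and starts a new output line whenever the line number changes. */
    static String join(List<Chunk> chunks, List<Float> flips) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> {
            int byLine = Integer.compare(lineNumber(flips, a.y), lineNumber(flips, b.y));
            return byLine != 0 ? byLine : Float.compare(a.x, b.x);
        });
        StringBuilder sb = new StringBuilder();
        int previousLine = -1;
        for (Chunk c : sorted) {
            int line = lineNumber(flips, c.y);
            if (previousLine != -1 && line != previousLine) sb.append('\n');
            sb.append(c.text);
            previousLine = line;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two bands: 680..692 (second line) and 697..712 (first line incl. subscripts)
        List<Float> flips = Arrays.asList(680f, 692f, 697f, 712f);
        List<Chunk> chunks = Arrays.asList(
            new Chunk("2", 170f, 699f),
            new Chunk("next line", 100f, 682f),
            new Chunk(")]", 176f, 702f),
            new Chunk("[CO", 100f, 702f),
            new Chunk("2", 118f, 699f),
            new Chunk("/(CO + CO", 124f, 702f));
        System.out.println(join(chunks, flips));
    }
}
```

Because the subscripts share a line number with their neighbours, only the x coordinate decides their position within the line. The sketch omits the space insertion that the real strategy performs between chunks.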

Beware, this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy the LocationTextExtractionStrategy and directly insert the code.

Now one can use this text extraction strategy to extract the text with inline superscripts and subscripts like this:

String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
    return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
}

(from ExtractSuperAndSubInLine.java)

The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", now is extracted like this:

monoxide (CO) in flue gas in accordance with the following formula:   C.E. = [CO 2/(CO + CO 2 )] 



The same approach using C# & iTextSharp

Explanations, warnings, and sample results from the Java-centric section still apply; here is the code.

I'm using iTextSharp 5.5.7.

using System;
using System.Collections.Generic;
using iTextSharp.text.pdf.parser;

public class TextLineFinder : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        LineSegment ascentLine = renderInfo.GetAscentLine();
        LineSegment descentLine = renderInfo.GetDescentLine();
        float[] yCoords = new float[]{
            ascentLine.GetStartPoint()[Vector.I2],
            ascentLine.GetEndPoint()[Vector.I2],
            descentLine.GetStartPoint()[Vector.I2],
            descentLine.GetEndPoint()[Vector.I2]
        };
        Array.Sort(yCoords);
        addVerticalUseSection(yCoords[0], yCoords[3]);
    }

    void addVerticalUseSection(float from, float to)
    {
        if (to < from)
        {
            float temp = to;
            to = from;
            from = temp;
        }

        int i=0, j=0;
        for (; i<verticalFlips.Count; i++)
        {
            float flip = verticalFlips[i];
            if (flip < from)
                continue;

            for (j=i; j<verticalFlips.Count; j++)
            {
                flip = verticalFlips[j];
                if (flip < to)
                    continue;
                break;
            }
            break;
        }
        bool fromOutsideInterval = i%2==0;
        bool toOutsideInterval = j%2==0;

        while (j-- > i)
            verticalFlips.RemoveAt(j);
        if (toOutsideInterval)
            verticalFlips.Insert(i, to);
        if (fromOutsideInterval)
            verticalFlips.Insert(i, from);
    }

    public List<float> verticalFlips = new List<float>();
}



Sorting text chunks by those lines

public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy
{
    public class HorizontalTextChunk : TextChunk
    {
        public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder)
            : base(stringValue, startLocation, endLocation, charSpaceWidth)
        {
            this.textLineFinder = textLineFinder;
        }

        override public int CompareTo(TextChunk rhs)
        {
            if (rhs is HorizontalTextChunk)
            {
                HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber());
                if (rslt != 0) return rslt;
                return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]);
            }
            else
                return base.CompareTo(rhs);
        }

        public override bool SameLine(TextChunk a)
        {
            if (a is HorizontalTextChunk)
            {
                HorizontalTextChunk horAs = (HorizontalTextChunk) a;
                return getLineNumber() == horAs.getLineNumber();
            }
            else
                return base.SameLine(a);
        }

        public int getLineNumber()
        {
            Vector startLocation = StartLocation;
            float y = startLocation[Vector.I2];
            List<float> flips = textLineFinder.verticalFlips;
            if (flips == null || flips.Count == 0)
                return 0;
            if (y < flips[0])
                return flips.Count / 2 + 1;
            for (int i = 1; i < flips.Count; i+=2)
            {
                if (y < flips[i])
                {
                    return (1 + flips.Count - i) / 2;
                }
            }
            return 0;
        }

        private static int CompareInts(int int1, int int2){
            return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
        }

        private static int CompareFloats(float float1, float float2)
        {
            return float1 == float2 ? 0 : float1 < float2 ? -1 : 1;
        }

        TextLineFinder textLineFinder;
    }

    public override void RenderText(TextRenderInfo renderInfo)
    {
        textLineFinder.RenderText(renderInfo);

        LineSegment segment = renderInfo.GetBaseline();
        if (renderInfo.GetRise() != 0){ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to 
            Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
            segment = segment.TransformBy(riseOffsetTransform);
        }
        TextChunk location = new HorizontalTextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), textLineFinder);
        getLocationalResult().Add(location);        
    }

    public HorizontalTextExtractionStrategy()
    {
        locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
        textLineFinder = new TextLineFinder();
    }

    List<TextChunk> getLocationalResult()
    {
        return (List<TextChunk>) locationalResultField.GetValue(this);
    }

    System.Reflection.FieldInfo locationalResultField;
    TextLineFinder textLineFinder;
}



Extracting the text

    string extract(PdfReader reader, int pageNo)
    {
        return PdfTextExtractor.GetTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
    }



UPDATE: Changes in LocationTextExtractionStrategy

In iText 5.5.9-SNAPSHOT, commits 53526e4854fcb80c86cbc2e113f7a07401dc9a67 ("Refactor LocationTextExtractionStrategy...") through 1ab350beae148be2a4bef5e663b3d67a004ff9f8 ("Make TextChunkLocation a Comparable<> class...") changed the LocationTextExtractionStrategy architecture to allow for customizations like this without the need for reflection.

Unfortunately this change breaks the HorizontalTextExtractionStrategy presented above. For iText versions after those commits one can use the following strategy:

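import java.util.List;

import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class HorizontalTextExtractionStrategy2 extends LocationTextExtractionStrategy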
{
    public static class HorizontalTextChunkLocationStrategy implements TextChunkLocationStrategy
    {
        public HorizontalTextChunkLocationStrategy(TextLineFinder textLineFinder)
        {
            this.textLineFinder = textLineFinder;
        }

        @Override
        public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline)
        {
            return new HorizontalTextChunkLocation(baseline.getStartPoint(), baseline.getEndPoint(), renderInfo.getSingleSpaceWidth());
        }

        final TextLineFinder textLineFinder;

        public class HorizontalTextChunkLocation implements TextChunkLocation
        {
            /** the starting location of the chunk */
            private final Vector startLocation;
            /** the ending location of the chunk */
            private final Vector endLocation;
            /** unit vector in the orientation of the chunk */
            private final Vector orientationVector;
            /** the orientation as a scalar for quick sorting */
            private final int orientationMagnitude;
            /** perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system)
             * we round to the nearest integer to handle the fuzziness of comparing floats */
            private final int distPerpendicular;
            /** distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
            private final float distParallelStart;
            /** distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
            private final float distParallelEnd;
            /** the width of a single space character in the font of the chunk */
            private final float charSpaceWidth;

            public HorizontalTextChunkLocation(Vector startLocation, Vector endLocation, float charSpaceWidth)
            {
                this.startLocation = startLocation;
                this.endLocation = endLocation;
                this.charSpaceWidth = charSpaceWidth;

                Vector oVector = endLocation.subtract(startLocation);
                if (oVector.length() == 0)
                {
                    oVector = new Vector(1, 0, 0);
                }
                orientationVector = oVector.normalize();
                orientationMagnitude = (int)(Math.atan2(orientationVector.get(Vector.I2), orientationVector.get(Vector.I1))*1000);

                // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                // the two vectors we are crossing are in the same plane, so the result will be purely
                // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                Vector origin = new Vector(0,0,1);
                distPerpendicular = (int)(startLocation.subtract(origin)).cross(orientationVector).get(Vector.I3);

                distParallelStart = orientationVector.dot(startLocation);
                distParallelEnd = orientationVector.dot(endLocation);
            }

            public int orientationMagnitude()   {   return orientationMagnitude;    }
            public int distPerpendicular()      {   return distPerpendicular;       }
            public float distParallelStart()    {   return distParallelStart;       }
            public float distParallelEnd()      {   return distParallelEnd;         }
            public Vector getStartLocation()    {   return startLocation;           }
            public Vector getEndLocation()      {   return endLocation;             }
            public float getCharSpaceWidth()    {   return charSpaceWidth;          }

            /**
             * @param as the location to compare to
             * @return true if this location is on the same line as the other
             */
            public boolean sameLine(TextChunkLocation as)
            {
                if (as instanceof HorizontalTextChunkLocation)
                {
                    HorizontalTextChunkLocation horAs = (HorizontalTextChunkLocation) as;
                    return getLineNumber() == horAs.getLineNumber();
                }
                else
                    return orientationMagnitude() == as.orientationMagnitude() && distPerpendicular() == as.distPerpendicular();
            }

            /**
             * Computes the distance between the end of 'other' and the beginning of this chunk
             * in the direction of this chunk's orientation vector.  Note that it's a bad idea
             * to call this for chunks that aren't on the same line and orientation, but we don't
             * explicitly check for that condition for performance reasons.
             * @param other
             * @return the number of spaces between the end of 'other' and the beginning of this chunk
             */
            public float distanceFromEndOf(TextChunkLocation other)
            {
                float distance = distParallelStart() - other.distParallelEnd();
                return distance;
            }

            public boolean isAtWordBoundary(TextChunkLocation previous)
            {
                /**
                 * Here we handle a very specific case which in PDF may look like:
                 * -.232 Tc [( P)-226.2(r)-231.8(e)-230.8(f)-238(a)-238.9(c)-228.9(e)]TJ
                 * The font's charSpace width is 0.232 and it's compensated with charSpacing of 0.232.
                 * And a resultant TextChunk.charSpaceWidth comes to TextChunk constructor as 0.
                 * In this case every chunk is considered as a word boundary and space is added.
                 * We should consider charSpaceWidth equal (or close) to zero as a no-space.
                 */
                if (getCharSpaceWidth() < 0.1f)
                    return false;

                float dist = distanceFromEndOf(previous);

                return dist < -getCharSpaceWidth() || dist > getCharSpaceWidth()/2.0f;
            }

            public int getLineNumber()
            {
                Vector startLocation = getStartLocation();
                float y = startLocation.get(Vector.I2);
                List<Float> flips = textLineFinder.verticalFlips;
                if (flips == null || flips.isEmpty())
                    return 0;
                if (y < flips.get(0))
                    return flips.size() / 2 + 1;
                for (int i = 1; i < flips.size(); i+=2)
                {
                    if (y < flips.get(i))
                    {
                        return (1 + flips.size() - i) / 2;
                    }
                }
                return 0;
            }

            @Override
            public int compareTo(TextChunkLocation rhs)
            {
                if (rhs instanceof HorizontalTextChunkLocation)
                {
                    HorizontalTextChunkLocation horRhs = (HorizontalTextChunkLocation) rhs;
                    int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                    if (rslt != 0) return rslt;
                    return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
                }
                else
                {
                    int rslt;
                    rslt = Integer.compare(orientationMagnitude(), rhs.orientationMagnitude());
                    if (rslt != 0) return rslt;

                    rslt = Integer.compare(distPerpendicular(), rhs.distPerpendicular());
                    if (rslt != 0) return rslt;

                    return Float.compare(distParallelStart(), rhs.distParallelStart());
                }
            }
        }
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
        textLineFinder.renderText(renderInfo);
        super.renderText(renderInfo);
    }

    public HorizontalTextExtractionStrategy2() throws NoSuchFieldException, SecurityException
    {
        this(new TextLineFinder());
    }

    public HorizontalTextExtractionStrategy2(TextLineFinder textLineFinder) throws NoSuchFieldException, SecurityException
    {
        super(new HorizontalTextChunkLocationStrategy(textLineFinder));

        this.textLineFinder = textLineFinder;
    }

    final TextLineFinder textLineFinder;
}

HorizontalTextExtractionStrategy2.java
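The word-boundary rule inside isAtWordBoundary() decides where the extracted text receives a space, so it also affects how tightly a subscript attaches to its neighbour. A minimal standalone restatement of that rule follows; the class name, the three-parameter signature, and the sample numbers are assumptions for illustration only.

```java
/** Standalone restatement (no iText) of the gap test used to insert spaces between chunks. */
class WordBoundarySketch {
    /**
     * @param charSpaceWidth width of a space in the current chunk's font
     * @param previousEnd    x position at which the previous chunk ends
     * @param start          x position at which the current chunk starts
     */
    static boolean isAtWordBoundary(float charSpaceWidth, float previousEnd, float start) {
        // A (near) zero space width carries no usable information: never report a boundary.
        if (charSpaceWidth < 0.1f)
            return false;
        float dist = start - previousEnd;
        // Boundary if the chunk jumps back by more than a space or leaves a gap above half a space.
        return dist < -charSpaceWidth || dist > charSpaceWidth / 2.0f;
    }

    public static void main(String[] args) {
        System.out.println(isAtWordBoundary(2.5f, 100f, 100.5f)); // false: tiny gap
        System.out.println(isAtWordBoundary(2.5f, 100f, 102f));   // true: gap above half a space
        System.out.println(isAtWordBoundary(2.5f, 100f, 96f));    // true: large backward jump
        System.out.println(isAtWordBoundary(0f, 100f, 110f));     // false: space width unknown
    }
}
```

This may explain the stray spaces in the sample output ("CO 2/(CO + CO 2 )"): the space width used is that of the chunk being added, so a subscript set in a smaller font has a different threshold than the full-size text next to it.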
