Text extraction from a PDF using iText7. How to improve its performance?


Problem description

Currently, I use this code to extract text from a Rectangle (area).

public static class ReaderExtensions
{
    public static string ExtractText(this PdfPage page, Rectangle rect)
    {
        var filter = new IEventFilter[1];
        filter[0] = new TextRegionEventFilter(rect);
        var filteredTextEventListener = new FilteredTextEventListener(new LocationTextExtractionStrategy(), filter);
        var str = PdfTextExtractor.GetTextFromPage(page, filteredTextEventListener);
        return str;
    }
}

It works, but I don't know if it's the best way to do it.

Also, I wonder whether the iText team could improve the performance of GetTextFromPage: I'm processing hundreds of pages in big PDFs, and it usually takes more than 10 minutes with my current configuration.

EDIT:

From the comments: It seems that iText can extract the text of multiple rectangles on the same page in one pass, something that can improve the performance (batched operations tend to be more efficient), but how?

MORE DETAILS!

My goal is to extract data from a PDF with multiple pages. Each page has the same layout: a table with rows and columns.

Currently, I'm using the method above to extract the text of each rectangle. But, as you can see, the extraction isn't batched: it handles only one rectangle at a time. How could I extract all the rectangles of a page in a single pass?

Solution

As already mentioned in a comment, I was surprised to see that the iText 7 LocationTextExtractionStrategy no longer contains anything akin to the iText 5 LocationTextExtractionStrategy method GetResultantText(TextChunkFilter). That method would have allowed you to parse the page once and then extract text from the text chunks in arbitrary page areas out of the box.

But it is possible to bring that feature back. One option would be to add it to a copy of LocationTextExtractionStrategy. That would make for a rather long answer, though, so I chose another option: I use the existing LocationTextExtractionStrategy and, only for the duration of the GetResultantText call, manipulate the strategy's underlying list of text chunks. Instead of a generic TextChunkFilter interface, I restricted the filtering to the criterion at hand: filtering by rectangular area.

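using System;
using System.Collections.Generic;
using System.Reflection;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class ReaderExtensions
{
    // Parses the page once, then extracts the text of each rectangle from the collected chunks.
    public static string[] ExtractText(this PdfPage page, params Rectangle[] rects)
    {
        var textEventListener = new LocationTextExtractionStrategy();
        // The returned full-page text is ignored; the call only fills the strategy with text chunks.
        PdfTextExtractor.GetTextFromPage(page, textEventListener);
        string[] result = new string[rects.Length];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = textEventListener.GetResultantText(rects[i]);
        }
        return result;
    }

    // Returns the text of those chunks whose baseline intersects the given rectangle.
    public static String GetResultantText(this LocationTextExtractionStrategy strategy, Rectangle rect)
    {
        // The chunk list is a private field of the strategy, so it is read via reflection.
        IList<TextChunk> locationalResult = (IList<TextChunk>)locationalResultField.GetValue(strategy);
        List<TextChunk> nonMatching = new List<TextChunk>();
        foreach (TextChunk chunk in locationalResult)
        {
            ITextChunkLocation location = chunk.GetLocation();
            Vector start = location.GetStartLocation();
            Vector end = location.GetEndLocation();
            if (!rect.IntersectsLine(start.Get(Vector.I1), start.Get(Vector.I2), end.Get(Vector.I1), end.Get(Vector.I2)))
            {
                nonMatching.Add(chunk);
            }
        }
        // Temporarily remove the chunks outside the rectangle.
        nonMatching.ForEach(c => locationalResult.Remove(c));
        try
        {
            return strategy.GetResultantText();
        }
        finally
        {
            // Put the removed chunks back so the strategy can be reused for the next rectangle.
            nonMatching.ForEach(c => locationalResult.Add(c));
        }
    }

    static FieldInfo locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);
}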
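
As a usage illustration, the ExtractText extension above could be applied to a table whose cells sit at the same positions on every page. This is only a sketch: the file name table.pdf, the class TableExtractionDemo, and all geometry constants are hypothetical and need to be replaced by values measured from the real document.

```csharp
using System;
using System.Collections.Generic;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;

public static class TableExtractionDemo
{
    // Hypothetical table geometry in PDF user-space units; adjust to the real layout.
    const float Left = 36f, Top = 770f, ColumnWidth = 170f, RowHeight = 20f;
    const int Columns = 3, Rows = 30;

    // Builds one rectangle per table cell, row by row from the top of the page.
    static Rectangle[] BuildCellRectangles()
    {
        var cells = new List<Rectangle>();
        for (int row = 0; row < Rows; row++)
        {
            for (int col = 0; col < Columns; col++)
            {
                // Rectangle takes the lower-left corner plus width and height.
                cells.Add(new Rectangle(Left + col * ColumnWidth,
                                        Top - (row + 1) * RowHeight,
                                        ColumnWidth, RowHeight));
            }
        }
        return cells.ToArray();
    }

    public static void Main(string[] args)
    {
        // Every page shares the layout, so the rectangles are built only once.
        Rectangle[] cells = BuildCellRectangles();
        using (var pdf = new PdfDocument(new PdfReader("table.pdf"))) // hypothetical file name
        {
            for (int pageNumber = 1; pageNumber <= pdf.GetNumberOfPages(); pageNumber++)
            {
                // One parse of the page content, then one filtered lookup per cell.
                string[] texts = pdf.GetPage(pageNumber).ExtractText(cells);
                for (int i = 0; i < texts.Length; i++)
                {
                    Console.WriteLine($"page {pageNumber}, row {i / Columns}, column {i % Columns}: {texts[i]}");
                }
            }
        }
    }
}
```

Because the filter tests whether a chunk's baseline intersects the rectangle, a chunk that spans or touches the border between two adjacent cells may be reported for both; slightly shrinking the rectangles is one way to reduce that.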

The central piece is the GetResultantText extension method for LocationTextExtractionStrategy: it takes a strategy that already contains the information from a page, temporarily restricts that information to the chunks in a given rectangle, extracts the text, and then restores the previous state. This requires some reflection (access to a private field); I hope that is ok for you.
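
If reflection is not acceptable, the same single-pass idea can be sketched without it by collecting the text chunks in a custom listener. The names ChunkCollector and SinglePassExtraction are hypothetical. This is not the LocationTextExtractionStrategy logic: it orders chunks naively (top to bottom, then left to right) and inserts no spaces or line breaks, so its output can differ from that of the code above.

```csharp
using System.Collections.Generic;
using System.Linq;
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

// Hypothetical listener: records each text chunk of a page together with its baseline.
public class ChunkCollector : IEventListener
{
    public readonly List<(string Text, Vector Start, Vector End)> Chunks =
        new List<(string Text, Vector Start, Vector End)>();

    public void EventOccurred(IEventData data, EventType type)
    {
        if (type != EventType.RENDER_TEXT) return;
        var info = (TextRenderInfo)data;
        // Read everything needed inside the event; the render info is not kept afterwards.
        LineSegment baseline = info.GetBaseline();
        Chunks.Add((info.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint()));
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        return new HashSet<EventType> { EventType.RENDER_TEXT };
    }
}

public static class SinglePassExtraction
{
    // Parses the page content once, then filters the collected chunks per rectangle.
    public static string[] ExtractText(PdfPage page, params Rectangle[] rects)
    {
        var collector = new ChunkCollector();
        new PdfCanvasProcessor(collector).ProcessPageContent(page);

        return rects.Select(rect => string.Concat(
            collector.Chunks
                // Same criterion as above: the chunk baseline intersects the rectangle.
                .Where(c => rect.IntersectsLine(
                    c.Start.Get(Vector.I1), c.Start.Get(Vector.I2),
                    c.End.Get(Vector.I1), c.End.Get(Vector.I2)))
                // Naive reading order: higher baselines first, then left to right.
                .OrderByDescending(c => c.Start.Get(Vector.I2))
                .ThenBy(c => c.Start.Get(Vector.I1))
                .Select(c => c.Text)))
            .ToArray();
    }
}
```

The trade-off is that the word and line assembly of LocationTextExtractionStrategy is lost, which is why copying that class and adding a rectangle filter to it remains the more complete option mentioned in the answer.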
