Extraction of images present inside a paragraph

Question

I am building an application in which I need to parse a system-generated PDF and use the parsed information to populate my application's database columns. Unfortunately, the PDF structure I am dealing with has a column called comments which contains both text and images. I found a way of reading the images and the text separately from the PDF, but my ultimate aim is to add a placeholder, something like {2}, in the place of each image inside the parsed content. Whenever my parser (the application code) parses this line, the system will render the appropriate image in that area; the image is also stored in a separate table inside my application. Please help me resolve this problem.

Thanks in advance.

Recommended answer

As already mentioned in comments, a solution would be to essentially use a customized text extraction strategy to insert a "[ 2]" text chunk at the coordinates of the image.

You can e.g. extend the LocationTextExtractionStrategy like this:

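// Imports assume iText 5.x, where the parser classes live in com.itextpdf.text.pdf.parser.
import java.io.File;
import java.io.IOException;
import java.lang.reflect.Field;
import java.nio.file.Files;
import java.util.List;

import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.Vector;

class SimpleMixedExtractionStrategy extends LocationTextExtractionStrategy
{
    SimpleMixedExtractionStrategy(File outputPath, String name)
    {
        this.outputPath = outputPath;
        this.name = name;
    }

    @Override
    public void renderImage(final ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.getImage();
            if (image == null) return;

            // Store the image in the output folder under a numbered file name.
            int number = counter++;
            final String filename = String.format("%s-%s.%s", name, number, image.getFileType());
            Files.write(new File(outputPath, filename).toPath(), image.getImageAsBytes());

            // Map the bottom edge of the unit square to page coordinates and
            // create a text chunk carrying the file name at that position.
            LineSegment segment = UNIT_LINE.transformBy(renderInfo.getImageCTM());
            TextChunk location = new TextChunk("[" + filename + "]", segment.getStartPoint(), segment.getEndPoint(), 0f);

            // locationalResult is private in the base class, so access it via reflection.
            Field field = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            field.setAccessible(true);
            @SuppressWarnings("unchecked")
            List<TextChunk> locationalResult = (List<TextChunk>) field.get(this);
            locationalResult.add(location);
        }
        catch (IOException | NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
        }
    }

    final File outputPath;
    final String name;
    int counter = 0;

    // Bottom edge of the unit square; the image CTM maps it to the bottom edge of the image.
    final static LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1) , new Vector(1, 0, 1));
}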

(Unfortunately for this kind of work, some members of LocationTextExtractionStrategy are private. Thus, I used some Java reflection. Alternatively you can copy the whole class and change your copy accordingly.)
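The reflection step can be looked at in isolation. The following is a minimal sketch with hypothetical stand-in classes (BaseCollector and MarkerCollector are not part of any library); it only illustrates how a subclass can reach a private list of its superclass, which is what the strategy above does with locationalResult.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a library class that keeps its result list private.
class BaseCollector {
    private final List<String> items = new ArrayList<>();

    void add(String item) {
        items.add(item);
    }

    String joined() {
        return String.join(" ", items);
    }
}

// Subclass that appends a marker to the private list of its superclass.
class MarkerCollector extends BaseCollector {
    void addMarker(String marker) {
        try {
            // getDeclaredField has to be called on the declaring class, not on the subclass.
            Field field = BaseCollector.class.getDeclaredField("items");
            field.setAccessible(true);
            @SuppressWarnings("unchecked")
            List<String> items = (List<String>) field.get(this);
            items.add("[" + marker + "]");
        } catch (ReflectiveOperationException e) {
            // The field name is an implementation detail and may change between versions.
            throw new IllegalStateException("Private field not accessible", e);
        }
    }
}
```

Because such code depends on a private field name, it may break when the library is updated; copying the class, as mentioned above, avoids that dependency at the price of maintaining the copy.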

Using that strategy you can extract mixed contents like this:

// Assumes OUTPUT_PATH is a java.io.File field of the test class that points to an existing folder.
@Test
public void testSimpleMixedExtraction() throws IOException
{
    // try-with-resources closes the stream even if parsing fails
    try (InputStream resourceStream = getClass().getResourceAsStream("book-of-vaadin-page14.pdf"))
    {
        PdfReader reader = new PdfReader(resourceStream);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(OUTPUT_PATH, "book-of-vaadin-page14");
        // Process page 1; images are written to OUTPUT_PATH while the text is collected.
        parser.processContent(1, listener);
        Files.write(new File(OUTPUT_PATH, "book-of-vaadin-page14.txt").toPath(), listener.getResultantText().getBytes());
    }
}

E.g. for my test file (which contains page 14 of the Book of Vaadin) you get this text:

Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help   Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window   Preferences   Install/Update
  Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help   Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse

and two images, book-of-vaadin-page14-0.png and book-of-vaadin-page14-1.png, in OUTPUT_PATH.
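To get from the file name markers to numbered placeholders like the {2} mentioned in the question, the extracted text could be post-processed. This is only a sketch: the class PlaceholderRewriter and its method are hypothetical, and the pattern assumes markers of the form [name-N.ext] as written by the strategy above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class PlaceholderRewriter {
    // Matches markers such as [book-of-vaadin-page14-0.png]:
    // group 1 is the file name, group 2 the counter in front of the extension.
    private static final Pattern MARKER =
            Pattern.compile("\\[([^\\[\\]]+-(\\d+)\\.[A-Za-z0-9]+)\\]");

    // Replaces each marker by {N} and appends the matching file name to fileNames.
    static String rewrite(String extractedText, List<String> fileNames) {
        Matcher matcher = MARKER.matcher(extractedText);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            fileNames.add(matcher.group(1));
            String placeholder = "{" + matcher.group(2) + "}";
            matcher.appendReplacement(result, Matcher.quoteReplacement(placeholder));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

The collected file names can then be stored in the separate image table, keyed by the number used in the placeholder.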

As also already mentioned in comments, this solution is for the easy situation in which the image has text above and/or below but neither left nor right.

If there is text to the left and/or right as well, the problem is that the code above calculates the LineSegment segment as the bottom line of the image, whereas the text strategy usually works with the baseline of the text, which lies above the bottom line.

But in this case one first has to decide at which position on which line one wants the marker to appear in the text anyway. Having decided that, one can adapt the source above.
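One possible adaptation, sketched here without any PDF library, is to place the marker on a horizontal line at a relative height t inside the image instead of on its bottom edge. The helper MarkerLine is hypothetical; it assumes the image CTM is given as the six PDF matrix entries [a b c d e f], which map the unit square onto the image area.

```java
// Hypothetical helper, not part of any library.
final class MarkerLine {
    /**
     * Maps the horizontal line at relative height t of the unit square
     * (t = 0 is the bottom edge, t = 1 the top edge) through an image CTM
     * given as the six PDF matrix entries [a b c d e f].
     * Returns {startX, startY, endX, endY} in page coordinates.
     */
    static double[] lineAt(double[] ctm, double t) {
        double a = ctm[0], b = ctm[1], c = ctm[2], d = ctm[3], e = ctm[4], f = ctm[5];
        // PDF convention: x' = a*x + c*y + e and y' = b*x + d*y + f
        double startX = c * t + e;       // image of the point (0, t)
        double startY = d * t + f;
        double endX = a + c * t + e;     // image of the point (1, t)
        double endY = b + d * t + f;
        return new double[] { startX, startY, endX, endY };
    }
}
```

For an unrotated image 200 units wide and 100 units high placed at (50, 300), t = 0 gives the bottom edge used in the code above, and t = 0.5 gives a line through the vertical middle of the image. In the strategy itself, the same effect could presumably be achieved by transforming a line segment from (0, t, 1) to (1, t, 1) with the image CTM instead of UNIT_LINE; which t matches the baseline of neighbouring text depends on the document.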
