Get text preceding image in PDF

Problem description

What I'm trying to do is extract the image associated with some text in a PDF file. For instance, a PDF would have a photo of the front of a house. Just above the photo, there would be a caption which reads "Front View". I want the program to search the PDF for the text "Front View" and extract the photo that follows it.

I've looked at iTextSharp, PDFsharp, and other utilities, but all of them treat the text in a PDF and the images separately. There doesn't seem to be any way to figure out that this line of text comes before that image.

We use iTextSharp for manipulating PDFs. I've written a method in C# that will extract an image given a page number, the number of the image on the page, and the image type. For instance, I can extract the 2nd jpeg on page 3. Here is the code for that. What I would like is to be able to search for a line of text in the file and then extract the image that follows that line of text.

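using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Render listener that saves images from one page, optionally filtered by file type and image index.
public class ImageExtractor : IRenderListener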
{
    int _currentPage = 1;
    int _imageCount = 0;
    int _index = 0;
    int _count = 0;
    readonly string _outputFilePrefix;
    readonly string _outputFolder;
    readonly bool _overwriteExistingFiles;
    string[] _fileTypes;

    public ImageExtractor(string outputFilePrefix, string outputFolder, bool overwriteExistingFiles, string[] fileTypes, int index)
    {
        _outputFilePrefix = outputFilePrefix;
        _outputFolder = outputFolder;
        _overwriteExistingFiles = overwriteExistingFiles;
        _fileTypes = fileTypes;
        _index = index;
    }

    public static int ExtractImageByIndex(string pdfPath, string outputFilePrefix, string outputFolder, bool overwriteExistingFiles, int pageNumber, int index, string[] fileTypes = null)
    {
        // Handle setting of any default values
        outputFilePrefix = outputFilePrefix ?? System.IO.Path.GetFileNameWithoutExtension(pdfPath);
        outputFolder = String.IsNullOrEmpty(outputFolder) ? System.IO.Path.GetDirectoryName(pdfPath) : outputFolder;

        var instance = new ImageExtractor(outputFilePrefix, outputFolder, overwriteExistingFiles, fileTypes, index);
        instance._currentPage = pageNumber;

        using (var pdfReader = new PdfReader(pdfPath))
        {
            if (pdfReader.NumberOfPages == 0)
                return 0;

            if (pdfReader.IsEncrypted())
                throw new ApplicationException(pdfPath + " is encrypted.");

            var pdfParser = new PdfReaderContentParser(pdfReader);

            pdfParser.ProcessContent(instance._currentPage, instance);
        }

        return instance._imageCount;
    }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // If _index is greater than 0, we're looking for a specific image. If _count is
        // equal to _index, we've already found it, so don't go any farther.
        if (_index > 0 && _count == _index)
            return;

        var imageObject = renderInfo.GetImage();

        var imageFileName = "";

        if (_fileTypes != null)
        {
            var type = imageObject.GetFileType().ToLower();
            var flag = false;
            foreach (var t in _fileTypes)
            {
                if (t.ToLower() == type)
                {
                    flag = true;
                    break;
                }
            }
            if (flag)
                imageFileName = String.Format("{0}_{1}_{2}.{3}", _outputFilePrefix, _currentPage, _imageCount, imageObject.GetFileType());
        }
        else
        {
            imageFileName = String.Format("{0}_{1}_{2}.{3}", _outputFilePrefix, _currentPage, _imageCount, imageObject.GetFileType());
        }

        if (!string.IsNullOrEmpty(imageFileName))
        {
            // If _index is 0, multiple images may be extracted. If _index is greater than 0,
            // RenderImage will increment count every time it finds an image that matches the
            // file type and will only extract the image if count equals index.
            if (_index > 0)
            {
                _count++;
                if (_count != _index)
                    return;
            }

            var imagePath = System.IO.Path.Combine(_outputFolder, imageFileName);

            if (_overwriteExistingFiles || !File.Exists(imagePath))
            {
                var imageRawBytes = imageObject.GetImageAsBytes();

                File.WriteAllBytes(imagePath, imageRawBytes);

            }

            // Subtle: Always increment even if file is not written. This ensures consistency should only some
            //   of a PDF file's images actually exist.
            _imageCount++;
        }
    }
}

Recommended answer

As already mentioned in a comment, this is very similar to the topic of the question "Extraction of images present inside a paragraph", with the main difference that in the context of that question iText for Java was used instead of iTextSharp for .Net.

A port of the Java SimpleMixedExtractionStrategy from that question might look like this:

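using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection;
using iTextSharp.text.pdf.parser;

// Text extraction strategy that also saves each image and inserts a "[filename]" marker at the image's position.
public class SimpleMixedExtractionStrategy : LocationTextExtractionStrategy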
{
    FieldInfo field = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.Instance | BindingFlags.NonPublic);
    LineSegment UNIT_LINE = new LineSegment(new Vector(0, 0, 1), new Vector(1, 0, 1));
    String outputPath;
    String name;
    int counter = 0;

    public SimpleMixedExtractionStrategy(String outputPath, String name)
    {
        this.outputPath = outputPath;
        this.name = name;
    }

    public override void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        int number = counter++;
        String filename = name + "-" + number + "." + image.GetFileType();
        File.WriteAllBytes(outputPath + filename, image.GetImageAsBytes());

        LineSegment segment = UNIT_LINE.TransformBy(renderInfo.GetImageCTM());
        TextChunk location = new TextChunk("[" + filename + "]", segment.GetStartPoint(), segment.GetEndPoint(), 0f);

        List<TextChunk> locationalResult = (List<TextChunk>)field.GetValue(this);
        locationalResult.Add(location);
    }
}

Just like in the Java implementation, it is necessary to use reflection to access the private List<TextChunk> locationalResult in LocationTextExtractionStrategy. If the use of reflection is not allowed in your project, you can copy the whole source of LocationTextExtractionStrategy into your own class and apply the changes to the copy.
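
The reflection lookup is the fragile part of this approach, since it depends on a private field name that may differ between library versions. The sketch below uses two hypothetical stand-in classes (BaseStrategy and DerivedStrategy, neither of which is part of iTextSharp) to illustrate the pattern: look the field up on the base type that declares it, and fail loudly if it is missing.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical stand-in for a library class that keeps a private list.
public class BaseStrategy
{
    private readonly List<string> locationalResult = new List<string>();

    public string GetResultantText()
    {
        return string.Join("\n", locationalResult);
    }
}

public class DerivedStrategy : BaseStrategy
{
    // A private field is only found via the type that declares it,
    // so the lookup uses typeof(BaseStrategy) rather than GetType().
    static readonly FieldInfo Field = typeof(BaseStrategy).GetField(
        "locationalResult", BindingFlags.Instance | BindingFlags.NonPublic);

    public void AddMarker(string marker)
    {
        if (Field == null)
            throw new InvalidOperationException(
                "Field 'locationalResult' not found; the library version may have renamed it.");

        var list = (List<string>)Field.GetValue(this);
        list.Add(marker);
    }
}
```

Checking the FieldInfo for null before use turns a renamed field into a clear error message instead of a NullReferenceException deep inside content parsing.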

You can use it like this:

String sourceFile = @"SOURCE.pdf";
String imagePath = @"extract\";
String imageBaseName = "SOURCE-";
Directory.CreateDirectory(imagePath);

using (PdfReader pdfReader = new PdfReader(sourceFile))
{
    PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
    for (var i = 1; i <= pdfReader.NumberOfPages; i++)
    {
        SimpleMixedExtractionStrategy listener = new SimpleMixedExtractionStrategy(imagePath, imageBaseName + i);
        parser.ProcessContent(i, listener);
        String text = listener.GetResultantText();
        Console.Write("Text of page {0}:\n---\n{1}\n\n", i, text);
    }
}

For the example file from the referred-to question, the output is:

Text of page 1:
---
Getting Started with Vaadin
• A version of Book of Vaadin that you can browse in the Eclipse Help system.
You can install the plugin as follows:
1. Start Eclipse.
2. Select Help   Software Updates....
3. Select the Available Software tab.
4. Add the Vaadin plugin update site by clicking Add Site....
[book-of-vaadin-page14-1-0.png]
Enter the URL of the Vaadin Update Site: http://vaadin.com/eclipse and click OK. The
Vaadin site should now appear in the Software Updates window.
5. Select all the Vaadin plugins in the tree.
[book-of-vaadin-page14-1-1.png]
Finally, click Install.
Detailed and up-to-date installation instructions for the Eclipse plugin can be found at http://vaad-
in.com/eclipse.
Updating the Vaadin Plugin
If you have automatic updates enabled in Eclipse (see Window   Preferences   Install/Update
  Automatic Updates), the Vaadin plugin will be updated automatically along with other plugins.
Otherwise, you can update the Vaadin plugin (there are actually multiple plugins) manually as
follows:
1. Select Help   Software Updates..., the Software Updates and Add-ons window will
open.
2. Select the Installed Software tab.
14 Vaadin Plugin for Eclipse

So, for your task:

What I would like is to be able to search for a line of text in the file and then extract the image that follows that line of text.

simply search for that line of text in the output string above and look for the next line containing an image file name in square brackets.
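
A minimal sketch of that search, operating only on the extracted page text, might look like the following. The class and method names (CaptionImageFinder, FindImageAfter) are made up for illustration, and the code assumes that each image marker occupies a line of its own, as it does in the sample output above.

```csharp
using System;

public static class CaptionImageFinder
{
    // Returns the file name from the first "[...]" marker line that follows
    // a line containing the caption, or null if there is no such marker.
    public static string FindImageAfter(string pageText, string caption)
    {
        bool captionSeen = false;

        foreach (string rawLine in pageText.Split('\n'))
        {
            string line = rawLine.Trim();

            if (!captionSeen)
            {
                // Case-insensitive match; the caption line itself is never a marker.
                captionSeen = line.IndexOf(caption, StringComparison.OrdinalIgnoreCase) >= 0;
                continue;
            }

            // Assumption: a marker fills the whole line, e.g. "[SOURCE-1-0.png]".
            if (line.Length > 2 && line[0] == '[' && line[line.Length - 1] == ']')
                return line.Substring(1, line.Length - 2);
        }

        return null;
    }
}
```

With the usage code above, CaptionImageFinder.FindImageAfter(text, "Front View") would be called on the string returned by listener.GetResultantText(); the returned name is relative to the image output folder.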

(If your PDF also uses square brackets, you can enclose the file name in other delimiters in the SimpleMixedExtractionStrategy, e.g. some characters from a Unicode private use area.)
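
As a sketch of that variation, the marker could be wrapped in two private-use code points and recovered with a regular expression. The ImageMarker helper and the choice of U+E000 and U+E001 are assumptions for illustration only; in the strategy, the expression "[" + filename + "]" would be replaced by a call to ImageMarker.Wrap(filename).

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageMarker
{
    // Assumed delimiters from the Unicode private use area (U+E000..U+F8FF);
    // any pair that cannot occur in the PDF's own text would do.
    const char Open = '\uE000';
    const char Close = '\uE001';

    // Matches Open, then any run of characters other than Close, then Close.
    static readonly Regex Pattern = new Regex(Open + "([^" + Close + "]*)" + Close);

    public static string Wrap(string filename)
    {
        return Open.ToString() + filename + Close;
    }

    // Returns every marked file name in the extracted text, in order of appearance.
    public static List<string> FindAll(string text)
    {
        var names = new List<string>();
        foreach (Match m in Pattern.Matches(text))
            names.Add(m.Groups[1].Value);
        return names;
    }
}
```

Some PDF fonts map their glyphs into the private use area, so it is worth checking that the chosen code points do not already appear in the extracted text.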
