Extract unselectable content from PDF


Question



    I'm using Apache PDFBox to extract pages from PDF files and I can't find a way to extract content that is unselectable (either text or images). With content that is selectable from within the PDF files there is no problem.

    Note that the PDFs in question don't have any restrictions on copying content, at least according to the files' "Document Restrictions Summary": "Content Copying" and "Content Copying for Accessibility" are both allowed. The same PDF file contains some content that is selectable and other parts that aren't. As a result, the extracted pages come out with "holes", i.e., they only contain the selectable parts of the PDF. In MS Word, though, if I add the PDFs as objects, the whole content of the PDF pages appears! So I was hoping to do the same with the PDFBox lib, or any other Java lib for that matter!

    Here is the code I'm using to convert PDF pages to images:

    private void convertPdfToImage(File pdfFile, int pdfId) throws IOException {
       PDDocument document = PDDocument.loadNonSeq(pdfFile, null);
       List<PDPage> pdPages = document.getDocumentCatalog().getAllPages();
       for (PDPage pdPage : pdPages) { 
           BufferedImage bim = pdPage.convertToImage(BufferedImage.TYPE_INT_RGB, 300);
           ImageIOUtil.writeImage(bim, TEMP_FILEPATH + pdfId + ".png", 300);
       }
       document.close();
    }
    

    Is there a way to extract unselectable content from a PDF with the Apache PDFBox library (or with any other similar library)? Or is this not possible at all? And if it indeed isn't, why not?

    Any help is much appreciated!

    EDIT: I'm using Adobe Reader as PDF viewer and PDFBox v1.8. Here is a sample PDF: https://dl.dropboxusercontent.com/u/2815529/test.pdf

    Solution

    The two images in question, the fischer logo in the upper right and the small sketch a bit down, are each drawn by filling a region on the page with a tiling pattern which in turn in its content stream draws the respective image.

    Adobe Reader does not allow selecting the contents of patterns, and automatic image extractors often do not walk the Pattern resource tree either.

    PDFBox 1.8.10

    You can use PDFBox to fairly easily build a pattern image extractor, e.g. for PDFBox 1.8.10:

    public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
    {
        List<PDPage> pages = document.getDocumentCatalog().getAllPages();
        if (pages == null)
            return;
    
        for (int i = 0; i < pages.size(); i++)
        {
            String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
            extractPatternImages(pages.get(i), pageFormat);
        }
    }
    
    public void extractPatternImages(PDPage page, String pageFormat) throws IOException
    {
        PDResources resources = page.getResources();
        if (resources == null)
            return;
        Map<String, PDPatternResources> patterns = resources.getPatterns();
    
        for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
        {
            String patternFormat = String.format(pageFormat, "-" + patternEntry.getKey() + "%s", "%s");
            extractPatternImages(patternEntry.getValue(), patternFormat);
        }
    }
    
    public void extractPatternImages(PDPatternResources pattern, String patternFormat) throws IOException
    {
        COSDictionary resourcesDict = (COSDictionary) pattern.getCOSDictionary().getDictionaryObject(COSName.RESOURCES);
        if (resourcesDict == null)
            return;
        PDResources resources = new PDResources(resourcesDict);
        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null)
            return;
    
        for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
        {
            PDXObject xObject = entry.getValue();
            String xObjectFormat = String.format(patternFormat, "-" + entry.getKey() + "%s", "%s");
            if (xObject instanceof PDXObjectForm)
                extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
            else if (xObject instanceof PDXObjectImage)
                extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
        }
    }
    
    public void extractPatternImages(PDXObjectForm form, String imageFormat) throws IOException
    {
        PDResources resources = form.getResources();
        if (resources == null)
            return;
        Map<String, PDXObject> xObjects = resources.getXObjects();
        if (xObjects == null)
            return;
    
        for (Map.Entry<String, PDXObject> entry : xObjects.entrySet())
        {
            PDXObject xObject = entry.getValue();
            String xObjectFormat = String.format(imageFormat, "-" + entry.getKey() + "%s", "%s");
            if (xObject instanceof PDXObjectForm)
                extractPatternImages((PDXObjectForm)xObject, xObjectFormat);
            else if (xObject instanceof PDXObjectImage)
                extractPatternImages((PDXObjectImage)xObject, xObjectFormat);
        }
    
        Map<String, PDPatternResources> patterns = resources.getPatterns();
    
        for (Map.Entry<String, PDPatternResources> patternEntry : patterns.entrySet())
        {
            String patternFormat = String.format(imageFormat, "-" + patternEntry.getKey() + "%s", "%s");
            extractPatternImages(patternEntry.getValue(), patternFormat);
        }
    }
    
    public void extractPatternImages(PDXObjectImage image, String imageFormat) throws IOException
    {
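        // try-with-resources closes the output file even if writing fails
        try (FileOutputStream out = new FileOutputStream(String.format(imageFormat, "", image.getSuffix())))
        {
            image.write2OutputStream(out);
        }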
    }
    

    (ExtractPatternImages.java)

    I applied it to your sample PDF like this

    public void testtestDrJorge() throws IOException
    {
        try (InputStream resource = getClass().getResourceAsStream("testDrJorge.pdf"))
        {
            PDDocument document = PDDocument.load(resource);
            extractPatternImages(document, "testDrJorge%s.%s");
            document.close();
        }
    }
    

    (ExtractPatternImages.java)

    and got two images:

    • testDrJorge-0-R15-R14.png

    • testDrJorge-0-R38-R37.png

    The images have lost their red parts. This is most likely because PDFBox 1.x.x does not properly support extraction of CMYK images, cf. PDFBOX-2128 (CMYK images are not supported correctly), and your images are in CMYK.
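As a side note on the file names above: each level of the extractor calls String.format on the name template, appends its own segment, and re-emits both %s placeholders so that the next level can do the same. The following is a minimal standalone sketch of just that naming step. The class and method names (NameTemplateDemo, descend, finish, fileNameFor) are made up for illustration and are not part of PDFBox or of the extractor code.

```java
import java.util.Arrays;
import java.util.List;

class NameTemplateDemo {
    // Appends one segment (page index, pattern name, XObject name) to the
    // template and re-emits both placeholders, so the result is again a
    // template with two %s slots.
    static String descend(String template, String segment) {
        return String.format(template, "-" + segment + "%s", "%s");
    }

    // Fills the two remaining slots: nothing more for the name, then the suffix.
    static String finish(String template, String suffix) {
        return String.format(template, "", suffix);
    }

    // Walks a path of segments from the document level down to the image.
    static String fileNameFor(String template, List<String> path, String suffix) {
        String current = template;
        for (String segment : path) {
            current = descend(current, segment);
        }
        return finish(current, suffix);
    }

    public static void main(String[] args) {
        // Page 0, pattern R15, image R14, as in the first extracted file.
        System.out.println(fileNameFor("testDrJorge%s.%s",
                Arrays.asList("0", "R15", "R14"), "png"));
    }
}
```

One caveat of this approach is that a segment containing a literal % character would be interpreted as a format specifier at the next level, so resource names with unusual characters would need escaping first.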

    PDFBox 2.0.0 release candidate

    I updated the code to PDFBox 2.0.0 (currently available as release candidate only):

    public void extractPatternImages(PDDocument document, String fileNameFormat) throws IOException
    {
        PDPageTree pages = document.getDocumentCatalog().getPages();
        if (pages == null)
            return;
    
        for (int i = 0; i < pages.getCount(); i++)
        {
            String pageFormat = String.format(fileNameFormat, "-" + i + "%s", "%s");
            extractPatternImages(pages.get(i), pageFormat);
        }
    }
    
    public void extractPatternImages(PDPage page, String pageFormat) throws IOException
    {
        PDResources resources = page.getResources();
        if (resources == null)
            return;
        Iterable<COSName> patternNames = resources.getPatternNames();
    
        for (COSName patternName : patternNames)
        {
            String patternFormat = String.format(pageFormat, "-" + patternName + "%s", "%s");
            extractPatternImages(resources.getPattern(patternName), patternFormat);
        }
    }
    
    public void extractPatternImages(PDAbstractPattern pattern, String patternFormat) throws IOException
    {
        COSDictionary resourcesDict = (COSDictionary) pattern.getCOSObject().getDictionaryObject(COSName.RESOURCES);
        if (resourcesDict == null)
            return;
        PDResources resources = new PDResources(resourcesDict);
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        if (xObjectNames == null)
            return;
    
        for (COSName xObjectName : xObjectNames)
        {
            PDXObject xObject = resources.getXObject(xObjectName);
            String xObjectFormat = String.format(patternFormat, "-" + xObjectName + "%s", "%s");
            if (xObject instanceof PDFormXObject)
                extractPatternImages((PDFormXObject)xObject, xObjectFormat);
            else if (xObject instanceof PDImageXObject)
                extractPatternImages((PDImageXObject)xObject, xObjectFormat);
        }
    }
    
    public void extractPatternImages(PDFormXObject form, String imageFormat) throws IOException
    {
        PDResources resources = form.getResources();
        if (resources == null)
            return;
        Iterable<COSName> xObjectNames = resources.getXObjectNames();
        if (xObjectNames == null)
            return;
    
        for (COSName xObjectName : xObjectNames)
        {
            PDXObject xObject = resources.getXObject(xObjectName);
            String xObjectFormat = String.format(imageFormat, "-" + xObjectName + "%s", "%s");
            if (xObject instanceof PDFormXObject)
                extractPatternImages((PDFormXObject)xObject, xObjectFormat);
            else if (xObject instanceof PDImageXObject)
                extractPatternImages((PDImageXObject)xObject, xObjectFormat);
        }
    
        Iterable<COSName> patternNames = resources.getPatternNames();
    
        for (COSName patternName : patternNames)
        {
            String patternFormat = String.format(imageFormat, "-" + patternName + "%s", "%s");
            extractPatternImages(resources.getPattern(patternName), patternFormat);
        }
    }
    
    public void extractPatternImages(PDImageXObject image, String imageFormat) throws IOException
    {
        String filename = String.format(imageFormat, "", image.getSuffix());
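        // try-with-resources closes the output file even if encoding fails
        try (FileOutputStream out = new FileOutputStream(filename))
        {
            ImageIOUtil.writeImage(image.getOpaqueImage(), "png", out);
        }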
    }
    

    and get

    • testDrJorge-0-COSName{R15}-COSName{R14}.png

    • testDrJorge-0-COSName{R38}-COSName{R37}.png

    Looks like an improvement... ;)
