How to decode a PdfImageObject with filter "[/FlateDecode, /RunLengthDecode]"

Problem description

I have been successfully extracting images from PDFs for a few years. I use iTextSharp to do this. I get a PdfImageObject and read its filter. Usually this filter is "/FlateDecode". In that case, I use pdf.PdfReader.FlateDecode(bytes, True) to decode the raw bytes.

But recently I have come across PDFs containing PdfImageObjects with the filter "[/FlateDecode, /RunLengthDecode]".

So I guess the raw bytes must be decoded twice?

I found some code on the internet for the /RunLengthDecode part: https://github.com/kusl/itextsharp/blob/master/tags/iTextSharp_5_4_5/src/core/iTextSharp/text/pdf/FilterHandlers.cs

I tried applying both decoders to the image, in both orders: first /FlateDecode and then /RunLengthDecode, and then /RunLengthDecode followed by /FlateDecode.

But the /RunLengthDecode code gives me an error in both scenarios.

Solution

This is not actually an answer to the question as asked, but an analysis of the problem that led to the question.

In the comments on the question, it turned out that a bug in iText is the reason the OP is trying to filter raw streams and extract images manually: certain images were extracted with small errors. The OP identified the problematic images as those with the filters [/FlateDecode, /RunLengthDecode].
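As for the order of decoding: the PDF specification lists multiple filters in the order in which they are to be applied when decoding, so for [/FlateDecode, /RunLengthDecode] the raw bytes are inflated first and the result is then run-length decoded. A minimal Java sketch of that chain, using only the JDK, might look as follows. The names `FilterChainSketch`, `flateDecode`, `runLengthDecode` and `decodeImageStream` are hypothetical, the sketch assumes no /DecodeParms (predictor) is set, and the run-length step is written from the filter's definition rather than taken from iText.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of iText.
final class FilterChainSketch {

    // Step 1: /FlateDecode (zlib-wrapped deflate) via the JDK's Inflater.
    static byte[] flateDecode(byte[] in) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("invalid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Step 2: /RunLengthDecode; literal runs start AFTER the length byte.
    static byte[] runLengthDecode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int length = in[i++] & 0xFF;
            if (length == 128) break;                  // end of data
            if (length < 128) {                        // literal run of length+1 bytes
                int count = Math.min(length + 1, in.length - i);
                out.write(in, i, count);
                i += count;
            } else if (i < in.length) {                // next byte repeated 257-length times
                int value = in[i++];
                for (int j = 0; j < 257 - length; j++) out.write(value);
            }
        }
        return out.toByteArray();
    }

    // Filters are applied in array order: Flate first, then run-length.
    static byte[] decodeImageStream(byte[] raw) {
        return runLengthDecode(flateDecode(raw));
    }
}
```

The result is the decoded sample data only; turning it into a viewable image still depends on the image dictionary (width, height, color space, bits per component).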

The bug

The bug in question is indeed in iText's implementation of the RunLengthDecode filter, shown here from iText for .Net 5.5.x:

private class Filter_RUNLENGTHDECODE : IFilterHandler {

    public byte[] Decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
     // allocate the output buffer
        MemoryStream baos = new MemoryStream();
        sbyte dupCount = -1;
        for (int i = 0; i < b.Length; i++){
            dupCount = (sbyte)b[i];
            if (dupCount == -128) break; // this is implicit end of data

            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.Write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1-(int)(dupCount);j++){ 
                    baos.WriteByte(b[i]);
                }
            }
        }
        return baos.ToArray();
    }
}

More precisely, it is this line:

                baos.Write(b, i, bytesToCopy);

It should copy the bytesToCopy bytes following index i (index i holds the length byte, after all), but this call copies bytesToCopy bytes starting at index i. Thus, for every literal run (bytes to be copied once), iText first copies the length byte and then all but the final byte of the run.

Instead, the line should be:

                baos.Write(b, i+1, bytesToCopy);
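For reference, a corrected decoder can be sketched as a standalone Java method. This is a minimal sketch and not iText's code: the class name `RunLengthDecodeSketch` and its `decode` method are made up for illustration, and stopping quietly on truncated input is an assumption about how malformed streams should be handled.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of iText: decodes PDF RunLengthDecode data.
final class RunLengthDecodeSketch {

    static byte[] decode(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int length = in[i++] & 0xFF;   // length byte, read as unsigned 0..255
            if (length == 128) {
                break;                     // end-of-data marker
            }
            if (length < 128) {
                // Literal run: copy the NEXT length+1 bytes; i already points past the length byte.
                int count = Math.min(length + 1, in.length - i); // assumption: tolerate truncation
                out.write(in, i, count);
                i += count;
            } else {
                // Repeat run: the next byte is written 257-length times (2 to 128).
                if (i >= in.length) {
                    break;                 // assumption: tolerate truncation
                }
                int value = in[i++];
                for (int j = 0; j < 257 - length; j++) {
                    out.write(value);
                }
            }
        }
        return out.toByteArray();
    }
}
```

Reading the length byte as an unsigned value avoids the signed-byte arithmetic of the original; `257 - length` yields the same repeat count as `1 - dupCount` does there.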

Example effect on bitmap images

Because runs of duplicate bytes are extracted correctly, and even long non-duplicate runs contain many correct bytes (at off-by-one positions), the images iText extracts look only slightly wrong, with small errors, e.g.:

Damaged image: [image not included]

Undamaged image: [image not included]
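The same effect can be reproduced on a handful of bytes. The sketch below ports the loop to Java and makes the write offset a parameter, so that offset 0 reproduces the bug and offset 1 the fix. The class `RunLengthBugDemo`, the `literalOffset` parameter and the sample bytes are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical demo class, not part of iText.
final class RunLengthBugDemo {

    // Same loop as the iText code; literalOffset = 0 is the buggy variant, 1 the fixed one.
    static byte[] decode(byte[] b, int literalOffset) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (int i = 0; i < b.length; i++) {
            byte dupCount = b[i];
            if (dupCount == -128) break;           // end of data
            if (dupCount >= 0) {
                int bytesToCopy = dupCount + 1;
                baos.write(b, i + literalOffset, bytesToCopy);
                i += bytesToCopy;
            } else {
                i++;                               // move to the byte to repeat
                for (int j = 0; j < 1 - dupCount; j++) baos.write(b[i]);
            }
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) {
        // Literal run of 4 bytes (10, 20, 30, 40), then byte 7 repeated twice, then end of data.
        byte[] encoded = {3, 10, 20, 30, 40, (byte) 0xFF, 7, (byte) 0x80};
        System.out.println(Arrays.toString(decode(encoded, 0))); // prints [3, 10, 20, 30, 7, 7]
        System.out.println(Arrays.toString(decode(encoded, 1))); // prints [10, 20, 30, 40, 7, 7]
    }
}
```

In the buggy output the length byte 3 leaks into the data and the last literal byte 40 is lost, while the repeated run is unaffected. The output length stays the same, which is why the extracted images are only slightly distorted.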

Pervasiveness of the bug

This bug has been in iText 5.x for .Net for many years. Furthermore, it has also been present in iText 5.x for Java for many years and still is, e.g. here from the current 5.5.13-SNAPSHOT:

private static class Filter_RUNLENGTHDECODE implements FilterHandler{

    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) throws IOException {
     // allocate the output buffer
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount = -1;
        for(int i = 0; i < b.length; i++){
            dupCount = b[i];
            if (dupCount == -128) break; // this is implicit end of data

            if (dupCount >= 0 && dupCount <= 127){
                int bytesToCopy = dupCount+1;
                baos.write(b, i, bytesToCopy);
                i+=bytesToCopy;
            } else {
                // make dupcount copies of the next byte
                i++;
                for(int j = 0; j < 1-(int)(dupCount);j++){ 
                    baos.write(b[i]);
                }
            }
        }

        return baos.toByteArray();
    }
}

and in iText 7, e.g. here from the current 7.1.2-SNAPSHOT for Java:

public class RunLengthDecodeFilter implements IFilterHandler {

    @Override
    public byte[] decode(byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte dupCount;
        for (int i = 0; i < b.length; i++) {
            dupCount = b[i];
            if (dupCount == (byte) 0x80) { // this is implicit end of data
                break;
            }
            if (dupCount >= 0) {
                int bytesToCopy = dupCount + 1;
                baos.write(b, i, bytesToCopy);
                i += bytesToCopy;
            } else {                // make dupcount copies of the next byte
                i++;
                for (int j = 0; j < 1 - (int) (dupCount); j++) {
                    baos.write(b[i]);
                }
            }
        }
        return baos.toByteArray();
    }
}

Most likely this bug could survive that long because the RunLengthDecode filter has hardly been used for a number of years.
