Unable to find location of ColorSpace objects in PDF document

Problem description

I want to identify the ColorSpace objects in a PDF and fetch their location (coordinates, width and height of the colorspace) in the page. I tried traversing the BaseDataObject in Contents.ContentContext.Resources.ColorSpaces. I can identify the Pantone color spaces in the file (as shown in the screenshot), but I am unable to find information regarding the location (x, y, w and h) of the object.

Where can I find the exact location of the visible objects (visible on opening a document), like ColorSpaces and embedded images?

I am using the 'pdfclown' library to extract the information about ColorSpaces from the PDF. Any input will be helpful. Thanks in advance.

// Needs "using System.Linq;" for ToList().
ContentScanner cs = new ContentScanner(page);
System.Collections.Generic.List<org.pdfclown.documents.contents.colorSpaces.ColorSpace> list = cs.Contents.ContentContext.Resources.ColorSpaces.Values.ToList();
for (int i = 0; i < list.Count; i++)
{
    org.pdfclown.objects.PdfArray array = (org.pdfclown.objects.PdfArray)list[i].BaseDataObject;
    foreach (org.pdfclown.objects.PdfObject s in array)
    {
        // print colorspace and its x,y,w,h
    }
}

PDF Document (has CMYK and Pantone Colors)

Screenshot

Solution

I want to identify the ColorSpace objects in a PDF and fetch their location (coordinates, width and height of the colorspace) in the page.

I assume you mean the squares here:

Beware, these are not PDF ColorSpace objects; they are a number of simple (rectangular) paths filled with distinct colors, with some text drawn upon them.

PDF ColorSpaces are not specific renderings of colored areas; they are abstract color specifications:

Colours may be described in any of a variety of colour systems, or colour spaces. Some colour spaces are related to device colour representation (grayscale, RGB, CMYK), others to human visual perception (CIE-based). Certain special features are also modelled as colour spaces: patterns, colour mapping, separations, and high-fidelity and multitone colour.

(ISO 32000-1, section 8.6 "Colour Spaces")

Since you are looking for something with coordinates, width and height, you are actually looking for drawing instructions that use those abstract color spaces, not for the color spaces themselves.
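To make the distinction concrete, here is a minimal, self-contained sketch that does not use PDF Clown at all. It assumes a hand-written content stream fragment (the resource name /CS0 and all numbers are invented for illustration) and a naive whitespace tokenizer. Real content streams contain strings, arrays, inline images and comments that this tokenizer would not handle, so treat it purely as an illustration of where the geometry lives: in the `re` and `f` instructions, not in the color space resource.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class ContentStreamSketch
{
    // Hypothetical fragment: select fill color space /CS0, set the tint to 1,
    // define a rectangle (x y w h), then fill it.
    public const string Sample = "/CS0 cs 1 scn 36 700 100 50 re f";

    // Returns one description per filled rectangle found in the fragment.
    public static List<string> FindFilledRectangles(string content)
    {
        var results = new List<string>();
        var operands = new List<string>(); // operands precede their operator
        string fillSpace = null;           // name set by the last "cs"
        double[] rect = null;              // x, y, w, h from the last "re"
        char[] separators = { ' ', '\n', '\r', '\t' };

        foreach (string token in content.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            switch (token)
            {
                case "cs": // set the fill color space by resource name
                    fillSpace = operands[operands.Count - 1];
                    operands.Clear();
                    break;
                case "scn": // set the fill color; its operands are not needed here
                    operands.Clear();
                    break;
                case "re": // append a rectangle to the current path
                    rect = new double[4];
                    for (int i = 0; i < 4; i++)
                        rect[i] = double.Parse(operands[operands.Count - 4 + i], CultureInfo.InvariantCulture);
                    operands.Clear();
                    break;
                case "f": // fill the current path, which ends it
                    if (rect != null && fillSpace != null)
                        results.Add(string.Format(CultureInfo.InvariantCulture,
                            "{0} fills {1},{2} size {3}x{4}",
                            fillSpace, rect[0], rect[1], rect[2], rect[3]));
                    rect = null;
                    operands.Clear();
                    break;
                default: // simplification: everything else is treated as an operand
                    operands.Add(token);
                    break;
            }
        }
        return results;
    }

    static void Main()
    {
        foreach (string line in FindFilledRectangles(Sample))
            Console.WriteLine(line); // prints "/CS0 fills 36,700 size 100x50"
    }
}
```

The color space resource only tells you what /CS0 means; the position and size come from the operands of the path construction instruction that happens to be painted while /CS0 is the current fill color space.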

I tried traversing the BaseDataObject in Contents.ContentContext.Resources.ColorSpaces. I can identify the Pantone color spaces in the file (as shown in the screenshot), but I am unable to find information regarding the location (x, y, w and h) of the object.

By looking at cs.Contents.ContentContext.Resources.ColorSpaces you get an enumeration of all special color spaces available for use in the current context, but not their actual usages. To get the actual usages, you have to traverse the ContentScanner cs, i.e. you have to inspect the instructions in the current context, e.g. like this:

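// Separation color space currently selected for filling (null if none).
SeparationColorSpace space = null;
// Rectangle most recently added to the current path while a Separation fill was active.
double X = 0, Y = 0, Width = 0, Height = 0;

void ScanForSpecialColorspaceUsage(ContentScanner cs)
{
    cs.MoveFirst();
    while (cs.MoveNext())
    {
        ContentObject content = cs.Current;
        if (content is CompositeObject)
        {
            // Descend into nested content (e.g. text or marked-content blocks).
            ScanForSpecialColorspaceUsage(cs.ChildLevel);
        }
        else if (content is SetFillColorSpace _cs)
        {
            // Look up the named color space in the resources; keep it only if it is a Separation.
            ColorSpace _space = cs.Contents.ContentContext.Resources.ColorSpaces[_cs.Name];
            space = _space as SeparationColorSpace;
        }
        else if (content is SetDeviceCMYKFillColor || content is SetDeviceGrayFillColor || content is SetDeviceRGBFillColor)
        {
            // Setting a device fill color implicitly switches away from the special color space.
            space = null;
        }
        else if (content is DrawRectangle _dr)
        {
            if (space != null)
            {
                X = _dr.X;
                Y = _dr.Y;
                Width = _dr.Width;
                Height = _dr.Height;
            }
        }
        else if (content is PaintPath _pp)
        {
            // Report the remembered rectangle if it is filled using the Separation color space.
            if (space != null && _pp.Filled && (X != 0 || Y != 0 || Width != 0 || Height != 0))
            {
                String name = ((PdfName)((PdfArray)space.BaseDataObject)[1]).ToString();
                Console.WriteLine("Filling rectangle at {0}, {1} with size {2}x{3} using {4}", X, Y, Width, Height, name);
            }
            // Painting ends the current path, so forget the rectangle.
            X = 0;
            Y = 0;
            Width = 0;
            Height = 0;
        }
    }
}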

BEWARE: This is merely a proof of concept, simplified as much as possible while still working in your PDF for the squares in the screenshot above.

For a general solution you will have to extend this considerably:

  • The code only inspects the given content scanner, i.e. only the content stream it has been initialized for, in your case a page content stream.

    Such a content stream may reference other content streams, e.g. a form XObject. To catch all the usages of interesting color spaces in a generic document, you have to recursively inspect such dependent content streams, too.

  • The code ignores the current transformation matrix.

    The current transformation matrix can be changed by an instruction, so that everything drawn by the following instructions has its coordinates changed according to an affine transformation. To get all coordinates and dimensions right in a generic document, you have to apply the current transformation matrix to them.

  • The code ignores save-graphics-state/restore-graphics-state instructions.

    The current graphics state (including the fill color and the current transformation matrix) can be stored on a stack and restored from it. To get colors, coordinates and dimensions right in a generic document, you have to keep track of saved and restored graphics states (or use the data in cs.State for color and transformation, where PDF Clown does this for you).

  • The code only looks at Separation color spaces.

    If you're interested in other color spaces, too, you have to generalize this.

  • The code only understands very specific, trivial paths: only paths generated by a single instruction defining a rectangle.

    For a generic solution you have to support arbitrary paths.
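As an illustration of the transformation-matrix and save/restore points above, here is a minimal, self-contained sketch of the bookkeeping involved. It is independent of PDF Clown: the types Matrix, GraphicsState and StateTracker are hypothetical names invented for this example, and only the matrix convention (the six operands of the cm instruction, new matrix multiplied in front of the current one) and the DeviceGray initial fill color space are taken from the PDF specification.

```csharp
using System;
using System.Collections.Generic;

// Affine matrix [a b c d e f], the operand layout of the PDF "cm" instruction.
struct Matrix
{
    public double A, B, C, D, E, F;

    public Matrix(double a, double b, double c, double d, double e, double f)
    { A = a; B = b; C = c; D = d; E = e; F = f; }

    public static readonly Matrix Identity = new Matrix(1, 0, 0, 1, 0, 0);

    // Returns m x this, i.e. m is applied first and the existing matrix second.
    public Matrix Prepend(Matrix m)
    {
        return new Matrix(
            m.A * A + m.B * C, m.A * B + m.B * D,
            m.C * A + m.D * C, m.C * B + m.D * D,
            m.E * A + m.F * C + E, m.E * B + m.F * D + F);
    }

    // Maps a point from the current user space to default (page) space.
    public void Transform(double x, double y, out double tx, out double ty)
    {
        tx = A * x + C * y + E;
        ty = B * x + D * y + F;
    }
}

class GraphicsState
{
    public Matrix Ctm = Matrix.Identity;
    public string FillColorSpace = "DeviceGray"; // initial value per the PDF specification
    public GraphicsState Clone() { return (GraphicsState)MemberwiseClone(); }
}

class StateTracker
{
    private readonly Stack<GraphicsState> saved = new Stack<GraphicsState>();
    public GraphicsState Current = new GraphicsState();

    public void Save() { saved.Push(Current.Clone()); }                     // "q"
    public void Restore() { if (saved.Count > 0) Current = saved.Pop(); }   // "Q"
    public void Concat(Matrix m) { Current.Ctm = Current.Ctm.Prepend(m); }  // "cm"

    // Bounding box { x, y, width, height } in page space of a rectangle given in user space.
    public double[] RectangleBounds(double x, double y, double w, double h)
    {
        double[] xs = new double[4];
        double[] ys = new double[4];
        Current.Ctm.Transform(x, y, out xs[0], out ys[0]);
        Current.Ctm.Transform(x + w, y, out xs[1], out ys[1]);
        Current.Ctm.Transform(x, y + h, out xs[2], out ys[2]);
        Current.Ctm.Transform(x + w, y + h, out xs[3], out ys[3]);
        double minX = Math.Min(Math.Min(xs[0], xs[1]), Math.Min(xs[2], xs[3]));
        double maxX = Math.Max(Math.Max(xs[0], xs[1]), Math.Max(xs[2], xs[3]));
        double minY = Math.Min(Math.Min(ys[0], ys[1]), Math.Min(ys[2], ys[3]));
        double maxY = Math.Max(Math.Max(ys[0], ys[1]), Math.Max(ys[2], ys[3]));
        return new[] { minX, minY, maxX - minX, maxY - minY };
    }
}

static class GraphicsStateSketch
{
    static void Main()
    {
        var tracker = new StateTracker();
        tracker.Save();                                // q
        tracker.Concat(new Matrix(2, 0, 0, 2, 10, 20)); // 2 0 0 2 10 20 cm
        Console.WriteLine(string.Join(" ", tracker.RectangleBounds(0, 0, 50, 25))); // 10 20 100 50
        tracker.Restore();                             // Q
        Console.WriteLine(string.Join(" ", tracker.RectangleBounds(0, 0, 50, 25))); // 0 0 50 25
    }
}
```

With rotation or skew in the matrix, a rectangle is no longer axis-aligned on the page, which is why the sketch returns a bounding box rather than pretending the result is still a rectangle. In actual PDF Clown code you would more likely read the equivalent data from cs.State, as mentioned above, instead of tracking it yourself.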
