iTextSharp: Convert PdfObject to PdfStream


Problem description

I am attempting to pull some font streams out of a pdf file (legality is not an issue, as my company has paid for the rights to display these documents in their original manner - and this requires a conversion which requires the extraction of the fonts).

Until now I had been using MUTool, but it also extracts every image in the PDF with no way to skip them, and some of these files contain tens of thousands of images. So I took to the web for answers and arrived at the following approach:

I collect all of the fonts into a font dictionary and then attempt to convert them into PdfStreams (to apply FlateDecode and then write them to files) using the following code:

    PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject((PdfObject)cItem.pObj);
    PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
    try
    {
        int xrefIdx = ((PRIndirectReference)((PdfObject)cItem.pObj)).Number;
        PdfObject pdfObj = (PdfObject)reader.GetPdfObject(xrefIdx);
        PdfStream str = (PdfStream)(pdfObj);

        byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
    }
    catch { }

But, when I get to PdfStream str = (PdfStream)(pdfObj); I get the error below:

    Unable to cast object of type 'iTextSharp.text.pdf.PdfDictionary' 
    to type 'iTextSharp.text.pdf.PdfStream'.

Now, I know that PdfDictionary derives from (extends) PdfObject so I am uncertain as to what I am doing incorrectly here. Someone please help - I either need advice on patching this code, or if entirely incorrect, either code to extract the stream properly or direction to a place with said code.

Thank you.

EDIT: My revised code is here:

    public static void GetStreams(PdfReader pdf)
    {
        int page_count = pdf.NumberOfPages;
        for (int i = 1; i <= page_count; i++)
        {
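            PdfDictionary pg = pdf.GetPageN(i);
            // 'res' was not declared in the posted code; presumably it is the page's resource dictionary.
            PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
            PdfDictionary fObj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.FONT));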
            if (fObj != null)
            {
                foreach (PdfName name in fObj.Keys)
                {
                    PdfObject obj = fObj.Get(name);
                    if (obj.IsIndirect())
                    {
                        PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                        PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                        int xrefIdx = ((PRIndirectReference)obj).Number;
                        PdfObject pdfObj = pdf.GetPdfObject(xrefIdx);
                        if (pdfObj == null && pdfObj.IsStream())
                        {
                            PdfStream str = (PdfStream)(pdfObj);
                            byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);
                        }
                    }
                }
            }
        }
    }

However, I am still receiving the same error - so I am assuming that this is an incorrect method of retrieving font streams. The same document has had fonts extracted using muTool successfully - so I know the problem is me and not the pdf.

Solution

There are at least two things wrong in your code:

  1. You cast an object to a stream without first performing a check such as: if (pdfObj != null && pdfObj.IsStream()) { // cast to stream } (the null test must be != null, not == null). As the error message says you're trying to cast a dictionary to a stream, I'm 99% sure that IsStream() will return false, whereas pdfObj.IsDictionary() probably returns true.
  2. You extract an object from PdfReader and try to cast it to a PdfStream instead of to a PRStream. PdfStream is the object we use to create PDFs; PRStream is the object used when we inspect PDFs using PdfReader.
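The two points above can be sketched as a small helper. This is only an illustration, assuming iTextSharp 5.x; the class and method names (StreamCheck, TryGetStreamBytes) are hypothetical and not part of the library. It resolves the reference, checks the object type before casting, and casts to PRStream rather than PdfStream.

```csharp
using iTextSharp.text.pdf;

static class StreamCheck
{
    // Hypothetical helper: returns the decoded stream bytes,
    // or null when the object is not a stream (e.g. a font dictionary).
    public static byte[] TryGetStreamBytes(PdfObject obj)
    {
        // Resolve an indirect reference to the object it points to.
        PdfObject direct = PdfReader.GetPdfObject(obj);

        // Check the type before casting; a font entry is normally a
        // dictionary, so IsStream() is false for it.
        if (direct != null && direct.IsStream())
        {
            // Streams read through PdfReader are PRStream instances.
            PRStream stream = (PRStream)direct;

            // GetStreamBytes applies the stream's filters (e.g. FlateDecode);
            // GetStreamBytesRaw would return the still-encoded bytes.
            return PdfReader.GetStreamBytes(stream);
        }
        return null;
    }
}
```

Called on one of the entries of the page's font dictionary, this helper would be expected to return null rather than throw, which is consistent with the cast error in the question: the entry is a dictionary, not a stream.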

You should fix this problem first.

Now for your general question. If you read ISO-32000-1, you'll discover that a font is defined using a font dictionary. If the font is embedded (fully or partly), the font dictionary will refer to a stream. This stream can contain the full font information, but most of the time you'll only get a subset of the glyphs (because that's best practice when creating a PDF).
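As a rough sketch of that structure, assuming iTextSharp 5.x: the font dictionary points to a font descriptor, and the descriptor holds the embedded font program under FontFile, FontFile2 or FontFile3. The class and method names below (FontStreamSketch, ExtractEmbeddedFonts, CollectFont) are hypothetical. The sketch only looks at fonts listed in page resources, not at fonts used inside form XObjects, and it does not turn the extracted bytes into a usable font file.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

static class FontStreamSketch
{
    // FontFile = Type 1, FontFile2 = TrueType,
    // FontFile3 = format given by the stream's own Subtype entry.
    static readonly PdfName[] FontFileKeys =
        { PdfName.FONTFILE, PdfName.FONTFILE2, PdfName.FONTFILE3 };

    // Maps each BaseFont name to the decoded bytes of its embedded font program.
    public static Dictionary<string, byte[]> ExtractEmbeddedFonts(PdfReader reader)
    {
        var result = new Dictionary<string, byte[]>();
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            PdfDictionary resources = reader.GetPageN(i).GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null) continue;

            foreach (PdfName key in fonts.Keys)
            {
                // GetAsDict resolves the indirect reference and returns
                // null if the entry is not a dictionary.
                PdfDictionary font = fonts.GetAsDict(key);
                if (font != null) CollectFont(font, result);
            }
        }
        return result;
    }

    static void CollectFont(PdfDictionary font, Dictionary<string, byte[]> result)
    {
        // Composite (Type0) fonts keep the descriptor on their descendant font.
        PdfArray descendants = font.GetAsArray(PdfName.DESCENDANTFONTS);
        if (descendants != null && descendants.Size > 0)
        {
            font = descendants.GetAsDict(0) ?? font;
        }

        PdfDictionary descriptor = font.GetAsDict(PdfName.FONTDESCRIPTOR);
        if (descriptor == null) return; // no descriptor: nothing embedded

        PdfName baseFont = font.GetAsName(PdfName.BASEFONT);
        string name = baseFont == null ? "(unnamed)" : baseFont.ToString();

        foreach (PdfName fileKey in FontFileKeys)
        {
            // The font program is a stream; cast to PRStream, not PdfStream.
            PRStream stream = descriptor.GetAsStream(fileKey) as PRStream;
            if (stream != null)
            {
                result[name] = PdfReader.GetStreamBytes(stream);
            }
        }
    }
}
```

A name such as /ABCDEF+SomeFont in the result indicates a subset, so the bytes would contain only the glyphs that the document actually uses.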

Take a look at the example ListFontFiles from my book "iText in Action" (http://itextpdf.com/book) to get a first impression of how fonts are organized inside a PDF. You'll need to combine this example with ISO-32000-1 to find more info about the difference between FONTFILE, FONTFILE2 and FONTFILE3 (the FontFile, FontFile2 and FontFile3 keys of the font descriptor).

I've also written an example that replaces an unembedded font with a font file: EmbedFontPostFacto. This example serves as an introduction to explain how difficult font replacement is.

Please go to http://tinyurl.com/iiacsCH16 if you need the C# version of the book samples.
