Using iTextSharp to extract and update links in an existing PDF

Question

I need to post several (read: a lot of) PDF files to the web, but many of them have hard-coded file:// links and links to non-public locations. I need to read through these PDFs and update the links to the proper locations. I've started writing an app using iTextSharp to read through the directories and files, find the PDFs and iterate through each page. What I need to do next is find the links and then update the incorrect ones.

string path = "c:\\html";
DirectoryInfo rootFolder = new DirectoryInfo(path);

foreach (DirectoryInfo di in rootFolder.GetDirectories())
{
    // get pdf
    foreach (FileInfo pdf in di.GetFiles("*.pdf"))
    {
        string contents = string.Empty;
        Document doc = new Document();
        PdfReader reader = new PdfReader(pdf.FullName);

        using (MemoryStream ms = new MemoryStream())
        {
            PdfWriter writer = PdfWriter.GetInstance(doc, ms);
            doc.Open();

            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                byte[] bt = reader.GetPageContent(p);

            }
        }
    }
}

Quite frankly, once I get the page content I'm rather lost when it comes to iTextSharp. I've read through the iTextSharp examples on SourceForge, but really didn't find what I was looking for.

Any help would be greatly appreciated.

Thanks.

Recommended answer

This one is a little complicated if you don't know the internals of the PDF format and iText/iTextSharp's abstraction/implementation of it. You need to understand how to use PdfDictionary objects and look things up by their PdfName key. Once you get that, you can read through the official PDF spec and poke around a document pretty easily. If you do care, I've included the relevant parts of the PDF spec in parentheses where applicable.

Anyway, a link within a PDF is stored as an annotation (PDF Ref 12.5). Annotations are page-based, so you need to first get each page's annotation array individually. There's a bunch of different possible types of annotations, so you need to check each one's SUBTYPE and see if it's set to LINK (12.5.6.5). Every link should have an ACTION dictionary associated with it (12.6.2), and you want to check the action's S key to see what type of action it is. There's a bunch of possible ones for this; links specifically could be internal links, open-file links, play-sound links or something else (12.6.4.1). You are looking only for links that are of type URI (note the letter I and not the letter L). URI actions (12.6.4.7) have a URI key that holds the actual address to navigate to. (There's also an IsMap property for image maps that I can't actually imagine anyone using.)
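To make the lookup chain described above concrete before the full sample, here is a minimal sketch that only reads the links and does not change anything. The class name PdfLinkLister and the method name ListUriLinks are made up for illustration, and it assumes an iTextSharp 5.x release where the GetAsArray/GetAsDict/GetAsName/GetAsString helpers are available; treat it as a starting point rather than a finished tool.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper (not part of iTextSharp): lists URI link targets per page.
public static class PdfLinkLister
{
    // Returns (page number, URI) pairs for every link annotation with a URI action.
    public static List<KeyValuePair<int, string>> ListUriLinks(string pdfPath)
    {
        var links = new List<KeyValuePair<int, string>>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Annotations are stored per page (12.5); a page may have none
                PdfArray annots = reader.GetPageN(page).GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue;

                for (int i = 0; i < annots.Size; i++)
                {
                    // GetAsDict resolves indirect references and returns null for non-dictionaries
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null)
                        continue;

                    // Only link annotations (12.5.6.5)
                    if (!PdfName.LINK.Equals(annot.GetAsName(PdfName.SUBTYPE)))
                        continue;

                    // Only links that carry an action dictionary (12.6.2)
                    PdfDictionary action = annot.GetAsDict(PdfName.A);
                    if (action == null)
                        continue;

                    // Only URI actions (12.6.4.7)
                    if (!PdfName.URI.Equals(action.GetAsName(PdfName.S)))
                        continue;

                    PdfString uri = action.GetAsString(PdfName.URI);
                    if (uri != null)
                        links.Add(new KeyValuePair<int, string>(page, uri.ToString()));
                }
            }
        }
        finally
        {
            // Release the file handle even if parsing fails part-way through
            reader.Close();
        }
        return links;
    }
}
```

Running this over each PDF first and printing the pairs is a cheap way to see which file:// or non-public addresses actually occur before writing any replacement rules.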

Whew. Still reading? Below is a full working VS 2010 C# WinForms app, based on my post here (http://stackoverflow.com/questions/6578316/editing-hyperlink-and-anchors-in-pdf-using-itextsharp/6599734#6599734), targeting iTextSharp 5.1.1.0. This code does two main things: 1) creates a sample PDF with a link in it pointing to Google.com and 2) replaces that link with a link to bing.com. The code should be pretty well commented, but feel free to ask any questions that you might have.

using System;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {

        //Folder that we are working in
        private static readonly string WorkingFolder = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Hyperlinked PDFs");
        //Sample PDF
        private static readonly string BaseFile = Path.Combine(WorkingFolder, "OldFile.pdf");
        //Final file
        private static readonly string OutputFile = Path.Combine(WorkingFolder, "NewFile.pdf");

        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            CreateSamplePdf();
            UpdatePdfLinks();
            this.Close();
        }

        private static void CreateSamplePdf()
        {
            //Create our output directory if it does not exist
            Directory.CreateDirectory(WorkingFolder);

            //Create our sample PDF
            using (iTextSharp.text.Document Doc = new iTextSharp.text.Document(PageSize.LETTER))
            {
                using (FileStream FS = new FileStream(BaseFile, FileMode.Create, FileAccess.Write, FileShare.Read))
                {
                    using (PdfWriter writer = PdfWriter.GetInstance(Doc, FS))
                    {
                        Doc.Open();

                        //Turn our hyperlink blue
                        iTextSharp.text.Font BlueFont = FontFactory.GetFont("Arial", 12, iTextSharp.text.Font.NORMAL, iTextSharp.text.BaseColor.BLUE);

                        Doc.Add(new Paragraph(new Chunk("Go to URL", BlueFont).SetAction(new PdfAction("http://www.google.com/", false))));

                        Doc.Close();
                    }
                }
            }
        }

        private static void UpdatePdfLinks()
        {
            //Setup some variables to be used later
            PdfReader R = default(PdfReader);
            int PageCount = 0;
            PdfDictionary PageDictionary = default(PdfDictionary);
            PdfArray Annots = default(PdfArray);

            //Open our reader
            R = new PdfReader(BaseFile);
            //Get the page count
            PageCount = R.NumberOfPages;

            //Loop through each page
            for (int i = 1; i <= PageCount; i++)
            {
                //Get the current page
                PageDictionary = R.GetPageN(i);

                //Get all of the annotations for the current page
                Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);

                //Make sure we have something
                if ((Annots == null) || (Annots.Size == 0))
                    continue;

                //Loop through each annotation

                foreach (PdfObject A in Annots.ArrayList)
                {
                    //Resolve the (possibly indirect) array entry to the annotation's dictionary
                    PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);

                    //Make sure this annotation has a link
                    if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
                        continue;

                    //Get the ACTION for the current annotation (GetAsDict also resolves indirect references)
                    PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);

                    //Make sure this annotation has an ACTION
                    if (AnnotationAction == null)
                        continue;

                    //Test if it is a URI action
                    if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
                    {
                        //Change the URI to something else
                        AnnotationAction.Put(PdfName.URI, new PdfString("http://www.bing.com/"));
                    }
                }
            }

            //Next we create a new document and import each page from the reader above
            using (FileStream FS = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                using (Document Doc = new Document())
                {
                    using (PdfCopy writer = new PdfCopy(Doc, FS))
                    {
                        Doc.Open();
                        for (int i = 1; i <= R.NumberOfPages; i++)
                        {
                            writer.AddPage(writer.GetImportedPage(R, i));
                        }
                        Doc.Close();
                    }
                }
            }

            //Release the reader now that the pages have been copied
            R.Close();
        }
    }
}
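
The sample above replaces every URI with one fixed address. For the situation in the question, the decision about which links to touch could be pulled out into a small function. This is only a sketch: the class name LinkRewriter, the method Rewrite and both prefix values are hypothetical placeholders, and real documents may need more rules than a single prefix swap.

```csharp
using System;

// Hypothetical helper: decides whether a link needs fixing and what it should become.
public static class LinkRewriter
{
    // Assumed values for illustration only; substitute the real local root and public base URL.
    private const string LocalPrefix = "file:///c:/html/";
    private const string PublicPrefix = "http://www.example.com/docs/";

    // Returns the corrected URI, or null when the link should be left alone.
    public static string Rewrite(string uri)
    {
        if (string.IsNullOrEmpty(uri))
            return null;

        // Swap the local prefix for the public one, keeping the rest of the path
        if (uri.StartsWith(LocalPrefix, StringComparison.OrdinalIgnoreCase))
            return PublicPrefix + uri.Substring(LocalPrefix.Length);

        // Anything else is treated as already correct
        return null;
    }
}
```

Inside the URI branch of UpdatePdfLinks, the current address can be read with AnnotationAction.GetAsString(PdfName.URI), passed through Rewrite, and written back with AnnotationAction.Put(PdfName.URI, new PdfString(...)) only when the result is not null. Keeping the rule separate from the PDF plumbing makes it easy to check against a list of sample links without opening any files.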

Edit

I should note, this only changes the actual link. Any text within the document won't get updated. Annotations are drawn on top of text but aren't really tied to the text underneath in any way. That's another topic completely.
