How to convert HTML to PDF using iTextSharp

Question

I want to convert the HTML below to PDF using iTextSharp but don't know where to start:

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

Solution

First, HTML and PDF are not related although they were created around the same time. HTML is intended to convey higher level information such as paragraphs and tables. Although there are methods to control it, it is ultimately up to the browser to draw these higher level concepts. PDF is intended to convey documents and the documents must "look" the same wherever they are rendered.

In an HTML document you might have a paragraph that's 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might take 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.

Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table" but instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated so I know it will appear to be around the text".
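To make the "exact X,Y location" idea concrete, here is a minimal sketch that uses iTextSharp's low-level PdfContentByte instead of any HTML parsing. The file name, coordinates, font and padding are arbitrary values chosen for illustration, and the "cell" is nothing more than a rectangle stroked around text whose width we measured ourselves.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

class AbsolutePositionSketch
{
    static void Main()
    {
        // "absolute.pdf" is an arbitrary output name for this sketch
        using (var fs = new FileStream("absolute.pdf", FileMode.Create))
        using (var doc = new Document(PageSize.LETTER))
        using (var writer = PdfWriter.GetInstance(doc, fs))
        {
            doc.Open();

            // DirectContent exposes the raw drawing operations of the page
            PdfContentByte cb = writer.DirectContent;

            // One of the standard PDF fonts, not embedded in the file
            BaseFont bf = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);

            const string text = "Hello, PDF";
            const float fontSize = 12f;
            const float x = 72f;   // 72 points = 1 inch from the left edge
            const float y = 720f;  // measured upwards from the bottom edge

            // We, not the viewer, calculate how wide the text will be
            float width = bf.GetWidthPoint(text, fontSize);

            // "Draw this text at this exact X,Y location using this exact font"
            cb.BeginText();
            cb.SetFontAndSize(bf, fontSize);
            cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, text, x, y, 0f);
            cb.EndText();

            // "Then draw a rectangle so it appears to be around the text"
            const float pad = 4f;
            cb.Rectangle(x - pad, y - pad, width + 2 * pad, fontSize + 2 * pad);
            cb.Stroke();

            doc.Close();
        }
    }
}
```

Nothing in the resulting file records that the rectangle "belongs" to the text; the relationship exists only because both were drawn at coordinates calculated to line up.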

Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc. are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML out of your framework of choice; iText won't help you. If you get an exception saying "The document has no pages", or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.
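One cheap way to catch the "you only think you have HTML" case is to inspect the string before handing it to any parser. The sketch below uses a hypothetical helper, LooksLikeHtml, which is not part of iTextSharp; it is a crude heuristic (non-empty and containing something tag-like), not a validator.

```csharp
using System;
using System.Text.RegularExpressions;

static class HtmlGuard
{
    // Hypothetical helper, not an iTextSharp API: returns true only when the
    // string is non-blank and contains at least one thing that looks like a tag.
    public static bool LooksLikeHtml(string candidate)
    {
        if (string.IsNullOrWhiteSpace(candidate))
        {
            return false;
        }
        return Regex.IsMatch(candidate, "<[A-Za-z]");
    }

    static void Main()
    {
        // What a framework might hand back when a view rendered nothing
        Console.WriteLine(LooksLikeHtml(""));                   // False
        // A virtual path or control name is not markup either
        Console.WriteLine(LooksLikeHtml("~/Views/Report"));     // False
        // Actual markup passes the check
        Console.WriteLine(LooksLikeHtml("<p>Hello</p>"));       // True
    }
}
```

Logging or dumping the string to a file at the point where this check fails usually shows quickly that the problem is upstream of iText.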

Third, the built-in class that's been around for years is HTMLWorker; however, it has been replaced by XMLWorker (available for both Java and .Net). Zero work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in HtmlTags.cs (http://sourceforge.net/p/itextsharp/code/HEAD/tree/trunk/src/core/iTextSharp/text/html/HtmlTags.cs) then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated sometimes, but those complications also make it more extensible (http://stackoverflow.com/a/24512415/231316).

Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert this. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, the class="headline" gets ignored, but everything else should actually work. Example #2 is the same as the first except it uses XMLWorker instead. Example #3 also parses the simple CSS example.

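//Namespaces this snippet relies on (put these at the top of your file):
//  using System; using System.IO; using iTextSharp.text; using iTextSharp.text.pdf;
//XMLWorker (Examples #2 and #3) is a separate assembly from the core iTextSharp library

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;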

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);
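
For the ASP.Net case mentioned in the comments above, sending the bytes to the browser might look roughly like the sketch below. It assumes classic ASP.NET (System.Web); SendPdf and its parameters are hypothetical names used only for illustration, and forcing a download via Content-Disposition is one choice among several.

```csharp
using System.Web;

static class PdfResponseSketch
{
    // Hypothetical helper: writes an in-memory PDF to the current response.
    public static void SendPdf(HttpResponse response, byte[] bytes, string fileName)
    {
        // Discard anything already buffered so only the PDF is sent
        response.Clear();

        // Tell the browser what it is receiving
        response.ContentType = "application/pdf";

        // "attachment" asks the browser to download rather than display inline
        response.AddHeader("Content-Disposition", "attachment; filename=" + fileName);

        // Send the raw bytes produced by the MemoryStream above
        response.BinaryWrite(bytes);

        // Stop further page processing so no HTML gets appended to the PDF
        response.End();
    }
}
```

From a page or handler this would be called as PdfResponseSketch.SendPdf(Response, bytes, "test.pdf").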
