How to convert HTML to PDF using iTextSharp


Question

I want to convert the below HTML to PDF using iTextSharp but don't know where to start:

<style>
.headline{font-size:200%}
</style>
<p>
  This <em>is </em>
  <span class="headline" style="text-decoration: underline;">some</span>
  <strong>sample<em> text</em></strong>
  <span style="color: red;">!!!</span>
</p>

Recommended answer

First, HTML and PDF are not related, although they were created around the same time. HTML is intended to convey higher-level information such as paragraphs and tables. Although there are ways to control it, it is ultimately up to the browser to draw these higher-level concepts. PDF is intended to convey documents, and those documents must "look" the same wherever they are rendered.

In an HTML document you might have a paragraph that's 100% wide. Depending on the width of your monitor it might take 2 lines or 10 lines, when you print it it might take 7 lines, and when you look at it on your phone it might take 20 lines. A PDF file, however, must be independent of the rendering device, so regardless of your screen size it must always render exactly the same.

Because of the musts above, PDF doesn't support abstract things like "tables" or "paragraphs". There are three basic things that PDF supports: text, lines/shapes and images. (There are other things like annotations and movies, but I'm trying to keep it simple here.) In a PDF you don't say "here's a paragraph, browser do your thing!". Instead you say, "draw this text at this exact X,Y location using this exact font, and don't worry, I've previously calculated the width of the text so I know it will all fit on this line". You also don't say "here's a table"; instead you say "draw this text at this exact location and then draw a rectangle at this other exact location that I've previously calculated, so I know it will appear to be around the text".
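To make the "exact X,Y location" idea concrete, here is a minimal sketch that bypasses HTML entirely and draws directly onto the page with iTextSharp 5's PdfContentByte. The class name AbsolutePositionSketch, the method DrawBoxedText, the 12 pt Helvetica font and the padding values are all assumptions chosen for illustration; they are not part of the original answer.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class AbsolutePositionSketch
{
    // Draws one line of text at (x, y) and strokes a rectangle around it.
    // Coordinates are in points, measured from the bottom-left of the page.
    public static byte[] DrawBoxedText(string text, float x, float y)
    {
        using (var ms = new MemoryStream())
        {
            using (var doc = new Document())
            {
                var writer = PdfWriter.GetInstance(doc, ms);
                doc.Open();

                // Pick an exact font and size up front (assumed values)
                var font = BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED);
                const float fontSize = 12f;

                // "I've previously calculated the width of the text"
                float width = font.GetWidthPoint(text, fontSize);

                // "Draw this text at this exact X,Y location using this exact font"
                PdfContentByte cb = writer.DirectContent;
                cb.BeginText();
                cb.SetFontAndSize(font, fontSize);
                cb.ShowTextAligned(PdfContentByte.ALIGN_LEFT, text, x, y, 0);
                cb.EndText();

                // "Then draw a rectangle at this other exact location";
                // the padding here is a rough guess, not a precise font metric
                cb.Rectangle(x - 2, y - 4, width + 4, fontSize + 4);
                cb.Stroke();
            }

            // Disposing the Document closes the PDF; the bytes are still readable
            return ms.ToArray();
        }
    }
}
```

Calling AbsolutePositionSketch.DrawBoxedText("Hello", 72f, 720f) should place the text about one inch from the left edge near the top of a default A4 page. Nothing in the file records that the rectangle "belongs" to the text; the two only line up because the positions were computed beforehand.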

Second, iText and iTextSharp parse HTML and CSS. That's it. ASP.Net, MVC, Razor, Struts, Spring, etc., are all HTML frameworks, but iText/iTextSharp is 100% unaware of them. The same goes for DataGridViews, Repeaters, Templates, Views, etc., which are all framework-specific abstractions. It is your responsibility to get the HTML from your framework of choice; iText won't help you. If you get an exception saying "The document has no pages", or you think that "iText isn't parsing my HTML", it is almost certain that you don't actually have HTML, you only think you do.
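One cheap way to catch the "you only think you have HTML" case early is to inspect the string before handing it to the parser. The sketch below is a hypothetical guard (HtmlGuard and RequireHtml are names invented here, not part of iTextSharp), and its check is deliberately crude: it only rejects empty input and input without any tag-like character.

```csharp
using System;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class HtmlGuard
{
    // Returns the input unchanged if it plausibly contains markup,
    // otherwise throws with a message that points at the real culprit.
    public static string RequireHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
        {
            throw new ArgumentException(
                "No HTML was produced; check the framework code that renders it.", "html");
        }

        // Crude heuristic: real HTML contains at least one '<'
        if (html.IndexOf('<') < 0)
        {
            throw new ArgumentException(
                "The input contains no tags, so it is probably not HTML.", "html");
        }

        return html;
    }
}
```

Wrapping the input as new StringReader(HtmlGuard.RequireHtml(renderedHtml)), where renderedHtml stands for whatever your framework produced, would turn a confusing "no pages" failure into an error that names the empty input.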

Third, the built-in class that has been around for years is HTMLWorker; however, it has been replaced by XMLWorker (Java / .Net). Zero work is being done on HTMLWorker, which doesn't support CSS files, has only limited support for the most basic CSS properties, and actually breaks on certain tags. If you do not see the HTML attribute or CSS property and value in this file then it probably isn't supported by HTMLWorker. XMLWorker can be more complicated at times, but those complications also make it more extensible.

Below is C# code that shows how to parse HTML tags into iText abstractions that get automatically added to the document you are working on. C# and Java are very similar, so it should be relatively easy to convert. Example #1 uses the built-in HTMLWorker to parse the HTML string. Since only inline styles are supported, the class="headline" gets ignored, but everything else should work. Example #2 is the same as the first except that it uses XMLWorker instead. Example #3 also parses the sample CSS.

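//Namespaces used below; these directives belong at the top of the file
using System;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

//Create a byte array that will eventually hold our final PDF
Byte[] bytes;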

//Boilerplate iTextSharp setup here
//Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream()) {

    //Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
    using (var doc = new Document()) {

        //Create a writer that's bound to our PDF abstraction and our stream
        using (var writer = PdfWriter.GetInstance(doc, ms)) {

            //Open the document for writing
            doc.Open();

            //Our sample HTML and CSS
            var example_html = @"<p>This <em>is </em><span class=""headline"" style=""text-decoration: underline;"">some</span> <strong>sample <em> text</em></strong><span style=""color: red;"">!!!</span></p>";
            var example_css = @".headline{font-size:200%}";

            /**************************************************
             * Example #1                                     *
             *                                                *
             * Use the built-in HTMLWorker to parse the HTML. *
             * Only inline CSS is supported.                  *
             * ************************************************/

            //Create a new HTMLWorker bound to our document
            using (var htmlWorker = new iTextSharp.text.html.simpleparser.HTMLWorker(doc)) {

                //HTMLWorker doesn't read a string directly but instead needs a TextReader (which StringReader subclasses)
                using (var sr = new StringReader(example_html)) {

                    //Parse the HTML
                    htmlWorker.Parse(sr);
                }
            }

            /**************************************************
             * Example #2                                     *
             *                                                *
             * Use the XMLWorker to parse the HTML.           *
             * Only inline CSS and absolutely linked          *
             * CSS is supported                               *
             * ************************************************/

            //XMLWorker also reads from a TextReader and not directly from a string
            using (var srHtml = new StringReader(example_html)) {

                //Parse the HTML
                iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, srHtml);
            }

            /**************************************************
             * Example #3                                     *
             *                                                *
             * Use the XMLWorker to parse HTML and CSS        *
             * ************************************************/

            //In order to read CSS as a string we need to switch to a different constructor
            //that takes Streams instead of TextReaders.
            //Below we convert the strings into UTF8 byte array and wrap those in MemoryStreams
            using (var msCss = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_css))) {
                using (var msHtml = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(example_html))) {

                    //Parse the HTML
                    iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, msHtml, msCss);
                }
            }


            doc.Close();
        }
    }

    //After all of the PDF "stuff" above is done and closed but **before** we
    //close the MemoryStream, grab all of the active bytes from the stream
    bytes = ms.ToArray();
}

//Now we just need to do something with those bytes.
//Here I'm writing them to disk but if you were in ASP.Net you might Response.BinaryWrite() them.
//You could also write the bytes to a database in a varbinary() column (but please don't) or you
//could pass them to another function for further PDF processing.
var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf");
System.IO.File.WriteAllBytes(testFile, bytes);

2017 update

There is good news for HTML-to-PDF demands. As this answer showed, the W3C standard css-break-3 will solve the problem... It is a Candidate Recommendation, with plans to become a definitive Recommendation this year, after tests.

As shown by print-css.rocks.
