How to parse multiple HTML files into a single PDF?


Problem description

I want to use iText to convert a series of HTML files to PDF.

For instance, suppose I have these files:


  • page1.html
  • page2.html
  • page3.html
  • ...

Now I want to create a single PDF file, where page1.html is the first page, page2.html is the second page, and so on.

I know how to convert a single HTML file to a PDF, but I don't know how to combine the separate PDFs resulting from this operation into a single PDF.

Recommended answer

Before we start: I am not a C# developer, so I cannot give you an example in C#. All the iText examples I write are written in Java. Fortunately, iText and iTextSharp are always kept in sync. In the context of this question, you can rest assured that whatever works for iText will also work for iTextSharp, but you'll have to make small adaptations that are specific to C#. From what I hear from C# developers, this is usually not hard to achieve.

Regarding the answer: there are two answers. Answer #2 is generally better than answer #1, but I'm giving both options because there may be specific cases where answer #1 is the better fit.

Test data: I have created 3 simple HTML files, each containing some info about a state in the US:

  • page1.html: California
  • page2.html: New York
  • page3.html: Massachusetts

We are going to use XML Worker to parse these three files, and we want a single PDF file as a result.

Answer #1: see ParseMultipleHtmlFiles1 for the full code sample and multiple_html_pages1.pdf for the resulting PDF.

You say that you already succeeded in converting one HTML file into one PDF file. I assume that you did it like this:

public byte[] parseHtml(String html) throws DocumentException, IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    // step 1: create the document
    Document document = new Document();
    // step 2: write the PDF to memory rather than to a file
    PdfWriter writer = PdfWriter.getInstance(document, baos);
    // step 3
    document.open();
    // step 4: html is the path of the HTML file; close the stream after parsing
    try (FileInputStream in = new FileInputStream(html)) {
        XMLWorkerHelper.getInstance().parseXHtml(writer, document, in);
    }
    // step 5
    document.close();
    // return the bytes of the PDF
    return baos.toByteArray();
}

This is not the most efficient way to parse an HTML file (there are other examples on the web site), but it's the simplest way.

As you can see, this method parses an HTML file into a PDF and returns that PDF in the form of a byte[]. As we want to create a single PDF, we can feed this byte array to a PdfCopy instance, so that we can concatenate multiple documents.

Suppose that we have three files:

public static final String[] HTML = {
    "resources/xml/page1.html",
    "resources/xml/page2.html",
    "resources/xml/page3.html"
};
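
The array above hard-codes three paths. If the number of pages is not fixed (the "..." in the question), the list could be built from a directory instead. The following is a minimal JDK-only sketch, not part of the original answer: the class HtmlPageLister, its methods, and the page<number>.html naming scheme are assumptions made for illustration. It sorts on the numeric part of the name so that page10.html comes after page2.html, which a plain string sort would not give you.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
class HtmlPageLister {

    // Assumed naming scheme: page<number>.html (up to 9 digits, so parseInt cannot overflow)
    private static final Pattern PAGE_NAME = Pattern.compile("page(\\d{1,9})\\.html");

    /** Returns the page number encoded in the file name, or -1 if the name does not match. */
    static int pageNumber(String fileName) {
        Matcher m = PAGE_NAME.matcher(fileName);
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Keeps only names that match the scheme and orders them by page number. */
    static List<String> sortPages(Collection<String> fileNames) {
        List<String> pages = new ArrayList<>();
        for (String name : fileNames) {
            if (pageNumber(name) >= 0) {
                pages.add(name);
            }
        }
        pages.sort(Comparator.comparingInt(HtmlPageLister::pageNumber));
        return pages;
    }

    /** Lists the matching files in dir and returns their paths in page order. */
    static List<String> listPages(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "page*.html")) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        List<String> paths = new ArrayList<>();
        for (String name : sortPages(names)) {
            paths.add(dir.resolve(name).toString());
        }
        return paths;
    }
}
```

Under these assumptions, the result of HtmlPageLister.listPages(Paths.get("resources/xml")) could be looped over in place of the HTML array.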

We can loop over these three documents, parse them one by one to a byte[], create a PdfReader instance with the PDF bytes, and add the document to the PdfCopy instance using the addDocument() method:

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileOutputStream(file));
    document.open();
    PdfReader reader;
    for (String html : HTML) {
        reader = new PdfReader(parseHtml(html));
        copy.addDocument(reader);
        reader.close();
    }
    document.close();
} 

This solves your problem, so why do I think it's not the optimal solution?

Suppose that you need to use a special font that needs to be embedded. In that case, every separate PDF file will contain a subset of that font. Different files will require different font subsets, and neither PdfCopy nor PdfSmartCopy can merge font subsets. This could result in a bloated PDF file that contains many separate subsets of the same font.

How do we solve this? That's explained in answer #2.

Answer #2: see ParseMultipleHtmlFiles2 for the full code sample and multiple_html_pages2.pdf for the resulting PDF. You can already see the difference in file size: 4.61 KB versus 5.05 KB (and we didn't even introduce embedded fonts).

In this case, we don't parse the HTML to a PDF file the way we did in the parseHtml() method from answer #1. Instead, we parse the HTML to an iText ElementList using the parseToElementList() method. This method requires two Strings: one containing the HTML code, the other containing CSS values.

We use a utility method to read the HTML file into a String. As for the CSS value, we could pass null to parseToElementList(), but in that case, default styles will be ignored. You'll notice that the <h1> tag we introduced in our HTML will look completely different if you don't pass the default.css that is shipped with XML Worker.
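
The answer relies on Utilities.readFileToString() and a readCSS() helper whose body is not shown here. As a rough JDK-only sketch of what reading a file or a stream into a String can look like (the class HtmlTextReader and its method names are hypothetical, UTF-8 is an assumed encoding, and this is not the iText implementation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of iText.
class HtmlTextReader {

    /** Reads a whole file into a String, assuming it is encoded as UTF-8. */
    static String readFile(String path) throws IOException {
        return new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
    }

    /**
     * Reads a whole stream into a String, assuming UTF-8. This is useful when
     * the CSS comes from a classpath resource instead of a file on disk.
     * The caller remains responsible for closing the stream.
     */
    static String readStream(InputStream in) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int length;
        while ((length = in.read(buffer)) != -1) {
            baos.write(buffer, 0, length);
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Where the default.css resource lives inside the XML Worker jar is not stated in this answer, so the stream you would pass to readStream() is left to the reader.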

Long story short, this is the code:

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    String css = readCSS();
    for (String htmlfile : HTML) {
        String html = Utilities.readFileToString(htmlfile);
        ElementList list = XMLWorkerHelper.parseToElementList(html, css);
        for (Element e : list) {
            document.add(e);
        }
        document.newPage();
    }
    document.close();
}

We create a single Document and a single PdfWriter instance. We parse the different HTML files into ElementLists one by one, and we add all the elements to the Document.

Because you want a new page each time a new HTML file is parsed, I introduced a document.newPage() call. If you remove this line, the content of the three HTML files can share a single page (which wouldn't be possible if you opted for answer #1).
