iText: Importing styled text and information from an existing PDF


Question

I'm generating PDFs using iText and it works fine. But at some point I need a way to import HTML-styled information from an existing PDF.

I know I could just use the XMLWorker class to generate the text directly from HTML in my own document. But because I'm not sure whether it actually supports all the HTML features I need, I'm looking to work around this. Therefore a PDF is generated from the HTML using XSLT. The content of this PDF should then be copied into my document.

There are two ways described in the book "iText in Action". The first parses the PDF and gets the text (or other information) from the document using PdfReaderContentParser and a TextExtractionStrategy. It looks like this:

PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new LocationTextExtractionStrategy());
    document.add(new Chunk(strategy.getResultantText()));
}

But this only prints plain text to the document. Obviously there are more extraction strategies, and maybe one of them does exactly what I want, but I haven't found it yet.

The second way is to add a com.itextpdf.text.Image of each page of the PDF to your document. This is obviously not a good idea, because it adds the entire page to your document even if there is only one line of text in the existing PDF. It's done like this:

PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
PdfReader reader = new PdfReader(pdf);
PdfImportedPage page;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    page = writer.getImportedPage(reader, i);
    document.add(Image.getInstance(page));
}

As I said, this also copies all the empty lines at the end of the PDF, but I need to continue my text immediately after the last line of text. If I could convert this com.itextpdf.text.Image into a java.awt.image.BufferedImage, I could use getSubimage() together with information I can extract from the PDF to cut away all the empty lines. But I wasn't able to find a way to do this.

These are the two ways I found. But because neither of them is suitable for my purpose as is, my question is: Is there a way to import everything except the empty lines at the end, but including text style information, tables and everything else, from a PDF into my document using iText?

Solution

You can trim away the empty space of the XSLT-generated PDF and then import the trimmed pages as in your code.

Example code

The following code borrows from the code in my answer to Using iTextPDF to trim a page's whitespace. In contrast to the code there, though, we have to manipulate the media box, not the crop box, because this is the only box respected by PdfWriter.getImportedPage.

Before importing a page from a given PdfReader, crop it using this method:

static void cropPdf(PdfReader reader) throws IOException
{
    int n = reader.getNumberOfPages();
    for (int i = 1; i <= n; i++)
    {
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        MarginFinder finder = parser.processContent(i, new MarginFinder());
        Rectangle rect = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());

        PdfDictionary page = reader.getPageN(i);
        page.put(PdfName.MEDIABOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
    }
}

(excerpt from ImportPageWithoutFreeSpace.java)
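
The cropPdf method sets the media box to exactly the bounds reported by the margin finder. If you would rather keep a few points of space around the imported content, one option is to pad the tight box and clamp it to the original media box before writing it back. The following is only a sketch of that arithmetic in plain Java: the helper name padAndClamp, the 6-point padding and the sample coordinates are assumptions for illustration and are not part of the original answer.

```java
import java.util.Arrays;

public class MediaBoxPadding {

    /**
     * Hypothetical helper (not from the original answer): grows a tight box
     * given as {llx, lly, urx, ury} by `padding` points on every side, but
     * never beyond the original media box given in the same order.
     */
    static float[] padAndClamp(float[] tight, float[] original, float padding) {
        return new float[] {
            Math.max(original[0], tight[0] - padding), // left edge
            Math.max(original[1], tight[1] - padding), // bottom edge
            Math.min(original[2], tight[2] + padding), // right edge
            Math.min(original[3], tight[3] + padding)  // top edge
        };
    }

    public static void main(String[] args) {
        // Assumed sample values: an A4-sized media box and a tight content box.
        float[] mediaBox = {0f, 0f, 595f, 842f};
        float[] tight = {72f, 700f, 300f, 838f};

        float[] padded = padAndClamp(tight, mediaBox, 6f);
        System.out.println(Arrays.toString(padded));
    }
}
```

The resulting array has the same {left, bottom, right, top} order as the float array that cropPdf passes to new PdfArray(...), so it could be used there in place of the tight values.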

The extended render listener MarginFinder is taken as is from the question linked to above. You can find a copy here: MarginFinder.java.
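
The linked MarginFinder class is not reproduced here. As a rough illustration of the idea behind it, the sketch below accumulates the union of the bounding boxes of everything drawn on a page. It is plain Java without any iText dependency; the class name BoundingBoxAccumulator and its include method are hypothetical stand-ins, and the getter names are simply chosen to mirror those that cropPdf calls on the finder. In the real class, the rectangles would come from iText render events rather than from hand-written calls.

```java
import java.awt.geom.Rectangle2D;

public class BoundingBoxAccumulator {

    // Stays null until the first piece of content has been seen.
    private Rectangle2D.Float box;

    /** Records one rendered item by its lower-left and upper-right corners. */
    public void include(float llx, float lly, float urx, float ury) {
        Rectangle2D.Float item = new Rectangle2D.Float(llx, lly, urx - llx, ury - lly);
        if (box == null) {
            box = item;
        } else {
            box.add(item); // grow the box to the union of both rectangles
        }
    }

    /** True if no content was recorded, in which case the getters must not be used. */
    public boolean isEmpty() { return box == null; }

    public float getLlx() { return (float) box.getMinX(); }
    public float getLly() { return (float) box.getMinY(); }
    public float getUrx() { return (float) box.getMaxX(); }
    public float getUry() { return (float) box.getMaxY(); }

    public static void main(String[] args) {
        BoundingBoxAccumulator finder = new BoundingBoxAccumulator();
        // Assumed sample content: two lines of text somewhere near the top of a page.
        finder.include(72f, 700f, 300f, 720f);
        finder.include(100f, 650f, 400f, 690f);
        System.out.println(finder.getLlx() + " " + finder.getLly() + " "
                + finder.getUrx() + " " + finder.getUry());
    }
}
```

Note that a page without any content leaves the accumulator empty, so code built on this sketch would need to check isEmpty() before reading the coordinates.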

Example run

Using this code

PdfReader readerText = new PdfReader(docText);
cropPdf(readerText);
PdfReader readerGraphics = new PdfReader(docGraphics);
cropPdf(readerGraphics);
try (   FileOutputStream fos = new FileOutputStream(new File(RESULT_FOLDER, "importPages.pdf")))
{
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, fos);
    document.open();
    document.add(new Paragraph("Let's import 'textOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerText, 1)));
    document.add(new Paragraph("and now 'graphicsOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerGraphics, 1)));
    document.add(new Paragraph("That's all, folks!", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));

    document.close();
}
finally
{
    readerText.close();
    readerGraphics.close();
}

(excerpt from unit test method testImportPages in ImportPageWithoutFreeSpace.java)

I imported both the page from the docText document and the page from the docGraphics document into a new document with some text before, between, and after.

(The screenshots of the two source pages and of the result are not preserved here.)

As you can see, source styles are preserved but free space around is discarded.
