PDFBox PDF to image generates overlapping text

Question

For a side project I started using PDFBox to convert a PDF file to an image. This is the PDF file I am converting: https://bitcoin.org/bitcoin.pdf.

This is the code I am using. It is very simple code that calls PDFToImage, but the output JPG image looks really bad, with lots of commas inserted and some overlapping text.

    String [] args_2 =  new String[7];
    String pdfPath = "C:\\bitcoin.pdf";
    args_2[0] = "-startPage";
    args_2[1] = "1";
    args_2[2] = "-endPage";
    args_2[3] = "1";
    args_2[4] = "-outputPrefix";
    args_2[5] = "my_image_2";
    //args_2[6] = "-resolution";
    //args_2[7] = "1000";
    args_2[6] = pdfPath;
    try {
        PDFToImage.main(args_2);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

Recommended answer

If you look at the logging output (you may need to activate logging in your environment), you'll see many entries like these (generated using PDFBox 1.8.5):

Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
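
If no such entries show up, logging may be switched off or filtered in your environment. The following is a minimal sketch of turning it on programmatically. It assumes that PDFBox's log messages end up in java.util.logging (the format of the entries above is consistent with that) and that the logger names start with org.apache.pdfbox; the class name PdfBoxLoggingSketch is made up for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfBoxLoggingSketch {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so an unreferenced logger could lose its configuration.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Makes WARNING (and more severe) messages below org.apache.pdfbox visible on the console. */
    static Logger enableWarnings() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);

        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.WARNING);
        PDFBOX_LOGGER.addHandler(handler);

        // Assumption: we do not also want the root handlers to print the same message.
        PDFBOX_LOGGER.setUseParentHandlers(false);
        return PDFBOX_LOGGER;
    }
}
```

Call PdfBoxLoggingSketch.enableWarnings() once before running the conversion. Loggers for individual classes such as org.apache.pdfbox.pdmodel.font.PDSimpleFont inherit the level and the handler from this parent logger.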

So PDFBox renders the text with fonts other than the ones the PDF specifies. This explains both the many inserted commas and the overlapping text:

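  1. Different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character;
  2. Different fonts have different glyph widths. In your sample PDF the differing glyph widths result in overlapping text.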
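
To illustrate the second point, here is a simplified model, not PDFBox code: the glyph positions follow the widths of the font named in the PDF, while the glyphs that are actually drawn come from a substitute font. Whenever a substitute glyph is wider than the slot reserved for it, it runs into the next glyph. All width values and the class name GlyphOverlapSketch are hypothetical.

```java
public class GlyphOverlapSketch {

    /**
     * For each glyph, returns how far it extends into the slot of the next glyph.
     *
     * @param pdfAdvances      advance widths the layout is based on (font named in the PDF)
     * @param substituteWidths widths of the glyphs that are actually drawn (fallback font)
     */
    static double[] overlaps(double[] pdfAdvances, double[] substituteWidths) {
        double[] result = new double[pdfAdvances.length];
        for (int i = 0; i < pdfAdvances.length; i++) {
            // A narrower substitute glyph leaves a gap instead, which we report as 0.
            result[i] = Math.max(0.0, substituteWidths[i] - pdfAdvances[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical widths in 1/1000 of the font size.
        double[] pdfAdvances = {500, 278, 444};
        double[] substituteWidths = {556, 278, 400};

        double[] overlap = overlaps(pdfAdvances, substituteWidths);
        for (int i = 0; i < overlap.length; i++) {
            System.out.println("glyph " + i + " overlaps its successor by " + overlap[i]);
        }
    }
}
```

In this model only the first glyph overlaps its neighbor; the third one is narrower than its slot and leaves a gap instead.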

The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox currently under development, instead. Be aware, though, that the classes used for rendering have changed.

Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT, you can render PDFs like this:

PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();

PDFRenderer renderer = new PDFRenderer(document);

for (int i = 0; i < pages.size(); i++)
{
    BufferedImage image = renderer.renderImage(i);
    ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}

Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.

PS: As proposed by Tilman Hausherr, you may want to replace the ImageIO.write call with

    ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);

ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.

If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
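
The value 72 is not arbitrary: PDF page dimensions are measured in points, and there are 72 points per inch, so an unscaled rendering corresponds to 72 DPI. The following sketch shows the arithmetic that ties a chosen resolution to the pixel size of the resulting image. It is plain Java rather than PDFBox API; the class name RenderScaleSketch and the page size in the example are assumptions for illustration.

```java
public class RenderScaleSketch {

    // PDF user space: 1 point = 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Scale factor relative to an unscaled (72 DPI) rendering. */
    static double scaleForDpi(double dpi) {
        return dpi / POINTS_PER_INCH;
    }

    /** Pixel length of a page side given in points, rendered at the given DPI. */
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // Hypothetical page: US Letter, 612 x 792 points.
        double dpi = 300;
        System.out.println("scale  = " + scaleForDpi(dpi));
        System.out.println("width  = " + pixels(612, dpi) + " px");
        System.out.println("height = " + pixels(792, dpi) + " px");
    }
}
```

Whatever DPI value goes into the rendering call is the same value that should be written into the image file as its DPI attribute, so that viewers display the image at the physical size of the original page.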
