pdfbox and itext extracting image with incorrect dpi

Question

When I extract an image using pdfbox, I get an incorrect dpi for the image for some PDFs. When I extract the image using Photoshop or Acrobat Reader Pro, Windows Photo Viewer shows that the dpi of the image is 200, but when I extract the image using pdfbox the dpi is 72.

To extract the image I am using the code from the following question: "Not able to extract images from PDFA1-a format document"

When I check the logs I see an unusual entry:

2015-01-23-main--DEBUG-org.apache.pdfbox.util.TIFFUtil:

<?xml version="1.0" encoding="UTF-8"?><javax_imageio_jpeg_image_1.0>
  <JPEGvariety>
    <app0JFIF majorVersion="1" minorVersion="2" resUnits="0" Xdensity="1" Ydensity="1" thumbWidth="0" thumbHeight="0"/>
  </JPEGvariety>
  <markerSequence>
    <dqt>
      <dqtable elementPrecision="0" qtableId="0"/>
      <dqtable elementPrecision="0" qtableId="1"/>
    </dqt>
    <dht>
      <dhtable class="0" htableId="0"/>
      <dhtable class="0" htableId="1"/>
      <dhtable class="1" htableId="0"/>
      <dhtable class="1" htableId="1"/>
    </dht>
    <sof process="0" samplePrecision="8" numLines="0" samplesPerLine="0" numFrameComponents="3">
      <componentSpec componentId="1" HsamplingFactor="2" VsamplingFactor="2" QtableSelector="0"/>
      <componentSpec componentId="2" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
      <componentSpec componentId="3" HsamplingFactor="1" VsamplingFactor="1" QtableSelector="1"/>
    </sof>
    <sos numScanComponents="3" startSpectralSelection="0" endSpectralSelection="63" approxHigh="0" approxLow="0">
      <scanComponentSpec componentSelector="1" dcHuffTable="0" acHuffTable="0"/>
      <scanComponentSpec componentSelector="2" dcHuffTable="1" acHuffTable="1"/>
      <scanComponentSpec componentSelector="3" dcHuffTable="1" acHuffTable="1"/>
    </sos>
  </markerSequence>
</javax_imageio_jpeg_image_1.0>

I tried to google it, but I can't seem to find out what pdfbox means by this log entry. What does this mean?

You can download a sample PDF with this problem from this link: http://myslams.com/test/1.pdf

I have even tried itext, but it extracts the image with 96 dpi.

Am I doing something wrong? Or do pdfbox and itext have this limitation?

Recommended answer

After some digging I found your 1.pdf. Thus,...

In the comments to this recent answer, @Tilman and you were discussing this older answer, in which @Tilman pointed to the PrintImageLocations PDFBox example. I ran it on your file and got:

Processing page: 0
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 888px
size = 613.44, 319.68
size = 8.52in, 4.44in
size = 216.408mm, 112.776mm

Processing page: 1
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 2
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 2800px
size = 613.44, 1008.0
size = 8.52in, 14.0in
size = 216.408mm, 355.6mm

Processing page: 3
*******************************************************************
Found image [Im0]
position = 0.0, 0.0
size = 1704px, 1464px
size = 613.44, 527.04
size = 8.52in, 7.3199997in
size = 216.408mm, 185.928mm

On all pages this amounts to 200 dpi in both the x and y directions (1704px / 8.52in = 888px / 4.44in = 2800px / 14.0in = 1464px / 7.32in = 200 dpi).
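The arithmetic above can be sketched in a few lines of plain Java. This is only an illustration: the class name ImageDpi and its method are made up for this example and belong to neither PDFBox nor iText, and the sketch assumes the default PDF user space of 72 units per inch (i.e. no UserUnit entry on the page).

```java
final class ImageDpi {
    /** Default PDF user space units per inch (assumes the page has no UserUnit entry). */
    static final double UNITS_PER_INCH = 72.0;

    /** Effective resolution along one side: pixel count divided by displayed size in inches. */
    static double dpi(double pixels, double userUnits) {
        if (userUnits <= 0) {
            throw new IllegalArgumentException("displayed size must be positive");
        }
        return pixels / (userUnits / UNITS_PER_INCH);
    }

    public static void main(String[] args) {
        // Pixel and user-unit sizes as reported for the first page of the sample file
        System.out.printf("x: %.1f dpi, y: %.1f dpi%n", dpi(1704, 613.44), dpi(888, 319.68));
    }
}
```

The same call with the values of the other pages (e.g. 2800 px over 1008.0 units) follows the identical formula.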

So PDFBox gives you the dpi values you are after.

(@Tilman: The current 2.0.0-SNAPSHOT version of that sample returns utter nonsense; you might want to fix this.)

A simplified iText version of that PDFBox example would be this:

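// Imports for the iText 5.x parser API used below
import java.io.IOException;
import java.io.InputStream;

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.Matrix;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// Walks all pages and lets the listener report every image drawn on them
public void printImageLocations(InputStream stream) throws IOException
{
    PdfReader reader = new PdfReader(stream);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    ImageRenderListener listener = new ImageRenderListener();

    for (int page = 1; page <= reader.getNumberOfPages(); page++)
    {
        System.out.printf("\nPage %s:\n", page);
        parser.processContent(page, listener);
    }
}

static class ImageRenderListener implements RenderListener
{
    // Text events are irrelevant here
    public void beginTextBlock() { }
    public void renderText(TextRenderInfo renderInfo) { }
    public void endTextBlock() { }

    public void renderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfDictionary imageDict = renderInfo.getImage().getDictionary();

            // Size of the image resource in pixels
            float widthPx = imageDict.getAsNumber(PdfName.WIDTH).floatValue();
            float heightPx = imageDict.getAsNumber(PdfName.HEIGHT).floatValue();
            // Size on the page in user space units (1/72 in), taken from the
            // scaling entries of the current transformation matrix
            float widthUu = renderInfo.getImageCTM().get(Matrix.I11);
            float heightUu = renderInfo.getImageCTM().get(Matrix.I22);

            System.out.printf("Image %.0fpx*%.0fpx, %.0fuu*%.0fuu, %.2fin*%.2fin\n", widthPx, heightPx, widthUu, heightUu, widthUu/72, heightUu/72);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}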

(Beware: I assumed unrotated and unskewed images.)

The results for your file:

Page 1:
Image 1704px*888px, 613uu*320uu, 8,52in*4,44in

Page 2:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 3:
Image 1704px*2800px, 613uu*1008uu, 8,52in*14,00in

Page 4:
Image 1704px*1464px, 613uu*527uu, 8,52in*7,32in

Thus, 200 dpi throughout here as well. So iText, too, gives you the dpi values you are after.

Obviously the code you referenced had no chance of reporting a dpi value that is sensible in the context of the PDF, because it only extracts the images as found in the resources and ignores how the respective image resource is used on the page.

An image resource can be stretched, rotated, skewed, ... any way the author likes when using it in the page content.

BTW, a dpi value only makes sense if the author did not skew the image and rotated it only by a multiple of 90°.
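For rotated images, reading only the two diagonal matrix entries is not enough. The following plain-Java sketch shows one way to handle that case; it is an assumption-laden illustration, not part of iText or PDFBox. The class CtmDpi and its methods are hypothetical names, the parameters a, b, c, d stand for the first four entries of the image's transformation matrix [a b c d e f], and the page is assumed to use the default 72 user units per inch.

```java
final class CtmDpi {
    /** Default PDF user space units per inch (assumes no UserUnit entry). */
    static final double UNITS_PER_INCH = 72.0;

    /**
     * The image's unit square is mapped so that its width runs along (a, b)
     * and its height along (c, d). The image is skewed when these two
     * vectors are not perpendicular, i.e. their dot product is not zero.
     */
    static boolean isSkewed(double a, double b, double c, double d) {
        double tolerance = 1e-6 * Math.hypot(a, b) * Math.hypot(c, d);
        return Math.abs(a * c + b * d) > tolerance;
    }

    /** Returns {horizontal dpi, vertical dpi} measured along the image's own axes. */
    static double[] dpi(int widthPx, int heightPx, double a, double b, double c, double d) {
        if (isSkewed(a, b, c, d)) {
            throw new IllegalArgumentException("dpi is not meaningful for a skewed image");
        }
        double widthInches = Math.hypot(a, b) / UNITS_PER_INCH;  // displayed width
        double heightInches = Math.hypot(c, d) / UNITS_PER_INCH; // displayed height
        return new double[] { widthPx / widthInches, heightPx / heightInches };
    }

    public static void main(String[] args) {
        // Hypothetical case: the first-page image of the sample, rotated by 90 degrees
        double[] result = dpi(1704, 888, 0, 613.44, -319.68, 0);
        System.out.printf("x: %.1f dpi, y: %.1f dpi%n", result[0], result[1]);
    }
}
```

For rotations that are not a multiple of 90° the sketch still returns numbers, but they describe the resolution along the rotated image axes rather than along the page axes, which is why such values should be treated with care.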
