How can I remove all images/drawings from a PDF file and leave text only in Java?


Question

I have a PDF file that is the output of an OCR processor. The processor recognizes the image and adds the text to the PDF, but at the end it places a low-quality image instead of the original one (I have no idea why anyone would do that, but they do).

So I would like to take this PDF, remove the image stream, and leave the text alone, so that I can import it (using the iText page importing feature) into a PDF I'm creating myself with the real image.

Before someone asks: I have already tried another tool (JPedal) to extract the text coordinates, but when I draw the text on my PDF it isn't in the same position as in the original.

I'd rather have this done in Java, but if another tool can do it better, just let me know. It could also be image removal only; I can live with a PDF that still has the drawings in it.

Recommended answer

I used Apache PDFBox in a similar situation.

To be a little more specific, try something like this:

import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;

// Written against the PDFBox 1.x API (org.apache.pdfbox.exceptions, getAllPages(), findResources()).
public class Main {
    public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
        PDDocument document = PDDocument.load("input.pdf");
        try {
            // Try to decrypt with an empty password so the document can be modified.
            if (document.isEncrypted()) {
                document.decrypt("");
            }

            // Clear the images registered in each page's resources.
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (Object pageObj : catalog.getAllPages()) {
                PDPage page = (PDPage) pageObj;
                PDResources resources = page.findResources();
                resources.getImages().clear();
            }

            document.save("strippedOfImages.pdf");
        } finally {
            // Always release the resources held by the document.
            document.close();
        }
    }
}

It's supposed to remove all types of images (PNG, JPEG, ...). It should work like this:

Sample article (before): http://s3.postimage.org/28f6boykk/before.jpg
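
One caveat may be worth keeping in mind. In a PDF, an image is painted by a `Do` operator in the page content stream, and that operator refers by name to an entry in the page's XObject resources. Clearing the resource entries alone can therefore leave `Do` operators that point at nothing. The sketch below is a simplified, library-free model of filtering those operators out. It is not PDFBox code: the class `ImageOperatorFilter`, the method `stripImageDraws`, and the representation of a content stream as a flat list of string tokens are all assumptions made for illustration. A real implementation would use a PDF library's content stream parser and writer instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: models a content stream as a flat list of string tokens.
public class ImageOperatorFilter {

    /**
     * Returns a copy of the tokens without any "/Name Do" pair whose name
     * is listed in imageNames. All other tokens are kept in order.
     */
    public static List<String> stripImageDraws(List<String> tokens, Set<String> imageNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if ("Do".equals(token) && !kept.isEmpty()) {
                // The operand of Do is the name token written just before it.
                String operand = kept.get(kept.size() - 1);
                if (operand.startsWith("/") && imageNames.contains(operand.substring(1))) {
                    kept.remove(kept.size() - 1); // drop the operand
                    continue;                     // drop the operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // A tiny page: one image painted full-page, then one line of text.
        List<String> content = Arrays.asList(
                "q", "612", "0", "0", "792", "0", "0", "cm", "/Im0", "Do", "Q",
                "BT", "/F1", "12", "Tf", "72", "720", "Td", "(Hello)", "Tj", "ET");
        Set<String> images = new HashSet<>(Arrays.asList("Im0"));

        // The text operators (BT ... ET) are untouched; only "/Im0 Do" is gone.
        System.out.println(stripImageDraws(content, images));
    }
}
```

Because the text operators are left exactly as they were, the text keeps its original position, which is the property the question asks for. Names that are not in the image set (for example a form XObject) are deliberately kept, since they may carry drawings or text.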
