How to detect images in a document


Problem description

How can I detect images in a document, say DOC, XLS, PPT, or PDF?

I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html

But I'm not quite sure how it will detect images.

Any help is appreciated.

Thanks

Recommended answer

You've said you want a command-line solution without writing any Java code, so this isn't going to be the prettiest way to do it. If you're happy to write a little Java and create a new program to call from Python, you can do it much more cleanly.

The first thing to do is have the Tika App extract any embedded resources from your file. Use the --extract option for this, and have the extraction happen in a dedicated temp directory that your app controls, e.g.:

$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
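Since the caller is a Python program, the extraction step could be sketched roughly as follows. This is only an illustration under some assumptions: the function names `build_extract_command` and `extract_embedded` are hypothetical, `java` is assumed to be on the PATH, and the Tika App is run with the temp directory as its working directory so that the extracted files land there. It is also an assumption that the "Extracting ..." lines may arrive on either stdout or stderr, so both are returned together.

```python
import subprocess
import tempfile
from pathlib import Path


def build_extract_command(tika_jar, document):
    """Return the argv list for the Tika App --extract call shown above."""
    # Absolute paths are used because the command runs from the temp directory.
    return [
        "java", "-jar", str(Path(tika_jar).resolve()),
        "--extract", str(Path(document).resolve()),
    ]


def extract_embedded(tika_jar, document):
    """Run the extraction in a fresh temp directory.

    Returns (temp_dir, output_text). The caller is responsible for
    deleting temp_dir once it has finished with the extracted files.
    """
    temp_dir = tempfile.mkdtemp(prefix="tika-extract-")
    result = subprocess.run(
        build_extract_command(tika_jar, document),
        cwd=temp_dir,          # extracted resources are written here
        capture_output=True,
        text=True,
        check=True,            # raise CalledProcessError if Tika fails
    )
    # Assumption: the progress lines may be on either stream, so keep both.
    return temp_dir, result.stdout + result.stderr
```

A call such as `extract_embedded("tika.jar", "../testWORD_embedded_pdf.doc")` would then return the directory holding the extracted files along with the text that the Tika App printed.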

If you can, grab the output of the extraction and parse it looking for images (but be aware that some images have an application/ prefix on their canonical MIME type!). You might need to run a second --detect step on a few of them. I'm not sure, so test how the parsers get on with the extraction.
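A minimal Python sketch of that parsing step might look like this. It assumes the output lines follow the `Extracting 'name' (mime/type)` shape shown in the sample run; other Tika versions may format them differently, so the pattern tolerates trailing text. The function name `find_images` and the set of image-like `application/` types are illustrative assumptions (`application/x-emf` comes from the sample output, `application/x-msmetafile` is a guess for WMF), so extend the set after checking what your documents actually produce.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.+)' \((?P<mime>[^)]+)\)")

# Illustrative only: image formats reported under an application/ prefix.
IMAGE_LIKE_APPLICATION_TYPES = {
    "application/x-emf",         # seen in the sample output
    "application/x-msmetafile",  # assumption: WMF metafiles
}


def find_images(tika_output):
    """Return (file name, MIME type) pairs for extracted image resources."""
    images = []
    for line in tika_output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match is None:
            continue  # not an extraction line, skip it
        mime = match.group("mime")
        if mime.startswith("image/") or mime in IMAGE_LIKE_APPLICATION_TYPES:
            images.append((match.group("name"), mime))
    return images
```

Fed the two sample lines from the run above, this keeps `image1.emf` and drops the embedded PDF. Any file whose reported type looks too generic could be passed through a separate `--detect` run, as suggested.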

Now, if there were any images, they'll be in your temp dir. Process them however you want. Finally, zap the temp dir when you're done with the file!
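The process-then-clean-up step could be sketched in Python like this. The function name `process_and_clean` and the `handle` callback are hypothetical placeholders for whatever processing you need; the `try`/`finally` is there so the temp directory is removed even if processing raises.

```python
import shutil
from pathlib import Path


def process_and_clean(temp_dir, image_names, handle):
    """Call handle(path) for each extracted image, then delete temp_dir."""
    try:
        for name in image_names:
            path = Path(temp_dir) / name
            if path.is_file():   # skip names that were not actually written
                handle(path)
    finally:
        # Zap the temp directory whether or not processing succeeded.
        shutil.rmtree(temp_dir, ignore_errors=True)
```

For example, `process_and_clean(temp_dir, ["image1.emf"], print)` would print the path of the extracted image and then remove the directory.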
