How to detect images in a document
Question
How can I detect images in a document, say a DOC, XLS, PPT or PDF file?
I came across Apache Tika and am trying its command-line option: http://tika.apache.org/1.2/gettingstarted.html
But I'm not quite sure how it would detect images.
Any help is appreciated.
Thanks
Recommended answer
You've said you want a command-line solution and don't want to write any Java code, so this isn't going to be the prettiest way to do it... If you're happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more nicely!
The first thing to do is to have the Tika App extract any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.:
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
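Since the answer mentions calling this from Python, here is a minimal sketch of how that invocation might be wrapped. It assumes that `java` is on the PATH, that the extracted files land in the current working directory (which is why the process is started with `cwd` set to the directory you control), and that the "Extracting ..." messages are written to standard output. The function names `build_extract_command` and `run_extract` are made up for illustration.

```python
import os
import subprocess


def build_extract_command(tika_jar, document):
    """Return the argument list for: java -jar <tika_jar> --extract <document>."""
    # Use absolute paths, because the command is run from another directory.
    return [
        "java", "-jar", os.path.abspath(tika_jar),
        "--extract", os.path.abspath(document),
    ]


def run_extract(tika_jar, document, workdir):
    """Run the extraction with workdir as the current directory.

    Returns whatever the Tika App printed to standard output
    (assumed here to contain the "Extracting ..." lines).
    """
    result = subprocess.run(
        build_extract_command(tika_jar, document),
        cwd=workdir,          # assumption: files are extracted into the cwd
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

A call such as `run_extract("tika.jar", "../testWORD_embedded_pdf.doc", workdir)` would then correspond to the command shown above.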
Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few of them; I'm not sure, so test how the parsers get on with the extraction.
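As a rough sketch of that parsing step, the following assumes the output lines look exactly like the sample above ("Extracting 'name' (mimetype)"). The set `EXTRA_IMAGE_TYPES` is an illustrative guess at image formats reported with an application/ prefix, not a complete list; extend it after testing against your own documents. All names here are hypothetical.

```python
import re

# Matches lines such as: Extracting 'image1.emf' (application/x-emf)
EXTRACT_LINE = re.compile(r"^Extracting '(?P<name>.+)' \((?P<mime>[^()]+)\)$")

# Assumption: image types that do not start with "image/".
# Illustrative only; adjust to what your Tika version actually reports.
EXTRA_IMAGE_TYPES = {"application/x-emf", "application/x-msmetafile"}


def parse_extract_output(output):
    """Return (file name, mimetype) pairs for every 'Extracting' line."""
    entries = []
    for line in output.splitlines():
        match = EXTRACT_LINE.match(line.strip())
        if match:
            entries.append((match.group("name"), match.group("mime")))
    return entries


def is_image(mime):
    """True for image/* types and for the extra types listed above."""
    return mime.startswith("image/") or mime in EXTRA_IMAGE_TYPES


def find_images(output):
    """Return only the entries whose mimetype looks like an image."""
    return [(name, mime) for name, mime in parse_extract_output(output)
            if is_image(mime)]
```

Anything whose type is too generic to classify this way is a candidate for the second --detect pass mentioned above.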
Now, if there were images, they'll be in your temp dir. Process them however you want. Finally, zap the temp dir when you're done with the file!
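One possible way to handle the temp dir lifecycle from Python is sketched below, using the standard library's `tempfile.TemporaryDirectory`, which deletes the directory when the `with` block exits, even if processing raises. The helper names `with_temp_dir` and `list_extracted` and the `tika-extract-` prefix are made up for this example.

```python
import os
import tempfile


def list_extracted(workdir):
    """Return the sorted names of the regular files found in workdir."""
    return sorted(
        name for name in os.listdir(workdir)
        if os.path.isfile(os.path.join(workdir, name))
    )


def with_temp_dir(work):
    """Create a temp dir, call work(path), then delete the dir.

    The directory is removed when the with-block exits, whether
    work() returns normally or raises.
    """
    with tempfile.TemporaryDirectory(prefix="tika-extract-") as workdir:
        return work(workdir)
```

The callable passed as `work` would run the extraction inside the given directory, process whatever images turned up, and return any results it wants to keep, since the files themselves are gone afterwards.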