Get embedded resources in doc files using Apache Tika

Question

I have MS Word documents containing text and images, and I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse a document to XML. Here is my code:

// Detect the document type automatically and pick the matching parser
AutoDetectParser parser = new AutoDetectParser();
InputStream input = new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();

// Serialize the SAX events Tika emits into an XML string
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));

parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();

I want to extract the images from the document and convert them to binary format, but I don't know how to extract embedded resources from a document.

Recommended answer

You need to define your own class that implements Parser and attach it to the ParseContext you supply when parsing the outer document. Tika will then call your Parser for every embedded resource, which lets you save each one out if you want to.

The best example I can think of is in the Tika CLI, in the code behind the -z (extract) flag. If you look in the source code for TikaCLI, FileEmbeddedDocumentExtractor is the part to use as your example.

The simplest code looks something like this:

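import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import org.apache.commons.io.IOUtils; // assumed: the commons-io IOUtils
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

final AutoDetectParser parser = new AutoDetectParser();

public class ExtractParser extends AbstractParser {
   private int att = 0;
   public Set<MediaType> getSupportedTypes(ParseContext context) {
     // Everything AutoDetect parser does
     return parser.getSupportedTypes(context);
   }
   public void parse(
        InputStream stream, ContentHandler handler,
        Metadata metadata, ParseContext context)
        throws IOException, SAXException, TikaException {
      // Stream each embedded resource to a new file: out-1.bin, out-2.bin, ...
      File f = new File("out-" + (++att) + ".bin");
      try (FileOutputStream fout = new FileOutputStream(f)) {
         IOUtils.copy(stream, fout);
      }
   }
}

// Create the extracting parser that the context will hand to Tika
ExtractParser extractParser = new ExtractParser();

InputStream input = new FileInputStream(new File("1.docx"));
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, extractParser);
// handler is the ContentHandler for the outer document, as built in the question
parser.parse(input, handler, metadata, context);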
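
The question also asks for the images in binary form. As a minimal sketch, assuming ExtractParser has already written files such as out-1.bin, the bytes can be read back and Base64-encoded so that they can sit inside XML text. The class name EmbeddedBytes and the choice of Base64 are assumptions made for illustration; they are not part of the original answer. Only the Java standard library is used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical helper: turns an extracted resource into bytes or Base64 text.
final class EmbeddedBytes {
    // Read an extracted resource fully into memory.
    static byte[] readBytes(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Base64 output contains only characters that are safe inside XML text.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the first file name used by ExtractParser.
        Path file = Paths.get(args.length > 0 ? args[0] : "out-1.bin");
        byte[] data = readBytes(file);
        System.out.println(data.length + " bytes, Base64: " + toBase64(data));
    }
}
```

Reading the whole file into memory is fine for typical embedded images; very large attachments would be better streamed.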

You can also use the EmbeddedDocumentExtractor interface if you'd rather. Whether that or using Parser directly is better depends on what you want to do.
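
For the EmbeddedDocumentExtractor route, a rough sketch might look like the following. It assumes Tika is on the classpath; the class name SavingExtractor, the output directory, and the embedded-N.bin naming are hypothetical choices for illustration. It saves the raw bytes of each embedded resource and does not parse them further, so resources nested inside other embedded documents would not be reached.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class: writes every embedded resource into outputDir.
final class SavingExtractor implements EmbeddedDocumentExtractor {
    private final Path outputDir;
    private int count = 0;

    SavingExtractor(Path outputDir) {
        this.outputDir = outputDir;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        // Accept every embedded resource; filter on metadata here if needed.
        return true;
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        Path target = outputDir.resolve("embedded-" + (++count) + ".bin");
        // Files.copy does not close the stream; the caller stays responsible for it.
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

To use it, register the extractor on the context instead of a Parser, for example context.set(EmbeddedDocumentExtractor.class, new SavingExtractor(Paths.get("."))), and then call parser.parse(input, handler, metadata, context) as before.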
