Extract text from XML tags in an XML file using the Apache Tika parser

Views: 190 | Published: 2020/9/4 23:03:52 | Tags: xml, apache-tika


Question

I am trying to extract all the text from various documents, and for that I am using Apache Tika 1.4.

    RecursiveTikaParser parser = new RecursiveTikaParser(new AutoDetectParser());
    ParseContext parseContext = new ParseContext();
    parseContext.set(Parser.class, parser);

Here, RecursiveTikaParser is just a wrapper around AutoDetectParser.

Its parse method looks something like this:

    ContentHandler content = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    super.parse(stream, content, metadata, context);
    System.out.println("Parsed text is " + content.toString());

This code has to handle many different file types, which is why I am using AutoDetectParser.

In my testing I noticed that, given an XML file, I can only extract the text between the tags, not the comments or the tags themselves.

Is it possible to extract everything from such a file with my current approach?

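Recommended answer

Try this: detect the MIME type first and, for XML, parse with TXTParser so the file is read as plain text and its tags and comments are kept.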

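    // DETECTOR is not defined in the original answer; it is assumed here to be
    // a shared Detector instance, for example: new DefaultDetector()
    Metadata metadata = new Metadata();
    stream = TikaInputStream.get(stream, null);
    String mimeType = DETECTOR.detect(stream, metadata).toString();
    Parser parser;
    if (mimeType.equalsIgnoreCase("application/xml")) {
        // Treat XML as plain text so that tags and comments are kept
        parser = new TXTParser();
    } else {
        parser = new AutoDetectParser();
    }

    // -1 disables the write limit, as in the question
    // (the no-argument constructor caps output at 100,000 characters)
    ContentHandler content = new BodyContentHandler(-1);
    parser.parse(stream, content, metadata, new ParseContext());
    System.out.println(content.toString());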
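
As background on why the comments go missing in the first place: Tika's XML handling is, as far as I know, built on SAX, and in SAX a ContentHandler only receives element text through characters(). Comments are reported separately, and only if a LexicalHandler is registered. The sketch below illustrates that split using nothing but the JDK. It is not Tika code, and the names SaxCommentDemo and Collector are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class: JDK-only, no Tika dependency.
public class SaxCommentDemo {

    /** Collects element text and comments in separate buffers. */
    static final class Collector extends DefaultHandler2 {
        final StringBuilder text = new StringBuilder();
        final StringBuilder comments = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Called for text between tags (ContentHandler callback)
            text.append(ch, start, length);
        }

        @Override
        public void comment(char[] ch, int start, int length) {
            // Called for <!-- ... --> (LexicalHandler callback)
            comments.append(ch, start, length);
        }
    }

    static Collector parse(String xml) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        XMLReader reader = saxParser.getXMLReader();
        Collector collector = new Collector();
        reader.setContentHandler(collector);
        // Without this property the parser never reports comments at all
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector;
    }

    public static void main(String[] args) throws Exception {
        Collector c = parse("<doc><!-- a note --><title>Hello</title></doc>");
        System.out.println("text:     [" + c.text + "]");
        System.out.println("comments: [" + c.comments + "]");
    }
}
```

A handler that only implements ContentHandler would see the title text and nothing else, which matches the behaviour described in the question. Reading the file as plain text, as the answer does with TXTParser, sidesteps the XML parser entirely.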
