How to extract main text from HTML using Tika

Views: 37 · Published: 2021/11/14 23:45:26 · Tags: html-parsing apache-tika boilerpipe

Question

How can I extract the main text and the plain text from HTML using Tika?

One possible solution may be to use BoilerpipeContentHandler, but does anyone have sample or demo code that shows how?

Thanks a lot.

Recommended answer

Here is a sample:

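import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Parses a file with Tika and returns its title (index 0) and body text (index 1).
// Index 2 is never filled and stays null, as in the original sample.
public String[] tika_autoParser() {
    String[] result = new String[3];
    // AutoDetectParser picks the parser from the content, so an HTML file
    // can be passed here the same way as this PDF.
    // try-with-resources closes the stream even if parsing fails.
    try (InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"))) {
        // Collects the plain text of the document body
        // (the default handler stops after 100,000 characters; pass -1 to lift the limit).
        ContentHandler textHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        parser.parse(input, textHandler, metadata, context);
        // TikaCoreProperties.TITLE replaces the deprecated Metadata.TITLE constant.
        result[0] = "Title: " + metadata.get(TikaCoreProperties.TITLE);
        result[1] = "Body: " + textHandler.toString();
    } catch (IOException | SAXException | TikaException e) {
        // FileNotFoundException is an IOException, so it is covered here too.
        e.printStackTrace();
    }

    return result;
}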
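The sample above returns all body text and does not use Boilerpipe, which is what the question asks about. The following is a minimal sketch of how BoilerpipeContentHandler could be wired in, assuming Tika 1.x with the tika-parsers artifact (and its Boilerpipe dependency) on the classpath. The class name MainTextExtractor and both method names are hypothetical, chosen only for illustration. In newer Tika releases BoilerpipeContentHandler may live in a different package, so the import may need adjusting.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler; // Tika 1.x package (assumption)
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of Tika.
public class MainTextExtractor {

    /** Returns only the text that Boilerpipe classifies as main content. */
    public static String extractMainText(InputStream html)
            throws IOException, SAXException, TikaException {
        // -1 disables BodyContentHandler's default 100,000-character limit.
        BodyContentHandler mainText = new BodyContentHandler(-1);
        // The Boilerpipe handler buffers the page, filters out boilerplate
        // blocks, and forwards what remains to the wrapped text handler.
        BoilerpipeContentHandler boilerpipe = new BoilerpipeContentHandler(mainText);
        new HtmlParser().parse(html, boilerpipe, new Metadata(), new ParseContext());
        return mainText.toString();
    }

    /** Returns all body text, with no boilerplate filtering. */
    public static String extractPlainText(InputStream html)
            throws IOException, SAXException, TikaException {
        BodyContentHandler plainText = new BodyContentHandler(-1);
        new HtmlParser().parse(html, plainText, new Metadata(), new ParseContext());
        return plainText.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline page for demonstration; real pages are usually read from a file or URL.
        byte[] page = ("<html><head><title>Demo</title></head><body>"
                + "<div><a href=\"/\">Home</a> <a href=\"/about\">About</a></div>"
                + "<p>Apache Tika detects and extracts text and metadata from many "
                + "document formats, and its HTML parser can be combined with "
                + "Boilerpipe to keep only the main article text of a web page.</p>"
                + "</body></html>").getBytes(StandardCharsets.UTF_8);

        System.out.println("Plain text: " + extractPlainText(new ByteArrayInputStream(page)));
        System.out.println("Main text:  " + extractMainText(new ByteArrayInputStream(page)));
    }
}
```

Boilerpipe works on heuristics such as block length and link density, so on very short pages it may keep little or nothing. Comparing both outputs on a real page is a quick way to check whether the filtering suits the content.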
