Text extraction with Java HTML parsers


Problem description

I want to use an HTML parser that does the following in a clean, elegant way:

1. Extract text (this is the most important)
2. Extract links and meta keywords
3. Reconstruct the original document (optional, but a nice feature to have)

From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?

Solution

I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.

I would recommend CyberNekoHtml. It can do all of the things you mentioned. For example, it is very easy to extract a list of all elements and their attributes. If you wanted to reconstruct the page, you could traverse the DOM tree and build each element back into HTML.
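
The DOM traversal described above can be sketched with the JDK's `org.w3c.dom` API alone. This is only an illustration under some assumptions: the class `DomWalkSketch` and its method names are made up for this example, and the input is parsed with the JDK's XML parser from a well-formed XHTML string as a stand-in. With CyberNekoHtml you would instead obtain the `org.w3c.dom.Document` from its DOM parser and pass it to the same methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomWalkSketch {
    // Elements written without a closing tag in HTML (partial list, for illustration).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    // Stand-in for an HTML parser: only accepts well-formed XHTML.
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    // 1. Extract text: concatenate every text node in document order.
    static String text(Node node) {
        StringBuilder sb = new StringBuilder();
        appendText(node, sb);
        return sb.toString();
    }

    private static void appendText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) sb.append(node.getNodeValue());
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) appendText(c, sb);
    }

    // 2a. Extract links: the href of every <a> element that has one.
    static List<String> links(Node node) {
        List<String> out = new ArrayList<>();
        collectLinks(node, out);
        return out;
    }

    private static void collectLinks(Node node, List<String> out) {
        if (node instanceof Element && "a".equalsIgnoreCase(node.getNodeName())
                && ((Element) node).hasAttribute("href")) {
            out.add(((Element) node).getAttribute("href"));
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) collectLinks(c, out);
    }

    // 2b. Extract meta keywords: content of the first <meta name="keywords">, or null.
    static String metaKeywords(Node node) {
        if (node instanceof Element && "meta".equalsIgnoreCase(node.getNodeName())
                && "keywords".equalsIgnoreCase(((Element) node).getAttribute("name"))) {
            return ((Element) node).getAttribute("content");
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String found = metaKeywords(c);
            if (found != null) return found;
        }
        return null;
    }

    // 3. Reconstruct markup by walking the tree and writing each node back out.
    static String toHtml(Node node) {
        StringBuilder sb = new StringBuilder();
        write(node, sb);
        return sb.toString();
    }

    private static void write(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(escape(node.getNodeValue()));
            return;
        }
        boolean isElement = node.getNodeType() == Node.ELEMENT_NODE;
        if (isElement) {
            sb.append('<').append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                sb.append(' ').append(a.getNodeName())
                  .append("=\"").append(escape(a.getNodeValue())).append('"');
            }
            sb.append('>');
        }
        // Non-element nodes such as the Document itself just pass through to their children.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) write(c, sb);
        if (isElement && !VOID.contains(node.getNodeName().toLowerCase())) {
            sb.append("</").append(node.getNodeName()).append('>');
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Note that the rebuilt markup is equivalent to the parsed tree, not byte-identical to the source: comments, doctype, original whitespace inside tags, and attribute order are not preserved by this sketch.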

There's a list of open source Java HTML parsers here: http://java-source.net/open-source/html-parsers
