Is there a library similar to lxml or nokogiri for Java?


Question

I want to do some screen scraping in Java, ideally using CSS selectors rather than XPath. Is there a library similar to the ones available in Ruby (nokogiri) or Python (lxml)?

Recommended answer

There are dozens of screen scraping libraries written in Java. To cite just a few:



  • TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
  • Jericho HTML Parser - Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event-based nor a tree-based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the relevant characters of each search operation.
  • HTML Cleaner - HtmlCleaner reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those most web browsers use to create the document object model. A user may provide a custom tag and rule set for tag filtering and balancing.
  • NekoHTML - NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
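Several of the libraries above expose HTML through standard XML interfaces such as SAX. A minimal sketch of what that looks like in practice follows. It uses only the JDK's own SAX parser, which accepts well-formed markup only, so the sample input here is deliberately clean. The class name SaxLinkExtractor and its two methods are made up for this illustration. The assumption is that a lenient parser implementing org.xml.sax.XMLReader (which is what a SAX-compliant HTML parser provides) could be passed in instead of the JDK reader to handle messy real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of any library named above.
class SaxLinkExtractor {

    // Collects the href attribute of every <a> element the reader reports.
    // Any XMLReader can be supplied, so the handler code does not depend on
    // which parser produced the SAX events.
    static List<String> extractLinks(XMLReader reader, String markup) throws Exception {
        List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("a".equalsIgnoreCase(qName)) {
                    String href = attrs.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return links;
    }

    // The JDK's built-in reader: strict, rejects malformed markup.
    static XMLReader jdkReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>"
                + "<a href=\"/one\">One</a> <a href=\"/two\">Two</a>"
                + "</p></body></html>";
        System.out.println(extractLinks(jdkReader(), xhtml)); // prints [/one, /two]
    }
}
```

The point of the design is that the extraction logic is written once against the SAX ContentHandler interface, and only the reader passed to extractLinks changes when moving from clean XML to tag soup.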

There are many more listed at HTML Screen Scraping Tools written in Java. But these are, IMO, the best at dealing with any kind of content (read: all kinds of crap), as I mentioned in this previous answer. This might not be an issue for you, though.

Just in case, you may also want to check out the thread "Nokogiri pure Java status".

Update: A new project, jsoup, has been released (on 2010-01-31). It offers a selector syntax to find elements. See its website and/or this answer from its author for more details.
