Parsing web pages


Question

I have a question about parsing HTML pages, specifically forums. I want to parse a forum or thread containing posts that match certain criteria. I haven't defined the algorithm yet, since I have only parsed structured text formats before. A use case might be to copy and paste each thread into the program by hand, or to enter a URL like http://www.forums.com/forum/showthread.php?t=46875&page=3 and let the program parse the pages.

Given all this, I would like to know:

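  1. Is it possible to parse forum threads from HTML pages?
  2. What is the best/fastest/easiest language for this?
  3. If I prefer Java, what tools/libraries do I need?
  4. Is there anything else I should consider?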

Recommended answer

1 / Yes.

2 / Use a compact language like Python or Ruby for prototyping.

  • For Python, there is a neat library for HTML/XML parsing called Beautiful Soup.

  • For Ruby, you could try nokogiri or hpricot.
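As a rough illustration of the parser-based route in Python, the sketch below uses the standard-library `html.parser` module instead of Beautiful Soup, so that it runs without installing anything. The sample markup, the `post` and `author` class names, and the `Post` model are all assumptions made up for this example; a real forum will have its own structure.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Post:
    """Hypothetical model holding one forum post."""
    author: str = ""
    text: str = ""


class PostExtractor(HTMLParser):
    """Collects posts from markup shaped like the assumed SAMPLE below."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0          # <div> nesting depth inside the current post
        self._in_author = False  # True while inside <span class="author">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._depth:
                self._depth += 1          # nested <div> inside a post
            elif "post" in classes:
                self.posts.append(Post())  # a new post begins
                self._depth = 1
        elif tag == "span" and self._depth and "author" in classes:
            self._in_author = True

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
        elif tag == "span":
            self._in_author = False

    def handle_data(self, data):
        if not self._depth:
            return  # text outside any post is ignored
        post = self.posts[-1]
        if self._in_author:
            post.author += data.strip()
        else:
            post.text = (post.text + " " + data.strip()).strip()


# Assumed markup; note the second post has an extra class and an unclosed <p>.
SAMPLE = """
<div class="post"><span class="author">alice</span>
  <p>How do I parse a forum?</p></div>
<div class="post highlighted"><span class="author">bob</span>
  <p>Use a parser.<br>Regexes break easily.</div>
"""

extractor = PostExtractor()
extractor.feed(SAMPLE)
extractor.close()
for post in extractor.posts:
    print(post.author, "->", post.text)
```

The extractor tolerates the extra class and the unclosed `<p>` in the second post, which is the kind of irregularity that tends to break hand-written string matching.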

3 / A Java tool to consider: htmlparser

4 / If you are interested only in some particular text or some special classes, a regular expression might be sufficient. But as soon as you want to dig deeper into the structure of the content, you'll need some kind of model to hold your data, and hence a parser, which, in the best case, can cope with the inconsistencies that occur in real-world HTML.
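For the simple case, a regular expression can pull out elements that carry one particular class. The following Python sketch assumes hypothetical markup in which each author name sits in a `<span class="author">` element with no other attributes and no nested tags; the pattern and the sample are illustrative only.

```python
import re

# Hypothetical markup; real forum pages will differ.
SAMPLE = (
    '<div class="post"><span class="author">alice</span><p>First post</p></div>'
    "<div class='post'><SPAN CLASS='author'>bob</SPAN><p>Reply</p></div>"
)

# Matches <span class="author">...</span> with either quote style, in any
# letter case, and with no nested tags inside the span.
AUTHOR_RE = re.compile(
    r"""<span\s+class=(["'])author\1\s*>([^<]*)</span\s*>""",
    re.IGNORECASE,
)

authors = [match.group(2).strip() for match in AUTHOR_RE.finditer(SAMPLE)]
print(authors)
```

This pattern stops matching as soon as the span gains a second attribute, a second class, or a nested tag, which is roughly the point at which a parser becomes the safer choice.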
