Extracting Information from Websites


Question


Not every website exposes its data well through XML feeds, APIs, and so on.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a Java background and have worked with Apache XMLBeans. Is there anything similar for parsing HTML when I know the structure and the data sits inside a known tag?

Thanks

Answer

There are several open-source HTML parsers available for Java.

I have used JTidy in the past and have had good luck with it. It will give you a DOM of the HTML page, and you should be able to grab the tags you need from there.
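
As a rough illustration of the "grab the tags you need" step, here is a minimal sketch that looks up the span from the question by its id using only the standard org.w3c.dom interfaces. It is not taken from the answer: the class name SpanExtractor and the helpers parseWellFormed and textOf are made up for this example, and the JDK's own XML parser stands in for JTidy so the snippet runs without extra libraries. That substitution only works because the sample markup is already well formed. For real-world HTML the Document would come from a tolerant parser instead (with JTidy, as far as I know, from Tidy.parseDOM).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class SpanExtractor {

    // Hypothetical helper: builds a DOM from markup that is ALREADY well formed.
    // Assumption: a tolerant HTML parser such as JTidy would replace this step
    // for messy real-world pages.
    static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    }

    // Hypothetical helper: returns the trimmed text directly inside the first
    // <tagName> element whose id attribute equals the given id, or null if
    // no such element exists.
    static String textOf(Document doc, String tagName, String id) {
        NodeList candidates = doc.getElementsByTagName(tagName);
        for (int i = 0; i < candidates.getLength(); i++) {
            Element el = (Element) candidates.item(i);
            if (id.equals(el.getAttribute("id"))) {
                StringBuilder text = new StringBuilder();
                // Collect only the direct text children of the matching element.
                for (Node child = el.getFirstChild(); child != null; child = child.getNextSibling()) {
                    if (child.getNodeType() == Node.TEXT_NODE) {
                        text.append(child.getNodeValue());
                    }
                }
                return text.toString().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String page = "<div><div><span id=\"important-data\">information here</span></div></div>";
        Document doc = parseWellFormed(page);
        System.out.println(textOf(doc, "span", "important-data")); // prints: information here
    }
}
```

The lookup walks the elements by tag name and compares the id attribute by hand rather than calling getElementById, since getElementById only matches attributes that the parser has declared to be of type ID.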
