Parsing webpages to extract contents


Problem description

I want to design a crawler in Java that crawls a webpage and extracts certain contents of the page. How should I do this? I am new to this and need guidance to get started designing crawlers.

For example, I want to access the content "red is my favorite color" from a webpage where it is embedded something like this:

<div>red is my favorite color</div>

Solution

Suggested readings

Static pages:

Mind you, many pages create content dynamically using JavaScript after loading. In such cases the 'static page' approach won't help; you will need to look for tools in the "Web automation" category.
Selenium is such a toolset. You can use it to drive a common browser to open and navigate pages, and you may even be able to use a 'headless browser' (no UI) such as PhantomJS.

Good luck, there's lots of reading and coding ahead of you.

[edited for examples]

This technique is called Web scraping; search for that term on Google to find examples. The following are offered as examples of what turned up in my searches; I offer no warranties or endorsements for them.

For "static Webpage scraping" - here's an example using jsoup
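To make the static case concrete, here is a minimal sketch of my own (not the linked example), assuming the jsoup library is on the classpath. The class name DivTextExtractor and the method extractDivTexts are hypothetical names chosen for illustration. It parses an HTML string and collects the text of every div element:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class for illustration; requires the jsoup jar.
public class DivTextExtractor {

    // Returns the text of every <div> that has text, in document order.
    // Note: the text of an outer <div> includes the text of any nested <div>.
    static List<String> extractDivTexts(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div").eachText();
    }

    public static void main(String[] args) {
        String html = "<html><body><div>red is my favorite color</div></body></html>";
        for (String text : extractDivTexts(html)) {
            System.out.println(text);
        }
    }
}
```

To read a live page instead of a string, jsoup can fetch and parse in one step with Jsoup.connect(url).get(). In practice you would probably narrow the "div" selector to something more specific, such as a class or id, so that you only pick up the element you care about.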

For "dynamic pages" - here's an example using Selenium
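For the dynamic case, here is a rough sketch of my own (again, not the linked example), assuming Selenium 4 for Java is on the classpath and a local Chrome installation is available. It uses headless Chrome rather than PhantomJS, since PhantomJS is no longer actively developed. The class name DynamicDivExtractor and its methods are hypothetical, and the URL is whatever you pass on the command line:

```java
import java.time.Duration;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

// Hypothetical helper class for illustration; requires Selenium 4 and Chrome.
public class DynamicDivExtractor {

    // Builds Chrome options that run the browser without a visible window.
    static ChromeOptions headlessOptions() {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        return options;
    }

    // Opens the page, waits until a <div> is visible, and returns its text.
    static String firstDivText(String url, Duration timeout) {
        WebDriver driver = new ChromeDriver(headlessOptions());
        try {
            driver.get(url);
            WebElement div = new WebDriverWait(driver, timeout)
                    .until(ExpectedConditions.visibilityOfElementLocated(By.cssSelector("div")));
            return div.getText();
        } finally {
            driver.quit(); // always close the browser, even if the wait times out
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: DynamicDivExtractor <url>");
            return;
        }
        System.out.println(firstDivText(args[0], Duration.ofSeconds(10)));
    }
}
```

The explicit wait is the point of the exercise: content created by JavaScript may not exist yet when the page first loads, so the code polls until the element shows up or the timeout expires.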
