I need a Powerful Web Scraper library


Problem description

I need a powerful web scraper library for mining content from the web. It can be paid or free; either is fine for me. Please suggest a library, or a better way to mine the data and store it in my preferred database. I have searched but didn't find any good solution for this. I need a good suggestion from experts. Please help me out.

Recommended answer

Scraping is really easy: you just have to parse the content you are downloading and collect all the associated links.
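
Concretely, a crawl loop along those lines might look like the sketch below. The DownloadContent, Store, and ExtractLinks helpers are hypothetical placeholders; the rest of this answer suggests concrete ways to fill in the parsing and downloading steps.

using System;
using System.Collections.Generic;

class CrawlerSketch
{
    // Hypothetical placeholders -- concrete sketches for link extraction
    // and downloading follow later in this answer.
    static string DownloadContent(string url) { throw new NotImplementedException(); }
    static void Store(string url, string html) { /* write to your preferred database */ }
    static IEnumerable<string> ExtractLinks(string html, string baseUrl) { yield break; }

    static void Crawl(string startUrl, int maxPages)
    {
        var pending = new Queue<string>();
        var visited = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        pending.Enqueue(startUrl);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url))
                continue; // already processed this URL

            string html = DownloadContent(url);
            Store(url, html);

            // Queue every link found on the page for a later visit.
            foreach (string link in ExtractLinks(html, url))
                pending.Enqueue(link);
        }
    }
}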

The most important piece, though, is the part that processes the HTML. Because most browsers don't require the cleanest (or most standards-compliant) HTML in order to render a page, you need an HTML parser that is able to make sense of HTML that is not always well-formed.

I recommend you use the HTML Agility Pack for this purpose. It does very well at handling non-well-formed HTML, and provides an easy interface for you to use XPath queries to get nodes in the resulting document.
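
As a rough sketch of what that looks like with the HTML Agility Pack (HtmlDocument, SelectNodes, and GetAttributeValue are standard HAP calls; the ExtractLinks wrapper name simply mirrors the placeholder in the loop above):

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkExtractor
{
    public static IEnumerable<string> ExtractLinks(string html, string baseUrl)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of badly formed markup

        // SelectNodes returns null when the XPath query matches nothing.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            yield break;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);

            // Resolve relative links against the page that contained them.
            if (Uri.TryCreate(new Uri(baseUrl), href, out Uri absolute))
                yield return absolute.ToString();
        }
    }
}

The XPath here only pulls anchor tags, but the same SelectNodes call works for whatever nodes you actually want to mine from each page.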

Beyond that, you just need to pick a data store to hold your processed data (you can use any database technology for that) and a way to download content from the web, for which .NET provides two high-level mechanisms: the WebClient and HttpWebRequest/HttpWebResponse classes.
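
For the download step, either class works; a minimal sketch of both is below (the user-agent string and timeout value are just illustrative):

using System.IO;
using System.Net;

class Downloader
{
    // Simplest option: one call per page.
    public static string DownloadWithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(url);
        }
    }

    // Lower-level option: gives you control over headers, timeouts, etc.
    public static string DownloadWithHttpWebRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "MyCrawler/1.0"; // illustrative value
        request.Timeout = 10000;             // milliseconds

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}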
