I need a powerful web scraper library

Question

I need a powerful web scraper library for mining content from the web. It can be paid or free; either is fine for me. Please suggest a library, or a better way to mine the data and store it in my preferred database. I have searched but didn't find a good solution. I would appreciate a good suggestion from experts. Please help me out.

Recommended answer

Scraping is fairly easy: you just have to parse the content you are downloading and extract all the associated links.

The most important piece, though, is the part that processes the HTML. Because most browsers will render HTML that is not clean (or standards-compliant), you need an HTML parser that can make sense of HTML that is not always well-formed.

I recommend you use the HTML Agility Pack for this purpose. It does very well at handling non-well-formed HTML, and provides an easy interface for you to use XPath queries to get nodes in the resulting document.
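
To make the "parse the content and get the links" step concrete, here is a minimal, language-neutral sketch in Python using only the standard library. It is not the HTML Agility Pack and it does not use XPath; it only illustrates pulling links out of markup that is not well-formed. The names `LinkExtractor` and `extract_links` and the example URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only anchors are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all anchor targets found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately sloppy markup: unclosed <p>, <a> and <b>, unquoted attribute.
messy = "<p>See <a href=/docs>docs<p><b>and <a href='http://other.example/x'>x</a>"
print(extract_links(messy, "http://example.com/start"))
```

The parser here is event-based and never builds a document tree, which is why it tolerates unclosed tags. A tree-building parser with XPath support, as described above, is more convenient once you need to select specific nodes rather than every link.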

Beyond that, you just need to pick a data store to hold your processed data (you can use any database technology for that) and a way to download content from the web. For the latter, .NET provides two high-level mechanisms: the WebClient class and the HttpWebRequest/HttpWebResponse classes.
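
The download-and-store half can be sketched the same way. This is an illustrative Python stand-in, not the .NET classes named above: `urllib.request` plays the role of the HTTP client, and SQLite is an assumed example of "any database technology". The table name `pages` and the helper names are hypothetical.

```python
import sqlite3
from urllib.request import urlopen


def download(url, timeout=10):
    """Fetch a page and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def open_store(path=":memory:"):
    """Open (or create) the SQLite store; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, body):
    """Store one page; re-scraping the same URL overwrites the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)", (url, body)
    )
    conn.commit()
```

A scraping loop would then call something like `save_page(conn, url, download(url))` for each URL it visits, and pass the downloaded body to the parser to find the next URLs.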
