Screen Scraping, Web Scraping, Web Harvesting, Web Data Extraction, etc. using C# and the .NET Framework

Question

I am working on a Microsoft .NET application in C# for web harvesting, web scraping, web data extraction, screen scraping, or whatever you want to call it. For parsing HTML, I'm attempting to incorporate HTML Agility Pack, but it's not as easy as I thought it would be. I have included some specifications and images of what I have so far and was hoping to get your opinions on how I could proceed. Basically, I want to do something similar to the layout used in Visual Web Ripper, but I have no idea how they do it. Any ideas?

Images:

http://img69.imageshack.us/img69/8880/webharvester1.png

http://img198.imageshack.us/img198/9563/webharvester2.png

Description:

My goal is to make a very user-friendly point-and-click application for downloading data and images from the web. I would like to load HTML pages using the web browser and output the parsed data and image links into the text box. The user can specify which HTML tags they want and then download the data into the grid. Finally, they can export the data into whatever format they need.

I'm trying to use HTML Agility Pack to load the HTML of the webpage and display it in the textbox.

    // Load Web Browser
    private void Form6_Load(object sender, EventArgs e)
    {
        // Navigate to webpage
        webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");

        // Save URL to memory
        SiteMemoryArray[count] = urlTextBox.Text; 

        // Load HTML from webBrowser
        HtmlWindow window = webBrowser.Document.Window; 
        string str = window.Document.Body.OuterHtml;

        // Extract tags using HtmlAgilityPack and display in textbox
        HtmlAgilityPack.HtmlDocument HtmlDoc = new HtmlAgilityPack.HtmlDocument();
        HtmlDoc.LoadHtml(str);

        HtmlAgilityPack.HtmlNodeCollection Nodes = HtmlDoc.DocumentNode.SelectNodes("//a");

        foreach (HtmlAgilityPack.HtmlNode Node in Nodes)
        {
            textBox2.Text += Node.OuterHtml + "\r\n";
        }

    }

On the line: HtmlWindow window = webBrowser.Document.Window;

I get the error: Object reference not set to an instance of an object.

Recommended answer

The page has probably not finished loading when you reference the browser window, so webBrowser.Document is likely still null at that point. Navigate returns before the page is loaded, so have the browser control notify you when navigation is complete (for the Windows Forms WebBrowser control, that is the DocumentCompleted event) and do the parsing there. See this SO answer for an example: C# how to wait for a webpage to finish loading before continuing (http://stackoverflow.com/questions/583897/c-sharp-how-to-wait-for-a-webpage-to-finish-loading-before-continuing)
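As a rough illustration of that suggestion, here is a minimal sketch that moves the parsing from the form's Load handler into a DocumentCompleted handler. It assumes a Windows Forms project that references the HtmlAgilityPack package. The control names webBrowser and textBox2 and the URL come from the question; the class name ScraperForm, the handler name, and the layout are made up for the example. It is one possible arrangement, not necessarily how the linked answer does it.

```csharp
using System;
using System.Text;
using System.Windows.Forms;

// Hypothetical form for illustration; only webBrowser and textBox2 are names from the question.
public class ScraperForm : Form
{
    private readonly WebBrowser webBrowser = new WebBrowser { Dock = DockStyle.Top, Height = 300 };
    private readonly TextBox textBox2 = new TextBox
    {
        Multiline = true,
        Dock = DockStyle.Fill,
        ScrollBars = ScrollBars.Vertical
    };

    public ScraperForm()
    {
        Controls.Add(textBox2);
        Controls.Add(webBrowser);

        // Subscribe before navigating so the completion event is not missed.
        webBrowser.DocumentCompleted += WebBrowser_DocumentCompleted;

        // Navigate only starts the load; it returns before the document exists.
        Load += (sender, e) => webBrowser.Navigate("http://www.webopedia.com/TERM/H/HTML.html");
    }

    private void WebBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The event can fire once per frame; react only to the top-level document.
        if (e.Url != webBrowser.Url) return;

        // Guard against a missing document or body instead of dereferencing blindly.
        if (webBrowser.Document == null || webBrowser.Document.Body == null) return;

        string html = webBrowser.Document.Body.OuterHtml;

        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlAgilityPack.HtmlNodeCollection nodes = htmlDoc.DocumentNode.SelectNodes("//a");
        if (nodes == null) return;

        // Build the text once rather than appending to the TextBox in a loop.
        var output = new StringBuilder();
        foreach (HtmlAgilityPack.HtmlNode node in nodes)
        {
            output.AppendLine(node.OuterHtml);
        }
        textBox2.Text = output.ToString();
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new ScraperForm());
    }
}
```

The null check on the result of SelectNodes is worth keeping even after the timing problem is fixed, since a page without any matching tags would otherwise raise the same "Object reference not set to an instance of an object" error inside the foreach loop.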
