Scraping dynamic web content in C#


Question

Is it possible to scrape data that a dynamic web page generates? For example, this website generates a <font> tag with some JavaScript:

document.write("<font class=spy2>:<\/font>"+(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9))

The values change on every page refresh. Each generated code represents a digit from 0 to 9, for example (code1)+(code2)+(code3)+(code4), and some kind of parser on the back end understands these codes and generates the digits accordingly.

Once the page is rendered, if code1 was assigned to the digit 4, for example, then wherever the digit 4 appears it comes from that code after it has been parsed.
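In JavaScript the `^` operator is bitwise XOR, so each `(a^b)` term in the `document.write` call evaluates to an integer once both variables are known. The following is a minimal sketch of that evaluation in C#, assuming the variable values have already been obtained somehow; the values in `Main` are hypothetical and chosen only so the arithmetic is easy to follow, and `XorDigitDecoder` is an illustrative name, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class XorDigitDecoder
{
    // Matches one "(name1^name2)" term, e.g. "(v2j0j0^o5r8)".
    private static readonly Regex Term =
        new Regex(@"\((\w+)\^(\w+)\)", RegexOptions.Compiled);

    // Evaluates every (a^b) term in the script text and concatenates the results.
    // 'variables' must hold the numeric value of each identifier; how those
    // values are read from the page is outside the scope of this sketch.
    public static string Decode(string script, IDictionary<string, int> variables)
    {
        var digits = new StringBuilder();
        foreach (Match m in Term.Matches(script))
        {
            int left = variables[m.Groups[1].Value];
            int right = variables[m.Groups[2].Value];
            digits.Append(left ^ right);
        }
        return digits.ToString();
    }

    public static void Main()
    {
        // Hypothetical values, for illustration only.
        var variables = new Dictionary<string, int>
        {
            ["v2j0j0"] = 13, ["o5r8"] = 5,   // 13 ^ 5  = 8
            ["r8d4x4"] = 9,  ["y5i9"] = 9,   // 9  ^ 9  = 0
            ["b2r8e5"] = 6,  ["u1p6"] = 14,  // 6  ^ 14 = 8
        };
        string script =
            "document.write(\"<font class=spy2>:<\\/font>\"+(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9))";

        Console.WriteLine(Decode(script, variables)); // prints 8080
    }
}
```

Because the values change on every refresh, a decoder like this would have to re-read the variable assignments from each freshly downloaded page; letting a real browser engine run the script avoids that work.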

With HtmlAgilityPack we see the JavaScript code but not the output it generates. Is there any way to read the tag it creates once the page has been rendered?

Recommended answer

Thanks for pointing that out. I tried implementing it and saw the same results, but then I read one more comment suggesting the IE engine, so I switched and built a small application that does the job. I added an IE-based WebBrowser control, navigated it to the website, and read the content. Here is the code:

    // Requires: using System.Linq; (for Any)
    private void webBrowser1_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
    {
        // The browser engine has executed the page's inline scripts by now, so the
        // <font> elements written by document.write are part of the DOM.
        System.Windows.Forms.HtmlElementCollection elementsforViewPost =
            webBrowser1.Document.GetElementsByTagName("font");

        foreach (System.Windows.Forms.HtmlElement current2 in elementsforViewPost)
        {
            if (current2.InnerText == null) continue;
            string address = current2.InnerText.Trim();

            // Keep only text that looks like a proxy address and is not stored yet.
            if (CheckForValidProxyAddress(address) &&
                !ObtainedProxies.Any(p => p.ProxyAddress == address))
            {
                Proxy data = new Proxy();
                data.IsRetired = false;
                data.IsActive = true;
                data.DomainsVisited = 0;
                data.ProxyAddress = address;

                ObtainedProxies.Add(data);
            }
        }
    }
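The handler above needs a WebBrowser control that has been navigated to the page. A minimal sketch of that wiring follows, assuming a Windows Forms project; the URL is a placeholder, `RenderedPageReader` is an illustrative name, and the frame check is one common way to ignore the extra `DocumentCompleted` events that pages with frames can raise.

```csharp
using System;
using System.Windows.Forms;

public static class RenderedPageReader
{
    [STAThread] // the WebBrowser control must be created on an STA thread
    public static void Main()
    {
        var browser = new WebBrowser
        {
            Dock = DockStyle.Fill,
            ScriptErrorsSuppressed = true // avoid script error dialogs while scraping
        };

        browser.DocumentCompleted += (sender, e) =>
        {
            // DocumentCompleted can fire once per frame; act only on the main document.
            if (e.Url != browser.Url) return;

            // Scripts have run, so elements created by document.write are in the DOM.
            foreach (HtmlElement font in browser.Document.GetElementsByTagName("font"))
            {
                if (!string.IsNullOrWhiteSpace(font.InnerText))
                    Console.WriteLine(font.InnerText.Trim());
            }
        };

        var form = new Form { Text = "Rendered page reader" };
        form.Controls.Add(browser);

        // Placeholder address (an assumption); replace with the page to scrape.
        form.Load += (s, e) => browser.Navigate("http://example.com/");

        Application.Run(form);
    }
}
```

Content that scripts add after the load completes, for example through timers or AJAX calls, may not be present yet when `DocumentCompleted` fires.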

To check that the received text is a valid proxy address, here is what I used. I found it on some page long ago by googling:

    // Requires: using System.Text.RegularExpressions;
    private bool CheckForValidProxyAddress(string address)
    {
        // IPv4 address (each octet 0-255), a colon, then a port of 1 to 5 digits.
        // The pattern is not anchored, so it matches anywhere inside the text.
        string pattern = @"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b:[0-9]{1,5}";

        // An empty or missing string cannot be an address.
        if (string.IsNullOrEmpty(address))
        {
            return false;
        }

        // True when the text contains an "ip:port" pair.
        return Regex.IsMatch(address, pattern);
    }
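To see how a check of this kind behaves, here is a small standalone sketch. It uses an anchored variant of the pattern (an assumption, not the answer's exact expression) so that the whole string must be an `ip:port` pair; `ProxyAddressCheck` and `IsProxyAddress` are illustrative names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProxyAddressCheck
{
    // Anchored variant: four octets in the range 0-255, a colon, then a 1-5 digit port.
    private static readonly Regex Pattern = new Regex(
        @"^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}:[0-9]{1,5}$");

    public static bool IsProxyAddress(string text)
    {
        // Null, empty, and whitespace-only input is rejected before matching.
        return !string.IsNullOrWhiteSpace(text) && Pattern.IsMatch(text.Trim());
    }

    public static void Main()
    {
        string[] samples = { "192.168.1.10:8080", "256.1.1.1:80", "10.0.0.1", "10.0.0.1:" };
        foreach (string sample in samples)
        {
            // Only the first sample passes: the others have an octet above 255,
            // no port separator, or an empty port.
            Console.WriteLine(sample + " -> " + IsProxyAddress(sample));
        }
    }
}
```

The port part only counts digits, so a value such as 99999 still passes even though valid ports end at 65535; a numeric range check would be needed to rule that out.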
