Scraping Dynamic Content

Question

Is it possible to scrape data generated by a dynamic web page? For example, this website generates a <font> tag with some JavaScript like this:

document.write("<font class=spy2>:<\/font>"+(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9))

The values change on each page refresh. Each generated code represents a digit from 0 to 9, for example (code1)+(code2)+(code3)+(code4), and some kind of parser on the back end understands these codes and generates the digits accordingly.

Once the page is rendered and, for example, code1 has been assigned somewhere to the digit 4, then wherever the digit 4 appears it comes from that code after it has been parsed.
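To make the encoding concrete, here is a minimal sketch of how such an expression could be decoded offline. It assumes each name in the document.write call is a script variable holding a small integer, so that every (a^b) term XORs to one digit. The class name XorTermDecoder and the variable values below are hypothetical; the real values would have to be read from the page's script on every refresh.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class XorTermDecoder
{
    // Matches one "(name1^name2)" term of the obfuscated expression.
    private static readonly Regex Term = new Regex(@"\((\w+)\^(\w+)\)");

    // XORs the two variables of each term and concatenates the results,
    // mirroring what JavaScript string concatenation does in the page.
    public static string Decode(string expression, IDictionary<string, int> variables)
    {
        var digits = new StringBuilder();
        foreach (Match m in Term.Matches(expression))
        {
            int left = variables[m.Groups[1].Value];
            int right = variables[m.Groups[2].Value];
            digits.Append(left ^ right);
        }
        return digits.ToString();
    }

    public static void Main()
    {
        // Hypothetical values, chosen only so the example is runnable.
        var variables = new Dictionary<string, int>
        {
            { "v2j0j0", 5 }, { "o5r8", 13 },
            { "r8d4x4", 3 }, { "y5i9", 3 },
            { "b2r8e5", 9 }, { "u1p6", 1 }
        };
        string expr = "(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9)";
        Console.WriteLine(Decode(expr, variables)); // prints 8080
    }
}
```

Decoding by hand like this is brittle because the variable names and values change on every refresh, which is why letting a real browser engine execute the script is usually the simpler route.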

If we use HtmlAgilityPack, we see the JavaScript code but not the output it generates. Is there any way to read the tag it creates when the page is rendered?

Recommended answer

Thanks for pointing that out. I tried implementing it and got the same results, but then I saw another comment suggesting the IE engine, so I wrote a small application that does the job. I added an IE-based WebBrowser control, navigated it to the website, and read the content. Here is the code:

 // Requires: using System.Linq;
 private void webBrowser1_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
 {
     // DocumentCompleted fires after the page has loaded, so the <font> tags
     // written by the page's inline script are present in the DOM.
     System.Windows.Forms.HtmlElementCollection elementsforViewPost =
         webBrowser1.Document.GetElementsByTagName("font");
     foreach (System.Windows.Forms.HtmlElement current2 in elementsforViewPost)
     {
         // Keep only text that looks like a proxy address and is not collected yet.
         if (current2.InnerText != null && CheckForValidProxyAddress(current2.InnerText) &&
             !ObtainedProxies.Any(index => index.ProxyAddress == current2.InnerText.Trim()))
         {
             Proxy data = new Proxy();
             data.IsRetired = false;
             data.IsActive = true;
             data.DomainsVisited = 0;
             data.ProxyAddress = current2.InnerText.Trim();

             ObtainedProxies.Add(data);
         }
     }
 }
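The handler only runs once the control has been created and pointed at a page. A minimal sketch of that wiring in a Windows Forms application might look like the following. The form name ScraperForm and the URL are placeholders rather than part of the original answer, and the per-frame check is an added precaution because DocumentCompleted can fire once per frame.

```csharp
using System;
using System.Windows.Forms;

public class ScraperForm : Form
{
    private readonly WebBrowser webBrowser1 = new WebBrowser();

    public ScraperForm()
    {
        webBrowser1.Dock = DockStyle.Fill;
        // Avoid script error dialogs interrupting an unattended run.
        webBrowser1.ScriptErrorsSuppressed = true;
        webBrowser1.DocumentCompleted += webBrowser1_DocumentCompleted;
        Controls.Add(webBrowser1);

        // Start navigating once the form is loaded. Placeholder URL.
        Load += (s, e) => webBrowser1.Navigate("http://example.com/proxy-list");
    }

    private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Ignore completion events raised for frames inside the page.
        if (e.Url != webBrowser1.Url) return;

        foreach (HtmlElement font in webBrowser1.Document.GetElementsByTagName("font"))
        {
            if (font.InnerText != null)
                Console.WriteLine(font.InnerText.Trim());
        }
    }

    [STAThread]
    public static void Main()
    {
        // The WebBrowser control requires a single-threaded apartment.
        Application.EnableVisualStyles();
        Application.Run(new ScraperForm());
    }
}
```

Script output that is added later by timers or AJAX calls may not be in the DOM yet when DocumentCompleted fires, so pages of that kind would need an extra wait before reading the elements.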

To check that the received text is a valid proxy address, here is what I used. I found it on some page long ago by googling:

    // Requires: using System.Text.RegularExpressions;
    private bool CheckForValidProxyAddress(string address)
    {
        // Match pattern: four octets in the range 0-255, a colon, then a port.
        // The port part requires 1 to 5 digits; the earlier {0,4} accepted an
        // empty port and rejected five-digit ports.
        string pattern = @"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b\:[0-9]{1,5}";
        // Create the regular expression object.
        Regex check = new Regex(pattern);

        // No address provided, so it cannot be valid.
        if (string.IsNullOrEmpty(address))
        {
            return false;
        }

        // Address provided, so test it against the pattern.
        return check.IsMatch(address);
    }
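The pattern above is unanchored, so it accepts any text that merely contains an address. If stricter checking is wanted, one possible variant, offered as a sketch and not as the answer's method, anchors the match to the whole string and also rejects ports outside 1-65535. The class name ProxyAddressValidator and the method IsValid are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProxyAddressValidator
{
    // One octet in the range 0-255, without leading zeros.
    private const string Octet = @"(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])";

    // Whole-string match: four dotted octets, a colon, then 1 to 5 port digits.
    private static readonly Regex Pattern =
        new Regex("^" + Octet + @"(\." + Octet + @"){3}:(?<port>[0-9]{1,5})$");

    public static bool IsValid(string address)
    {
        if (string.IsNullOrWhiteSpace(address)) return false;

        Match m = Pattern.Match(address.Trim());
        if (!m.Success) return false;

        // Five digits can still exceed the largest valid port number.
        int port = int.Parse(m.Groups["port"].Value);
        return port >= 1 && port <= 65535;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("192.168.1.10:8080")); // prints True
        Console.WriteLine(IsValid("256.1.1.1:80"));      // prints False
        Console.WriteLine(IsValid("10.0.0.1:70000"));    // prints False
    }
}
```

Trimming inside the validator means text taken straight from InnerText can be passed in without cleaning it up first.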
