Issue with HTMLAgilityPack parsing HTML using C#


Question

I'm trying to learn HTMLAgilityPack and XPath, and I'm attempting to get a list of companies (as HTML links) from the NASDAQ website:

http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx

I currently have the following code:

using System.Diagnostics;
using System.IO;
using System.Net;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Load the HTML into the HAP document.
htmlDoc.LoadHtml(responseFromServer);

HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Debug.Write(node.InnerText);
}

// Clean up the streams and the response.
reader.Close();
dataStream.Close();
response.Close();

I used an XPath add-on for Chrome to obtain this XPath:

//*table[@id='indu_table']/tbody/tr[*]/td/b/a

When I run my project, I get an unhandled XPath exception saying the expression has an invalid token.

I'm not sure what's wrong with it. I've tried putting a number in place of the * in tr[*] above, but I still get the same error.

I've been looking at this for the last hour. Is it something simple?

Thanks

Recommended answer

Since the data comes from JavaScript, you have to parse the JavaScript rather than the HTML. The Agility Pack therefore doesn't help that much, but it does make things a bit easier. The following shows how it could be done, using the Agility Pack to locate the script and Newtonsoft Json.NET to parse the JavaScript array.

using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

HtmlDocument htmlDoc = new HtmlDocument();
using (WebClient client = new WebClient())
{
  htmlDoc.Load(client.OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
}
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
  // Use a regex to capture just the array we're interested in.
  Match match = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);");
  if (match.Success)
  {
    JArray jArray = JArray.Parse(match.Groups["Array"].Value);
    foreach (JToken token in jArray.Children())
    {
      // The first element of each inner array is the ticker symbol.
      listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
    }
  }
}

To explain in a bit more detail: the data comes from one big JavaScript array on the page, var table_body = [.... Each stock is one element of that array and is itself an array:

["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]

So by parsing the array, taking the first element of each entry, and appending it to the fixed base URL, we get the same links that the page's JavaScript produces.
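
The extraction step can also be tried offline. The following is only a minimal sketch, not the answer's own code: it swaps in System.Text.Json for Json.NET so that it needs no extra packages, and it runs against a sample string built from the single row quoted above instead of the live page. The class StockLinkSketch and the helper ExtractLinks are hypothetical names introduced for illustration, and RegexOptions.Singleline is added on the assumption that the real array may span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

public static class StockLinkSketch
{
    // Hypothetical helper (not part of the original answer): pulls the
    // table_body array out of a script's text and turns the first element
    // of each row into a link.
    public static List<string> ExtractLinks(string scriptText, string baseUrl)
    {
        var links = new List<string>();

        // Singleline lets '.' cross line breaks in case the array spans lines.
        Match match = Regex.Match(
            scriptText,
            @"table_body = (?<Array>\[.+?\]);",
            RegexOptions.Singleline);
        if (!match.Success)
        {
            return links; // no array found, so return an empty list
        }

        // Assumption: the captured text is strict JSON. System.Text.Json
        // rejects single-quoted strings and trailing commas.
        using JsonDocument doc = JsonDocument.Parse(match.Groups["Array"].Value);
        foreach (JsonElement row in doc.RootElement.EnumerateArray())
        {
            // Each row is itself an array; element 0 is the ticker symbol.
            string symbol = row[0].GetString() ?? string.Empty;
            links.Add(baseUrl + symbol.ToLowerInvariant());
        }
        return links;
    }

    public static void Main()
    {
        // Sample script text built from the one row quoted in the answer.
        string script =
            "var table_body = [[\"ATVI\", \"Activision Blizzard, Inc\", " +
            "11.75, 0.06, 0.51, 3058125, 0.06, \"N\", \"N\"]];";

        foreach (string link in ExtractLinks(script, "http://www.nasdaq.com/symbol/"))
        {
            Console.WriteLine(link); // prints http://www.nasdaq.com/symbol/atvi
        }
    }
}
```

If the array on the real page is not strict JSON, the more lenient Json.NET parser used in the answer is likely the safer choice.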
