Scraping HTML from Financial Statements

Question

This is my first attempt at learning to work with HTML in Visual Studio and C#. I am using the Html Agility Pack library to do the parsing.

From this page, I am attempting to pull out information from various places and save it as correctly formatted strings.

Here is my current code (taken from: shriek):

HtmlNode tdNode = document.DocumentNode.DescendantNodes()
    .FirstOrDefault(n => n.Name == "td" && n.InnerText.Trim() == "Net Income");
if (tdNode != null)
{
    HtmlNode trNode = tdNode.ParentNode;
    foreach (HtmlNode node in trNode.DescendantNodes().Where(n => n.NodeType == HtmlNodeType.Element))
    {
        Console.WriteLine(node.InnerText.Trim());
        // Output:
        // Net Income
        // 265.00
        // 298.00
        // 601.00
        // 672.00
        // 666.00
    }
}

It works correctly, but I want to get more information and I am unsure how to search through the HTML properly. First, I would like to be able to select these numbers from the annual data as well, not only from the quarterly data (the "View" option at the top of the page).

I would also like to get the dates for each column of numbers, both quarterly and annual (the "As of ..." at the top of each column).

Also, for future projects, does Google provide an API for this?

Recommended Answer

If you take a close look at the original input HTML source, you will see that its data is organized around six main sections, each a DIV element with one of the following 'id' attributes: "incinterimdiv", "incannualdiv", "balinterimdiv", "balannualdiv", "casinterimdiv", "casannualdiv". These match the Income Statement, Balance Sheet, and Cash Flow statement, each for quarterly (interim) or annual data.
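As a minimal sketch of how those six ids could be used to switch between quarterly and annual data, the id can be assembled from the statement and the period. The enum and class names below are hypothetical helpers introduced for illustration; they are not part of Html Agility Pack, and the mapping of "interim" to the quarterly view is inferred from the id names.

```csharp
using System;

// Hypothetical helper types, introduced only for illustration.
public enum Statement { Income, Balance, CashFlow }
public enum Period { Interim, Annual }   // "interim" is assumed to be the quarterly view

public static class SectionIds
{
    // Builds one of the six section ids, e.g. "incannualdiv".
    public static string For(Statement statement, Period period)
    {
        string prefix;
        switch (statement)
        {
            case Statement.Income:   prefix = "inc"; break;
            case Statement.Balance:  prefix = "bal"; break;
            case Statement.CashFlow: prefix = "cas"; break;
            default: throw new ArgumentOutOfRangeException(nameof(statement));
        }
        string suffix = period == Period.Annual ? "annual" : "interim";
        return prefix + suffix + "div";
    }

    // XPath that selects the section's DIV anywhere in the document.
    public static string XPathFor(Statement statement, Period period)
    {
        return "//div[@id='" + For(statement, period) + "']";
    }
}
```

For example, `SectionIds.XPathFor(Statement.Income, Period.Annual)` yields `//div[@id='incannualdiv']`, which can be passed to `SelectNodes` or `SelectSingleNode`.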

Now, when you're scraping a site with Html Agility Pack, I suggest you use XPATH, which is the easiest way to get to any node inside the HTML code, without any dependency on XML, as Html Agility Pack supports plain XPATH over HTML.

XPATH has to be learned, for sure, but it is very elegant because it does so much in a single line. I know this may look old-fashioned next to the cool new C#-oriented XLinq syntax :), but XPATH is much more concise. It also lets you concentrate the bindings between your code and the input HTML in plain old strings, and avoid recompiling the code when the input source evolves (for example, when an ID changes). This makes your scraping code more robust and future-proof. You could also put the XPATH bindings in an XSL(T) file, to transform the HTML into data presented as XML.
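To illustrate the point about keeping the bindings in plain strings, here is one possible sketch that reads named XPATH expressions from a text file, so an expression can be edited without recompiling. The `name = xpath` file format and the `XPathBindings` class are assumptions made for this example, not an existing convention or library feature.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: loads "name = xpath" pairs from a plain text file.
public static class XPathBindings
{
    // Parses lines of the form "name = xpath".
    // Blank lines and lines starting with '#' are skipped.
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var bindings = new Dictionary<string, string>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0 || line.StartsWith("#")) continue;

            // Split on the FIRST '=' only, because XPATH itself contains '='.
            int eq = line.IndexOf('=');
            if (eq <= 0) continue;

            string name = line.Substring(0, eq).Trim();
            string xpath = line.Substring(eq + 1).Trim();
            bindings[name] = xpath;
        }
        return bindings;
    }

    // Reads the bindings from a file on disk.
    public static Dictionary<string, string> Load(string path)
    {
        return Parse(File.ReadLines(path));
    }
}
```

A bindings file could then contain a line such as `deferredTaxes = //div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']`, and the code would look the expression up by name before calling `SelectNodes`.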

Anyway, enough digression :) Here is some sample code that gets the financial data for a specific line title, followed by code that gets all data from all lines (from one of the six main sections):

        // Requires: using System; using System.Linq; using HtmlAgilityPack;
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.google.com/finance?q=NASDAQ:TXN&fstype=ii");

        // Note: SelectNodes returns null (not an empty collection) when nothing matches,
        // so each call below falls back to an empty sequence.

        // How to get a specific line:
        // 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
        // 2) get all TABLE elements directly under them, with the 'id' attribute set to 'fs-table'
        // 3) recursively get all TD elements whose text (whitespace-normalized) equals the given title
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[normalize-space(text()) = 'Deferred Taxes']") ?? Enumerable.Empty<HtmlNode>())
        {
            Console.WriteLine("Title:" + node.InnerHtml.Trim());

            // get all following sibling TD elements
            foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td") ?? Enumerable.Empty<HtmlNode>())
            {
                Console.WriteLine(" data:" + sibling.InnerText.Trim()); // InnerText works also for negative values
            }
        }

        // How to get all lines:
        // 1) recursively get all DIV elements with the 'id' attribute set to 'casannualdiv'
        // 2) get all TABLE elements directly under them, with the 'id' attribute set to 'fs-table'
        // 3) recursively get all TD elements whose 'class' attribute is exactly 'lft lm'
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//div[@id='casannualdiv']/table[@id='fs-table']//td[@class='lft lm']") ?? Enumerable.Empty<HtmlNode>())
        {
            Console.WriteLine("Title:" + node.InnerHtml.Trim());
            foreach (HtmlNode sibling in node.SelectNodes("following-sibling::td") ?? Enumerable.Empty<HtmlNode>())
            {
                Console.WriteLine(" data:" + sibling.InnerText.Trim());
            }
        }
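
If the values are needed as data rather than console output, the same "all lines" XPATH can be wrapped in a small function. This is only a sketch: the `StatementReader` class and `ReadRows` method are hypothetical names, and it assumes the table layout described above (title cells with class 'lft lm', followed by sibling TD cells holding the values).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, introduced only for illustration.
public static class StatementReader
{
    // Returns a map from row title to the list of cell values in that row,
    // for the section whose DIV id is given (e.g. "casannualdiv").
    public static Dictionary<string, List<string>> ReadRows(HtmlDocument doc, string sectionId)
    {
        var rows = new Dictionary<string, List<string>>();
        string xpath = "//div[@id='" + sectionId + "']/table[@id='fs-table']//td[@class='lft lm']";

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection titles = doc.DocumentNode.SelectNodes(xpath);
        if (titles == null) return rows;

        foreach (HtmlNode title in titles)
        {
            var values = new List<string>();
            HtmlNodeCollection siblings = title.SelectNodes("following-sibling::td");
            if (siblings != null)
            {
                foreach (HtmlNode sibling in siblings)
                {
                    // DeEntitize turns entities such as &amp; back into characters.
                    values.Add(HtmlEntity.DeEntitize(sibling.InnerText).Trim());
                }
            }
            rows[HtmlEntity.DeEntitize(title.InnerText).Trim()] = values;
        }
        return rows;
    }
}
```

Calling `StatementReader.ReadRows(doc, "incannualdiv")` instead of `"casannualdiv"` would read the annual income statement, and the "interim" ids would read the quarterly figures.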
