Scrape Yahoo Finance Income Statement with Python

Problem description

I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.

The data is structured in a bunch of nested HTML-tables. I am using the requests module to access it and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML-structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis with Firefox.

My code so far:

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")    # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

And then get the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this:

all_strong[16].parent.next_sibling
...

Of course, the goal is to use BeautifulSoup to search for the Name of the figure I need (in this case "Net Income") and then grab the figures themselves in the same row of the HTML-table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.

SOLUTION / EXPANSION:

The solution by @wilbur below worked, and I expanded upon it to get the values for any figure available on any of the financials pages (i.e. Income Statement, Balance Sheet, and Cash Flow Statement) for any listed company. My function is as follows:

import re
import sys

def periodic_figure_values(soup, yahoo_figure):

    values = []
    pattern = re.compile(re.escape(yahoo_figure))    # match the figure name literally

    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            # drop thousands separators; "(123)" becomes "-123"
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":    # treat a lone dash as zero
                str_value = "0"
            value = int(str_value) * 1000    # scale the figure by 1000
            values.append(value)

    return values

The yahoo_figure variable is a string. Obviously this has to be the exact same figure name as is used on Yahoo Finance. To pass the soup variable, I use the following function first:

def financials_soup(ticker_symbol, statement="is", quarterly=False):

    # "is" = income statement, "bs" = balance sheet, "cf" = cash flow statement
    if statement in ("is", "bs", "cf"):
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")

    sys.exit("Invalid financial statement code '" + statement + "' passed.")

Sample usage -- I want to get the income tax expenses of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output: [19121000000, 13973000000, 13118000000]

You could also get the date of the end of the period from the soup and create a dictionary where the dates are the keys and the figures are the values, but this would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.
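For what it's worth, the date-keyed dictionary mentioned above could be sketched roughly like this. The helper name figures_by_period and the period labels are placeholders made up for illustration; the real labels would have to be read from the header row of the statement table, which is not covered here. The values are the ones from the sample output above.

```python
def figures_by_period(period_labels, values):
    """Pair each period label with the figure reported for that period."""
    if len(period_labels) != len(values):
        raise ValueError("Expected one value per period label.")
    return dict(zip(period_labels, values))


# Placeholder labels (hypothetical); real ones would come from the table header.
period_labels = ["period 1", "period 2", "period 3"]
income_tax = [19121000000, 13973000000, 13118000000]

by_period = figures_by_period(period_labels, income_tax)
print(by_period["period 1"])
```

Raising ValueError on a length mismatch is just one possible choice; it makes a misaligned header row fail loudly instead of silently dropping a figure.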

Solution

This is made a little more difficult because the "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

In this case, values will contain the three table cells in the "Net Income" row. They can easily be converted to ints; I just liked that they kept the ',' as strings:

In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
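If you do want ints, a minimal sketch of the conversion could look like the following. The helper name parse_figure is hypothetical, and the handling of parentheses, a lone dash, and the factor of 1000 simply follows what the question's periodic_figure_values function does, so treat those as assumptions rather than anything Yahoo documents.

```python
def parse_figure(text, multiplier=1000):
    """Convert a cell string such as '53,394,000' or '(1,234)' to an int."""
    cleaned = text.strip().replace(",", "")
    if cleaned == "-":
        # Assumption: a lone dash is treated as zero, as in the question's code.
        return 0
    # Assumption: a value wrapped in parentheses is negative.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = int(cleaned) * multiplier
    return -value if negative else value


values = ['53,394,000', '39,510,000', '37,037,000']
net_income = [parse_figure(v) for v in values]
print(net_income)
```

Pass multiplier=1 to keep the numbers exactly as they appear in the table.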

When I tested it on Alphabet (GOOG), it didn't work, I believe because no Income Statement is displayed for it (https://finance.yahoo.com/q/is?s=GOOG&annual). When I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).

If you wanted to create a more dynamic script, you could use string formatting to format the url with whatever stock symbol you want, like this:

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
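A slightly more defensive variant of the same idea, as a sketch only: the helper name statement_url is hypothetical, the statement codes are the ones used in the question's financials_soup function, and urllib.parse.quote is there so that symbols containing special characters get percent-encoded.

```python
from urllib.parse import quote

VALID_STATEMENTS = ("is", "bs", "cf")  # codes taken from the question's code


def statement_url(ticker_symbol, statement="is", quarterly=False):
    """Build the statement URL for a ticker symbol (hypothetical helper)."""
    if statement not in VALID_STATEMENTS:
        raise ValueError("Invalid financial statement code: " + repr(statement))
    url = "https://finance.yahoo.com/q/{}?s={}".format(statement, quote(ticker_symbol))
    if not quarterly:
        url += "&annual"
    return url


print(statement_url("AAPL"))
print(statement_url("FB", statement="bs", quarterly=True))
```

Raising an exception instead of exiting leaves it to the caller to decide what to do with a bad statement code.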
