Scrape Yahoo Finance Income Statement with Python


Problem description

I'm trying to scrape data from income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.

The data is structured in a bunch of nested HTML-tables. I am using the requests module to access it and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML-structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis with Firefox.

My code so far:

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

and then taking the 17th element, which happens to be the one containing the figure I want, but this is far from elegant. Something like this:

all_strong[16].parent.next_sibling
...

Of course, the goal is to use BeautifulSoup to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, keeping in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.

SOLUTION / EXPANSION:

The solution by @wilbur below worked, and I expanded on it to get the values for any figure available on any of the financials pages (i.e., Income Statement, Balance Sheet, and Cash Flow Statement) for any listed company. My function is as follows:

import re
import sys

def periodic_figure_values(soup, yahoo_figure):

    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = 0
            value = int(str_value) * 1000
            values.append(value)

    return values

The yahoo_figure variable is a string. Obviously this has to be the exact same figure name as is used on Yahoo Finance. To pass the soup variable, I use the following function first:

def financials_soup(ticker_symbol, statement="is", quarterly=False):

    # "is" = income statement, "bs" = balance sheet, "cf" = cash flow statement
    if statement not in ("is", "bs", "cf"):
        sys.exit("Invalid financial statement code '" + statement + "' passed.")

    url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
    if not quarterly:
        url += "&annual"
    return BeautifulSoup(requests.get(url).text, "html.parser")

Sample usage -- I want to get the income tax expenses of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

Output: [19121000000, 13973000000, 13118000000]

You could also get the date of the end of the period from the soup and create a dictionary where the dates are the keys and the figures are the values, but this would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.
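A minimal sketch of the dictionary idea, assuming the period-end dates have already been pulled out of the soup as a list of strings in the same column order as the figures. The helper name `figures_by_period` is hypothetical, and the date strings and their ISO format are made up for illustration; the real page may format them differently.

```python
from datetime import datetime

def figures_by_period(period_ends, values, date_format="%Y-%m-%d"):
    """Pair each period-end date with the figure from the same column."""
    if len(period_ends) != len(values):
        raise ValueError("expected exactly one figure per period")
    return {
        datetime.strptime(text, date_format).date(): value
        for text, value in zip(period_ends, values)
    }

# Illustrative date strings (assumed); the figures are the sample output above.
period_ends = ["2015-09-26", "2014-09-27", "2013-09-28"]
income_tax = [19121000000, 13973000000, 13118000000]
by_period = figures_by_period(period_ends, income_tax)
```

Keying on `date` objects rather than raw strings makes it straightforward to sort the periods or pick the most recent one with `max(by_period)`.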

Solution

This is made a little more difficult because the "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

In this case, values will contain the three table cells from the "Net Income" row as strings. They can easily be converted to ints; I just liked that they kept the ',' separators.

In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
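If integers are needed, the strings can be converted along these lines. This is a sketch rather than part of the original answer: the helper name `parse_figure` is hypothetical, and the cleaning rules (strip commas, treat parentheses as a minus sign, treat a lone dash as zero, scale by 1000) are borrowed from the function in the question.

```python
def parse_figure(text, scale=1000):
    """Convert a cell string such as '53,394,000' or '(1,234)' to an int."""
    cleaned = text.strip().replace(",", "")
    if cleaned == "-":
        # A lone dash is treated as zero, as in the question's function.
        return 0
    if cleaned.startswith("(") and cleaned.endswith(")"):
        # Parentheses denote a negative figure.
        cleaned = "-" + cleaned[1:-1]
    # The question's function multiplies by 1000, so the same scale is assumed.
    return int(cleaned) * scale

values = ["53,394,000", "39,510,000", "37,037,000"]
net_income = [parse_figure(v) for v in values]
```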

When I tested it on Alphabet (GOOG), it didn't work, I believe because Yahoo doesn't display an Income Statement for it (https://finance.yahoo.com/q/is?s=GOOG&annual). When I checked Facebook (FB), the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).

If you wanted to create a more dynamic script, you could use string formatting to format the url with whatever stock symbol you want, like this:

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
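If a ticker symbol might contain characters that are not URL-safe (an index symbol starting with '^', for example), one option is to let the standard library build the query string. This is a sketch, not part of the original answer: the function name `income_statement_url` and its `annual` parameter are made up for illustration, and the URL pattern is simply the one used above.

```python
from urllib.parse import urlencode

def income_statement_url(ticker_symbol, annual=True):
    """Build the income statement URL, percent-encoding the ticker symbol."""
    query = urlencode({"s": ticker_symbol})
    url = "https://finance.yahoo.com/q/is?" + query
    if annual:
        # The page expects a bare "annual" flag with no value.
        url += "&annual"
    return url

url = income_statement_url("AAPL")
```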
