Cannot get table data - HTML


Problem description


I am trying to get the 'Earnings Announcements table' from: https://www.zacks.com/stock/research/amzn/earnings-announcements

I have tried different BeautifulSoup options, but none of them returns the table.

table = soup.find('table', attrs={'class': 'earnings_announcements_earnings_table'})

table = soup.find_all('table')

When I inspect the page in the browser, the table's elements are there.

Below is a portion of what I get for the table in the page source (JS? JSON?):

document.obj_data = {
"earnings_announcements_earnings_table"   : 
         [  [ "10/26/2017", "9/2017", "$0.06", "--", "--", "--", "--" ] ,  [ "7/27/2017", "6/2017", "$1.40", "$0.40", "<div class=\"right neg negative neg_icon showinline down\">-1.00</div>", "<div class=\"right neg negative neg_icon showinline down\">-71.43%</div>", "After Close" ] ,  [ "4/27/2017", "3/2017", "$1.03", "$1.48", "<div class=\"right pos positive pos_icon showinline up\">+0.45</div>", "<div class=\"right pos positive pos_icon showinline up\">+43.69%</div>", "After Close" ] ,  [ "2/2/2017", "12/2016", "$1.40", "$1.54", "<div class=\"right pos positive pos_icon showinline up\">+0.14</div>", "<div class=\"right pos positive pos_icon showinline up\">+10.00%</div>", "After Close" ] ,  [ "10/27/2016", "9/2016", "$0.85", "$0.52", "<div class=\"right neg negative neg_icon showinline down\">-0.33</div>", "<div class=\"right neg negative neg_icon showinline down\">-38.82%</div>", "After Close" ] ,  [ "7/28/2016", "6/2016", "$1.14", "$1.78", "<div class=\"right pos positive pos_icon showinline up\">+0.64</div>", "<div class=\"right pos positive pos_icon showinline up\">+56.14%</div>", "After Close" ] ,  [ "4/28/2016", "3/2016", "$0.61", "$1.07", "<div class=\"right pos positive pos_icon showinline up\">+0.46</div>", "<div class=\"right pos positive pos_icon showinline up\">+75.41%</div>", "After Close" ] ,  [ "1/28/2016", "12/2015", "$1.61", "$1.00", "<div class=\"right neg negative neg_icon showinline down\">-0.61</div>", "<div class=\"right neg negative neg_icon showinline down\">-37.89%</div>", "After Close" ] ,  [ "10/22/2015", "9/2015", "-$0.1", "$0.17", "<div class=\"right pos positive pos_icon showinline up\">+0.27</div>", "<div class=\"right pos positive pos_icon showinline up\">+270.00%</div>", "After Close" ] ,  [ "7/23/2015", "6/2015", "-$0.15", "$0.19", "<div class=\"right pos positive pos_icon showinline up\">+0.34</div>", "<div class=\"right pos positive pos_icon showinline up\">+226.67%</div>", "After Close" ] ,  [ "4/23/2015", "3/2015", "-$0.13", 
"-$0.12", "<div class=\"right pos positive pos_icon showinline up\">+0.01</div>", "<div class=\"right pos positive pos_icon showinline up\">+7.69%</div>", "After Close" ] ,  [ "1/29/2015", "12/2014", "$0.24", "$0.45", "<div class=\"right pos positive pos_icon showinline up\">+0.21</div>", "<div class=\"right pos positive pos_icon showinline up\">+87.50%</div>", "After Close" ] ,  [ "10/23/2014", "9/2014", "-$0.73", "-$0.95", "<div class=\"right neg negative neg_icon showinline down\">-0.22</div>", "<div class=\"right neg negative neg_icon showinline down\">-30.14%</div>", "After Close" ] ,  [ "7/24/2014", "6/2014", "-$0.13", "-$0.27", "<div class=\"right neg negative neg_icon showinline down\">-0.14</div>", "<div class=\"right neg negative neg_icon showinline down\">-107.69%</div>", "After Close" ] ,  [ "4/24/2014", "3/2014", "$0.22", "$0.23", "<div class=\"right pos positive pos_icon showinline up\">+0.01</div>", "<div class=\"right pos positive pos_icon showinline up\">+4.55%</div>", "After Close" ] ,  [ "1/30/2014", "12/2013", "$0.68", "$0.51", "<div class=\"right neg negative neg_icon showinline down\">-0.17</div>", "<div class=\"right neg negative neg_icon showinline down\">-25.00%</div>", "After Close" ] ,  [ "10/24/2013", "9/2013", "-$0.09", "-$0.09", "<div class=\"right pos_na showinline\">0.00</div>", "<div class=\"right pos_na showinline\">0.00%</div>", "After Close" ] ,  [ "7/25/2013", "6/2013", "$0.04", "-$0.02", "<div class=\"right neg negative neg_icon showinline down\">-0.06</div>", "<div class=\"right neg negative neg_icon showinline down\">-150.00%</div>", "After Close" ] ,  [ "4/25/2013", "3/2013", "$0.10", "$0.18", "<div class=\"right pos positive pos_icon showinline up\">+0.08</div>", "<div class=\"right pos positive pos_icon showinline up\">+80.00%</div>", "After Close" ] ,  [ "1/29/2013", "12/2012", "$0.28", "$0.21", "<div class=\"right neg negative neg_icon showinline down\">-0.07</div>", "<div class=\"right neg negative neg_icon showinline 
down\">-25.00%</div>", "After Close" ] ,  [ "10/25/2012", "9/2012", "-$0.08", "-$0.23", "<div class=\"right neg negative neg_icon showinline down\">-0.15</div>", "<div class=\"right neg negative neg_icon showinline down\">-187.50%</div>", "After Close" ] ,  [ "7/26/2012", "6/2012", "--", "--", "--", "--", "After Close" ] ,  [ "4/26/2012", "3/2012", "--", "--", "--", "--", "After Close" ] ,  [ "1/31/2012", "12/2011", "--", "--", "--", "--", "After Close" ] ,  [ "10/25/2011", "9/2011", "--", "--", "--", "--", "After Close" ] ,  [ "7/26/2011", "6/2011", "--", "--", "--", "--", "After Close" ] ,  [ "4/26/2011", "3/2011", "--", "--", "--", "--", "--" ] ,  [ "1/27/2011", "12/2010", "--", "--", "--", "--", "After Close" ] ,  [ "10/21/2010", "9/2010", "--", "--", "--", "--", "After Close" ] ,  [ "7/22/2010", "6/2010", "--", "--", "--", "--", "After Close" ] ,  [ "4/22/2010", "3/2010", "--", "--", "--", "--", "After Close" ] ,  [ "1/28/2010", "12/2009", "--", "--", "--", "--", "After Close" ] ,  [ "10/22/2009", "9/2009", "--", "--", "--", "--", "After Close" ] ,  [ "7/23/2009", "6/2009", "--", "--", "--", "--", "After Close" ]  ]

How could I get this table? Thanks!

Solution

The solution is to parse the raw HTML document with Python's string and regular-expression functions instead of BeautifulSoup: the data is not in HTML tags but inside JavaScript code embedded in the page.
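A quick way to see this is to probe a page for <table> tags and for the text inside <script> elements. The sketch below uses only the standard library. SAMPLE_PAGE is a hypothetical, heavily shortened stand-in for the downloaded page (its layout is an assumption; only the key name and the row come from the question), and TagProbe is an illustrative helper, not part of any library.

```python
from html.parser import HTMLParser


class TagProbe(HTMLParser):
    """Counts <table> tags and collects the text inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.table_count = 0
        self.in_script = False
        self.script_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Only keep text that appears between <script> and </script>.
        if self.in_script:
            self.script_chunks.append(data)


# Hypothetical stand-in for the page source (assumed layout).
SAMPLE_PAGE = """<html><body>
<script>
document.obj_data = {
"earnings_announcements_earnings_table" : [ [ "10/26/2017", "9/2017", "$0.06", "--", "--", "--", "--" ] ]
};
</script>
</body></html>"""

probe = TagProbe()
probe.feed(SAMPLE_PAGE)
probe.close()
script_text = "".join(probe.script_chunks)

print("table tags found:", probe.table_count)
print("key found in script text:",
      '"earnings_announcements_earnings_table"' in script_text)
```

If a real page behaves like this sample, a tag-based search has nothing to return, while a plain text search on the source finds the data.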

The code below extracts the JS array stored under "earnings_announcements_earnings_table". Because this array contains only nested lists of string literals, its syntax is also valid as a Python list literal, so I parse it with ast.literal_eval. The result is a list you can loop over, and it contains the data from all pages of the table.

import ast
import re
import urllib.request

# Send a browser-like User-Agent header with the request.
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'}
req = urllib.request.Request('https://www.zacks.com/stock/research/amzn/earnings-announcements', None, user_agent)
# Decode the response bytes to text (UTF-8 is assumed here).
source = urllib.request.urlopen(req).read().decode('utf-8')

# Drop everything up to and including the key that holds the table data.
match = re.search(r'"earnings_announcements_earnings_table"\s+:', source, flags=re.IGNORECASE)
if match:
    source = source[match.end():]

# Drop everything from the next key onward, leaving only the array literal.
match = re.search(r'"earnings_announcements_webcasts_table"', source, flags=re.IGNORECASE)
if match:
    source = source[:match.start()]

# Strip surrounding whitespace and the trailing comma, then parse the
# array as a Python literal.
result = ast.literal_eval(source.strip('\r\n\t, '))
print(result)
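The array shown in the question uses double-quoted strings, so it also appears to be valid JSON. Under that assumption, a possible alternative to ast.literal_eval is json.JSONDecoder.raw_decode, which parses a single value starting at a given index and ignores whatever follows, so the second search for the next key is not needed. The sketch below also strips the <div> wrappers from the cells. SAMPLE_SOURCE is a shortened stand-in built from two rows of the question (the real page is not fetched), and extract_rows and strip_markup are hypothetical helper names.

```python
import json
import re

# Shortened stand-in for the page source; the two rows are copied from
# the question, the rest of the layout is an assumption.
SAMPLE_SOURCE = r'''
document.obj_data = {
"earnings_announcements_earnings_table"   :
         [  [ "10/26/2017", "9/2017", "$0.06", "--", "--", "--", "--" ] ,  [ "7/27/2017", "6/2017", "$1.40", "$0.40", "<div class=\"right neg negative neg_icon showinline down\">-1.00</div>", "<div class=\"right neg negative neg_icon showinline down\">-71.43%</div>", "After Close" ]  ]
, "earnings_announcements_webcasts_table" : [ ]
}
'''

# The trailing \s* matters: raw_decode does not skip leading whitespace,
# so the match must end exactly at the opening bracket.
KEY_PATTERN = re.compile(r'"earnings_announcements_earnings_table"\s*:\s*')
TAG_PATTERN = re.compile(r'<[^>]+>')


def extract_rows(source):
    """Return the array stored under the key, or None if the key is absent."""
    match = KEY_PATTERN.search(source)
    if match is None:
        return None
    # Parse one JSON value starting at match.end(); trailing text is ignored.
    rows, _end = json.JSONDecoder().raw_decode(source, match.end())
    return rows


def strip_markup(rows):
    """Remove HTML tags from every cell, keeping only the visible text."""
    return [[TAG_PATTERN.sub('', cell) for cell in row] for row in rows]


rows = strip_markup(extract_rows(SAMPLE_SOURCE))
for row in rows:
    print(row)
```

Matching tags with a regular expression is only reasonable here because each cell holds at most one simple <div> wrapper; richer markup would call for a real HTML parser.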

Let me know if you need any clarification.
