Web Scraping HTML Table Using Python


Question

I think I'm really close, so any help would be appreciated. I'm trying to scrape the Index and Value data from the table titled "Stock Market Activity" on the NASDAQ homepage:

def get_index_prices(NASDAQ_URL):
    html = urlopen(NASDAQ_URL).read()    
    soup = BeautifulSoup(html, "lxml")      
    for row in soup('table', {'class': 'genTable thin'})[0].tbody('tr'):
        tds = row('td')
        print "Index: %s, Value: %s" % (tds[0].text, tds[1].text)


print get_index_prices('http://www.nasdaq.com/')

Error reads:

list index out of range

Solution

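This table is rendered by JavaScript. Judging from the page source before JavaScript runs, the static HTML has no <tbody> or <td> elements, and the class "genTable thin" belongs to the <div> rather than to a <table>, so soup('table', {'class': 'genTable thin'}) returns an empty list and indexing it with [0] raises the error. The page source looks like this: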

<div id="HomeIndexTable" class="genTable thin">
    <table id="indexTable" class="floatL marginB5px">
        <thead>
        <tr>
            <th>Index</th>
            <th>Value</th>
            <th>Change Net / %</th>
        </tr>
        </thead>
        <script type="text/javascript">
            //<![CDATA[

                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
                nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
                nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100","4648.83","-21.93","0.47","","4681.23","4648.83");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 PMI","4675.49","4.73","0.10","","4681.98","4675.49");
                nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 AHI","4647.33","-1.50","0.03","","4670.76","4647.26");
                nasdaqHomeIndexChart.storeIndexInfo("Russell 1000","1153.55","-4.85","0.42","","1161.51","1153.54");
                nasdaqHomeIndexChart.storeIndexInfo("Russell 2000","1161.86","-3.76","0.32","","1167.65","1159.66");
                nasdaqHomeIndexChart.storeIndexInfo("FTSE All-World ex-US*","271.15","-0.23","0.08","","272.33","271.13");
                nasdaqHomeIndexChart.storeIndexInfo("FTSE RAFI 1000*","9045.08","-34.52","0.38","","9109.74","9044.91");
            //]]>
            nasdaqHomeIndexChart.displayIndexes();
        </script>
    </table>
</div>
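
As a side note that is not part of the original answer: the index names and values already appear in the static source as arguments to the storeIndexInfo(...) calls, so one possible shortcut is to pull them out with a regular expression instead of rendering the page. The sketch below assumes the calls keep the double-quoted argument format shown above; the names CALL_RE and parse_index_info are made up for illustration.

```python
import re

# Matches the first two double-quoted arguments of each storeIndexInfo(...) call.
# Assumption: the page keeps the argument layout shown in the source above.
CALL_RE = re.compile(r'storeIndexInfo\(\s*"([^"]*)"\s*,\s*"([^"]*)"')


def parse_index_info(html):
    """Return a list of (index_name, value) tuples found in the page source."""
    return CALL_RE.findall(html)


sample = '''
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
'''

for name, value in parse_index_info(sample):
    print("Index: %s, Value: %s" % (name, value))
```

In practice the html argument would be the decoded text of the downloaded page. This approach breaks as soon as the site changes how it embeds the data, so treat it as fragile.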

You can use Selenium for scraping. Selenium drives a real browser, so it executes the JavaScript and sees the rendered table.
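
A minimal sketch of the Selenium approach, under several assumptions: Selenium 4 is installed, Firefox with geckodriver is available, and the rendered rows end up as <tr>/<td> elements inside the element with id "indexTable" (the static source shows that id, but the rendered layout is not shown in the answer). The helper name format_rows is hypothetical.

```python
def format_rows(rows):
    """Format lists of cell texts as 'Index: ..., Value: ...' lines.

    Rows with fewer than two cells (for example header rows, which use <th>
    and therefore yield no <td> texts) are skipped.
    """
    return ["Index: %s, Value: %s" % (cells[0], cells[1])
            for cells in rows if len(cells) >= 2]


def get_index_prices(url, timeout=10):
    # Imported here so format_rows stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one data cell.
        # Assumption: rendered cells are <td> elements under #indexTable.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#indexTable td"))
        )
        rows = []
        for tr in driver.find_elements(By.CSS_SELECTOR, "#indexTable tr"):
            rows.append([td.text for td in tr.find_elements(By.TAG_NAME, "td")])
        return format_rows(rows)
    finally:
        driver.quit()


# Example call (opens a browser window):
# for line in get_index_prices('http://www.nasdaq.com/'):
#     print(line)
```

The explicit wait matters here: reading the table immediately after driver.get() can hit the same empty result as the original code if the script has not finished running.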
