Web Scraping HTML Table Using Python
Problem description
I think I'm really close, so any help would be appreciated. I'm trying to scrape the Index and Value data from the table titled "Stock Market Activity" on the NASDAQ homepage:
    def get_index_prices(NASDAQ_URL):
        html = urlopen(NASDAQ_URL).read()
        soup = BeautifulSoup(html, "lxml")
        for row in soup('table', {'class': 'genTable thin'})[0].tbody('tr'):
            tds = row('td')
            print "Index: %s, Value: %s" % (tds[0].text, tds[1].text)

    print get_index_prices('http://www.nasdaq.com/')
Error reads:
list index out of range
Solution
This table is rendered by JavaScript. If you look at the page source as it is before the JavaScript runs, the table looks like this:
    <div id="HomeIndexTable" class="genTable thin">
      <table id="indexTable" class="floatL marginB5px">
        <thead>
          <tr>
            <th>Index</th>
            <th>Value</th>
            <th>Change Net / %</th>
          </tr>
        </thead>
        <script type="text/javascript">
        //<![CDATA[
          nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
          nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
          nasdaqHomeIndexChart.storeIndexInfo("S&P 500","2079.36","-10.05","0.48","","2094.32","2079.34");
          nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100","4648.83","-21.93","0.47","","4681.23","4648.83");
          nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 PMI","4675.49","4.73","0.10","","4681.98","4675.49");
          nasdaqHomeIndexChart.storeIndexInfo("NASDAQ-100 AHI","4647.33","-1.50","0.03","","4670.76","4647.26");
          nasdaqHomeIndexChart.storeIndexInfo("Russell 1000","1153.55","-4.85","0.42","","1161.51","1153.54");
          nasdaqHomeIndexChart.storeIndexInfo("Russell 2000","1161.86","-3.76","0.32","","1167.65","1159.66");
          nasdaqHomeIndexChart.storeIndexInfo("FTSE All-World ex-US*","271.15","-0.23","0.08","","272.33","271.13");
          nasdaqHomeIndexChart.storeIndexInfo("FTSE RAFI 1000*","9045.08","-34.52","0.38","","9109.74","9044.91");
        //]]>
          nasdaqHomeIndexChart.displayIndexes();
        </script>
      </table>
    </div>
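Because the index values already appear in the static source as arguments to storeIndexInfo, they can also be read without running any JavaScript. The following is only a rough sketch of that idea, not something the answer proposes: the helper name parse_index_info and the two regular expressions are made up for illustration, and it assumes the first two arguments of each call are the index name and its value, as the table headers suggest.

```python
import re

# Two of the calls from the page source shown above, used as sample input.
PAGE_SOURCE = '''
nasdaqHomeIndexChart.storeIndexInfo("NASDAQ","5053.75","-20.52","0.40","1,938,573,902","5085.22","5053.75");
nasdaqHomeIndexChart.storeIndexInfo("DJIA","17663.54","-92.26","0.52","","17799.96","17662.87");
'''

# Matches one storeIndexInfo(...) call and captures its argument list.
CALL_RE = re.compile(r'storeIndexInfo\((.*?)\);')
# Matches one double-quoted argument (possibly empty) inside that list.
ARG_RE = re.compile(r'"([^"]*)"')


def parse_index_info(source):
    """Return (index_name, value) pairs found in storeIndexInfo calls.

    Assumption: argument 0 is the index name and argument 1 is its value.
    """
    rows = []
    for call in CALL_RE.finditer(source):
        args = ARG_RE.findall(call.group(1))
        if len(args) >= 2:
            rows.append((args[0], args[1]))
    return rows


for name, value in parse_index_info(PAGE_SOURCE):
    print("Index: %s, Value: %s" % (name, value))
```

In practice the string passed to parse_index_info would be the HTML returned by urlopen(...).read(), decoded to text. This approach is brittle: it stops working as soon as the site changes how the script embeds the data.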
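Note that in this raw source the class "genTable thin" is on the div, not on a table, and the table contains no td cells yet, which is consistent with the "list index out of range" error. You can use Selenium for scraping: it drives a real browser, so the JavaScript runs before you read the page.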
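A minimal sketch of the Selenium route might look like the following. Several things here are assumptions rather than facts from the page: that headless Chrome and its driver are installed, that the script ends up inserting ordinary tr/td rows into the element with id indexTable (hence the "#indexTable td" wait selector), and the helper names fetch_rendered_html, CellCollector and extract_rows, which are invented for illustration. The row extraction uses the standard library parser only so the example stays self-contained; the rendered HTML could equally be handed to BeautifulSoup as in the question.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, wait_css="#indexTable td", timeout=10):
    """Load url in headless Chrome and return the DOM after scripts have run.

    Assumption: wait_css matches a cell that only exists once the table
    has been filled in by JavaScript.
    """
    # Imported lazily so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one data cell is present in the page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


class CellCollector(HTMLParser):
    """Collects the text of <td> cells, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they are skipped
                self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return a list of rows, each a list of <td> cell texts."""
    collector = CellCollector()
    collector.feed(html)
    return collector.rows


# Hypothetical rendered markup, used here in place of a live browser session.
SAMPLE_RENDERED = (
    '<table id="indexTable"><thead><tr><th>Index</th><th>Value</th></tr></thead>'
    '<tbody><tr><td>NASDAQ</td><td>5053.75</td></tr>'
    '<tr><td>DJIA</td><td>17663.54</td></tr></tbody></table>'
)

for cells in extract_rows(SAMPLE_RENDERED):
    print("Index: %s, Value: %s" % (cells[0], cells[1]))
```

Against the live site, the call would be extract_rows(fetch_rendered_html('http://www.nasdaq.com/')). The explicit wait matters because page_source is a snapshot: reading it before the script has run returns the same empty table as urlopen.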