Python Parsing HTML Table Generated by JavaScript


Question

I'm trying to scrape a table from the NYSE website (http://www1.nyse.com/about/listed/IPO_Index.html) into a pandas dataframe. In order to do so, I have a setup like this:

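import pandas
import requests
from bs4 import BeautifulSoup

def htmltodf(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Collect every <table> element present in the static page source
    tables = soup.find_all('table')
    test = pandas.read_html(str(tables))

    return test  # a list of DataFrames, one per table found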

However, when I run this on the page, all of the tables returned in the list are essentially empty. When I investigated further, I found that the table is generated by JavaScript. In my browser's developer tools, the table looks like any other HTML table, with the usual tags. However, viewing the page source revealed something like this instead:

<script language="JavaScript">

.
.
.

<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
... more entries here ...
,["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;

    if(year.length != 0) 
    {   

    document.write ("<table width='619' border='0' cellspacing='0' cellpadding='0'><tr><td><span class='fontbold'>");
    document.write ('2014' + " IPO Showcase"); 
    document.write ("</span></td></tr></table>"); 
    }  
</script>

Therefore, when my HTML parser goes to look for the table tag, all it can find is the if condition, and no proper tags below that would indicate content. How can I scrape this table? Is there a tag that I can search for instead of table that will reveal the content? Because the code is not in traditional html table form, how do I read it in with pandas--do I have to manually parse the data?

Solution

In this case, you need something to run that JavaScript code for you.
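To see why a static parse comes up empty, here is a minimal sketch using only the standard library. The `TagCounter` class and the shortened `page_source` string are made up for illustration and are not taken from the NYSE page. It shows that an HTML parser treats the body of a script element as plain text, so a table that only exists inside a `document.write` string is never reported as a tag.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the start tags the parser actually recognises."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Hypothetical, shortened stand-in for the page source shown in the question.
page_source = """<html><body>
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html"]];
if (year.length != 0) {
    document.write("<table><tr><td>" + year[0][1] + "</td></tr></table>");
}
</script>
</body></html>"""

counter = TagCounter()
counter.feed(page_source)
counter.close()

print(counter.counts.get("script", 0))  # 1: the script element itself is seen
print(counter.counts.get("table", 0))   # 0: the table only exists as script text
```

The table only becomes a real element once something executes the script, which is what a browser does.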

One option here would be to use Selenium:

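from io import StringIO

from pandas import read_html
from selenium import webdriver
from selenium.webdriver.common.by import By

# Let a real browser load the page so its JavaScript runs
driver = webdriver.Firefox()
driver.get('http://www1.nyse.com/about/listed/IPO_Index.html')

# Select the parent of the nested table so innerHTML includes the <table> tag itself
table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

# read_html returns a list of DataFrames; take the first
df = read_html(StringIO(table_html))[0]
print(df)

driver.quit()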

prints:

                                                    0        1          2   3
0                                                Name   Symbol        NaT NaN
1                       Performance Sports Group Ltd.      PSG 2014-06-20 NaN
2                           Century Communities, Inc.      CCS 2014-06-18 NaN
3                        Foresight Energy Partners LP     FELP 2014-06-18 NaN
...
79  EGShares TCW EM Long Term Investment Grade Bon...     LEMF 2014-01-08 NaN
80  EGShares TCW EM Short Term Investment Grade Bo...     SEMF 2014-01-08 NaN

[81 rows x 4 columns]
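If starting a browser is more than you need, another possibility is to read the `var year = [...]` array straight out of the page source, since the question shows the data sitting there as a literal. The following is only a sketch under assumptions: the helper name `extract_js_array` and the `sample` string are hypothetical, and it assumes the array literal is also valid JSON (double-quoted strings, no trailing commas), as it is in the excerpt from the question.

```python
import json
import re


def extract_js_array(html, var_name="year"):
    """Return the array literal assigned to `var <var_name>` as Python lists.

    Assumes the JavaScript literal is also valid JSON.
    """
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";" and the if block).
    rows, _end = json.JSONDecoder().raw_decode(html, match.end())
    return rows


# Hypothetical sample modelled on the source excerpt in the question.
sample = '''
<script>
var year = [["ICC","21st Century Oncology Holdings, Inc.","22 May  2014","/about/listed/icc.html" ],
["ZOES","Zoe's Kitchen, Inc.","11 Apr 2014","/about/listed/zoes.html" ]] ;
if(year.length != 0) { document.write("<table>"); }
</script>
'''

rows = extract_js_array(sample)
print(rows[0][0], rows[1][1])  # ICC Zoe's Kitchen, Inc.
```

The resulting list of rows can be handed to `pandas.DataFrame(rows, columns=[...])`. The column names are yours to choose, because the page source does not name them. If the real page uses single-quoted strings or other non-JSON syntax, this approach would need a JavaScript-aware parser instead.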
